diff --git a/keps/sig-node/0014-runtime-class.md b/keps/sig-node/0014-runtime-class.md new file mode 100644 index 000000000..1370875f5 --- /dev/null +++ b/keps/sig-node/0014-runtime-class.md @@ -0,0 +1,323 @@ +--- +kep-number: 14 FIXME(13) +title: Runtime Class +authors: + - "@tallclair" +owning-sig: sig-node +participating-sigs: + - sig-architecture +reviewers: + - TBD +approvers: + - TBD +editor: TBD +creation-date: 2018-06-19 +status: provisional +--- + +# Runtime Class + +## Table of Contents + +* [Summary](#summary) +* [Motivation](#motivation) + * [Goals](#goals) + * [Non\-Goals](#non-goals) + * [User Stories](#user-stories) +* [Proposal](#proposal) + * [API](#api) + * [Runtime Handler](#runtime-handler) + * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts) + * [Implementation Details](#implementation-details) + * [Risks and Mitigations](#risks-and-mitigations) +* [Graduation Criteria](#graduation-criteria) +* [Implementation History](#implementation-history) +* [Appendix](#appendix) + * [Examples of runtime variation](#examples-of-runtime-variation) + +## Summary + +`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the +control plane. RuntimeClasses are assigned to pods through a `runtimeClassName` field on the +`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node. + +## Motivation + +There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the +primary motivator for this right now, with both Kata Containers and gVisor looking to integrate with +Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also +require support in the future. RuntimeClass provides a way to select between different runtimes +configured in the cluster and surface their properties (both to the cluster & the user).
+ +In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to +the control plane level, including: accounting for runtime overhead, scheduling to nodes that +support the runtime, and surfacing which optional features are supported by different +runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a +cluster-scoped resource tied to the runtime that can help solve these problems in a future update. + +[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit + +### Goals + +- Provide a mechanism for surfacing container runtime properties to the control plane +- Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired + runtime + +### Non-Goals + +- RuntimeClass is NOT RuntimeComponentConfig. +- RuntimeClass is NOT a general policy mechanism. +- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general + RuntimeClass should not be a cross product of runtime properties and node properties. + +The following goals are out-of-scope for the initial implementation, but may be explored in a future +iteration: + +- Surfacing support for optional features by runtimes, and surfacing errors caused by + incompatible features & runtimes earlier. +- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the + cluster admin or provider), and are asserted to be an accurate representation of the runtime. +- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster + (different runtime configurations on different nodes) through scheduling primitives like + `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and + automatic runtime-aware scheduling is out-of-scope. 
+- Define standardized or conformant runtime classes - although I would like to declare some +  predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP. +- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice, +  the details of how pod resource overhead will be implemented are out of scope for this KEP. +- Provide a mechanism to dynamically register or provision additional runtimes. +- Requiring specific RuntimeClasses according to policy. This should be addressed by other +  cluster-level policy mechanisms, such as PodSecurityPolicy. +- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and +  letting the system match an appropriate RuntimeClass, rather than explicitly assigning a +  RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a +  future iteration. + +[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit + +### User Stories + +- As a cluster operator, I want to provide multiple runtime options to support a wide variety of +  workloads. Examples include native Linux containers, "sandboxed" containers, and Windows +  containers. +- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For +  example, rolling out an update with backwards incompatible changes or previously unsupported +  features. +- As an application developer, I want to select the runtime that best fits my workload. +- As an application developer, I don't want to study the nitty-gritty details of different runtime +  implementations, but rather choose from pre-configured classes. +- As an application developer, I want my application to be portable across clusters that use similar +  but different variants of a "class" of runtimes.
+ +## Proposal + +The initial design includes: + +- `RuntimeClass` API resource definition +- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with +- Kubelet implementation for fetching & interpreting the RuntimeClass +- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler). + +### API + +`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group. + +> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired. +> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_ + +_(This is a simplified declaration; syntactic details will be covered in the API PR review)_ + +```go +type RuntimeClass struct { +    metav1.TypeMeta +    // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class. +    // Namespace should be left blank. +    metav1.ObjectMeta + +    Spec RuntimeClassSpec +} + +type RuntimeClassSpec struct { +    // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container +    // creation. The possible values are specific to a given configuration & CRI implementation. +    // The empty string is equivalent to the default behavior. +    // +optional +    RuntimeHandler string +} +``` + +A pod selects its runtime by specifying the RuntimeClass name in its PodSpec. Once the pod is +scheduled, the RuntimeClass cannot be changed. + +```go +type PodSpec struct { +    ... +    // RuntimeClassName refers to a RuntimeClass object with the same name, +    // which should be used to run this pod. +    // +optional +    RuntimeClassName string +    ... +} +``` + +The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards +compatible with current Kubernetes. This means that the legacy runtime does not specify any +RuntimeHandler or perform any feature validation (all features are "supported"). + +```go +const ( +    // RuntimeClassNameLegacy is a reserved RuntimeClass name.
The legacy + // RuntimeClass does not specify a runtime handler or perform any + // feature validation. + RuntimeClassNameLegacy = "legacy" +) +``` + +An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is +not defaulted to `legacy` (to leave room for configurable defaults in a future update). + +#### Runtime Handler + +The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`: + +```proto +message RunPodSandboxRequest { + // Configuration for creating a PodSandbox. + PodSandboxConfig config = 1; + // Named runtime configuration to use for this PodSandbox. + string RuntimeHandler = 2; +} +``` + +The RuntimeHandler is provided as a mechanism for CRI implementations to select between different +predetermined configurations. The initial use case is replacing the experimental pod annotations +currently used for selecting a sandboxed runtime by various CRI implementations: + +| CRI Runtime | Pod Annotation | +| ------------|-------------------------------------------------------------| +| CRIO | io.kubernetes.cri-o.TrustedSandbox: "false" | +| containerd | io.kubernetes.cri.untrusted-workload: "true" | +| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"
runtime.frakti.alpha.kubernetes.io/Unikernel: "true" | +| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" | + +These implementations could stick with this binary scheme ("trusted" and "untrusted"), but the preferred +approach is a non-binary one wherein arbitrary handlers can be configured with a name that can be +matched against the specified RuntimeHandler. For example, containerd might have a configuration +corresponding to a "kata-runtime" handler: + +``` +[plugins.cri.containerd.kata-runtime] +  runtime_type = "io.containerd.runtime.v1.linux" +  runtime_engine = "/opt/kata/bin/kata-runtime" +  runtime_root = "" +``` + +This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection +(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox +types (e.g. `kata-containers` or `gvisor` RuntimeClasses). + +### Versioning, Updates, and Rollouts + +Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha +implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**, +thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods +must be updated to reference the new RuntimeClass, and comes with the advantage of native support +for rolling updates through the same mechanisms as any other application update. The +`RuntimeClassName` pod field is also immutable post scheduling. + +This conservative approach is preferred since it's much easier to relax constraints in a backwards +compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass +to beta. + +### Implementation Details + +The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is +added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache.
Once +resolved, the RuntimeHandler field is passed to the CRI as part of the +[RunPodSandboxRequest][]. At that point, the interpretation of the RuntimeHandler is left to the +CRI implementation, but it should be cached if needed for subsequent calls. + +If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will +be rejected in admission (controller to be detailed in a following update). If the RuntimeClass +cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the Kubelet will fail +the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If +the `RuntimeHandler` is not recognized by the CRI implementation, then `RunPodSandbox` will return +an error. + +[RunPodSandboxRequest]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344 + +### Risks and Mitigations + +**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default +dumping ground for every new feature exposed by the node. For each feature, careful consideration +should be given to whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The +[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features. + +**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for +PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying +runtime implementation should be extremely limited (generally only around updates & rollouts). To +enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to +restrict a user to a specific RuntimeClass, you must use another policy mechanism such as +PodSecurityPolicy.
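
To make the resolution flow from the Implementation Details section above concrete, here is a minimal Go sketch. The types and the `resolveRuntimeHandler` function are hypothetical simplifications for illustration — not the actual Kubelet code, which resolves against an Informer-backed lister of generated API types:

```go
package main

import (
	"errors"
	"fmt"
)

// RuntimeClass mirrors the simplified API declaration above.
// (Hypothetical illustration type, not the real generated API type.)
type RuntimeClass struct {
	Name           string
	RuntimeHandler string
}

// RuntimeClassNameLegacy is the reserved name defined in the API section.
const RuntimeClassNameLegacy = "legacy"

var errUnresolvedRuntimeClass = errors.New("RuntimeClass not found")

// resolveRuntimeHandler maps a pod's RuntimeClassName to the CRI
// RuntimeHandler using a local cache of RuntimeClass objects (standing in
// for the Kubelet's Informer-backed cache). An unset name or "legacy"
// resolves to the empty string, i.e. the CRI's default behavior.
func resolveRuntimeHandler(runtimeClassName string, cache map[string]RuntimeClass) (string, error) {
	if runtimeClassName == "" || runtimeClassName == RuntimeClassNameLegacy {
		return "", nil
	}
	rc, ok := cache[runtimeClassName]
	if !ok {
		// Corresponds to the failure mode above: the Kubelet fails the pod
		// if the class cannot be resolved when RunPodSandbox would be called.
		return "", fmt.Errorf("%w: %q", errUnresolvedRuntimeClass, runtimeClassName)
	}
	return rc.RuntimeHandler, nil
}

func main() {
	cache := map[string]RuntimeClass{
		"kata-containers": {Name: "kata-containers", RuntimeHandler: "kata-runtime"},
	}
	// The resolved handler would be set on the RunPodSandboxRequest.
	handler, err := resolveRuntimeHandler("kata-containers", cache)
	fmt.Println(handler, err)
	// An unresolvable class yields an error, and the pod is failed.
	_, err = resolveRuntimeHandler("gvisor", cache)
	fmt.Println(err)
}
```

The sketch only shows the mapping and failure cases; admission-time rejection and pod-failure handling happen in separate control-plane and Kubelet paths.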
+ +**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity +of runtime configuration from most users (aside from the cluster admin or provisioner). However, we +are still side-stepping the issue of precisely defining specific types of runtimes like +"Sandboxed". That said, it is still up for debate whether precisely defining such runtime categories +is even possible. RuntimeClass allows us to decouple this specification from the implementation, but +it is still something I hope we can address in a future iteration through the concept of pre-defined +or "conformant" RuntimeClasses. + +**Non-portability.** We are already in a world of non-portability for many features (see [examples +of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can help +address this issue by formally declaring supported features, or automatically matching the runtime that supports a +given workload. Another issue is that pods need to refer to a RuntimeClass by name, +which may not be defined in every cluster. This is something that can be addressed through +pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible +RuntimeClasses. + +## Graduation Criteria + +Alpha: + +- Everything described in the current proposal +- [CRI validation test][cri-validation] + +[cri-validation]: https://github.com/kubernetes-incubator/cri-tools/blob/master/docs/validation.md + +Beta: + +- Major runtimes support RuntimeClass +- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass +- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary. +- The cluster admin can choose which RuntimeClass is the default in a cluster. +- Additional requirements TBD + +## Implementation History + +- 2018-06-11: SIG-Node decision to move forward with proposal +- 2018-06-19: Initial KEP published.
+ +## Appendix + +### Examples of runtime variation + +- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods, +  but those are mutually exclusive, and support of either is not required by the runtime. The +  default configuration is also not well defined. +- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is +  defined by the runtime, and support is not guaranteed. +- Windows containers - isolation features are very OS-specific, and most of the current features are +  limited to Linux. As we build out Windows container support, we'll need to add Windows-specific +  features as well. +- Host namespaces (Network, PID, IPC) may not be supported by virtualization-based runtimes +  (e.g. Kata Containers & gVisor). +- Per-pod and per-container resource overhead varies by runtime. +- Device support (e.g. GPUs) varies wildly by runtime & node. +- Supported volume types vary by node - it remains TBD whether this information belongs in +  RuntimeClass. +- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may +  have differing defaults, or support a subset of capabilities. +- `Privileged` mode is not well defined, and thus may have differing implementations. +- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed +  workloads).