---
kep-number: 14 FIXME(13)
title: Runtime Class
authors:
- "@tallclair"
owning-sig: sig-node
participating-sigs:
- sig-architecture
reviewers:
- TBD
approvers:
- TBD
editor: TBD
creation-date: 2018-06-19
status: provisional
---

# Runtime Class

## Table of Contents

* [Summary](#summary)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [API](#api)
    * [Runtime Handler](#runtime-handler)
  * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts)
  * [Implementation Details](#implementation-details)
  * [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Appendix](#appendix)
  * [Examples of runtime variation](#examples-of-runtime-variation)

## Summary

`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the
control plane. RuntimeClasses are assigned to pods through a `runtimeClass` field on the
`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node.

## Motivation

There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the
primary motivator for this right now, with both Kata Containers and gVisor looking to integrate with
Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also
require support in the future. RuntimeClass provides a way to select between different runtimes
configured in the cluster and surface their properties (both to the cluster & the user).

In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to
the control plane level, including: accounting for runtime overhead, scheduling to nodes that
support the runtime, and surfacing which optional features are supported by different
runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a
cluster-scoped resource tied to the runtime that can help solve them in a future update.

[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit

### Goals

- Provide a mechanism for surfacing container runtime properties to the control plane
- Support multiple runtimes per cluster, and provide a mechanism for users to select the desired
  runtime

### Non-Goals

- RuntimeClass is NOT RuntimeComponentConfig.
- RuntimeClass is NOT a general policy mechanism.
- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general
  RuntimeClass should not be a cross product of runtime properties and node properties.

The following goals are out of scope for the initial implementation, but may be explored in a future
iteration:

- Surfacing support for optional features by runtimes, and surfacing errors caused by
  incompatible features & runtimes earlier.
- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the
  cluster admin or provider), and are asserted to be an accurate representation of the runtime.
- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster
  (different runtime configurations on different nodes) through scheduling primitives like
  `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up, and
  automatic runtime-aware scheduling is out of scope.
- Defining standardized or conformant runtime classes - although I would like to declare some
  predefined RuntimeClasses with specific properties, doing so is out of scope for this initial KEP.
- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice,
  the details of how pod resource overhead will be implemented are out of scope for this KEP.
- Providing a mechanism to dynamically register or provision additional runtimes.
- Requiring specific RuntimeClasses according to policy. This should be addressed by other
  cluster-level policy mechanisms, such as PodSecurityPolicy.
- "Fitting" a RuntimeClass to pod requirements - in other words, specifying runtime properties and
  letting the system match an appropriate RuntimeClass, rather than explicitly assigning a
  RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a
  future iteration.

[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit

### User Stories

- As a cluster operator, I want to provide multiple runtime options to support a wide variety of
  workloads. Examples include native Linux containers, "sandboxed" containers, and Windows
  containers.
- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For example,
  rolling out an update with backwards-incompatible changes or previously unsupported features.
- As an application developer, I want to select the runtime that best fits my workload.
- As an application developer, I don't want to study the nitty-gritty details of different runtime
  implementations, but rather choose from pre-configured classes.
- As an application developer, I want my application to be portable across clusters that use similar
  but different variants of a "class" of runtimes.

## Proposal

The initial design includes:

- `RuntimeClass` API resource definition
- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with
- Kubelet implementation for fetching & interpreting the RuntimeClass
- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler)

### API

`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group.

> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired.
> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_

_(This is a simplified declaration; syntactic details will be covered in the API PR review.)_

```go
type RuntimeClass struct {
	metav1.TypeMeta
	// ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class.
	// Namespace should be left blank.
	metav1.ObjectMeta

	Spec RuntimeClassSpec
}

type RuntimeClassSpec struct {
	// RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
	// creation. The possible values are specific to a given configuration & CRI implementation.
	// The empty string is equivalent to the default behavior.
	// +optional
	RuntimeHandler string
}
```

A pod selects its runtime by specifying the RuntimeClass name in the `PodSpec`. Once the pod is
scheduled, the RuntimeClass cannot be changed.

```go
type PodSpec struct {
	...
	// RuntimeClassName refers to a RuntimeClass object with the same name,
	// which should be used to run this pod.
	// +optional
	RuntimeClassName string
	...
}
```

The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards
compatible with current Kubernetes. This means that the legacy runtime does not specify any
RuntimeHandler or perform any feature validation (all features are "supported").

```go
const (
	// RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy
	// RuntimeClass does not specify a runtime handler or perform any
	// feature validation.
	RuntimeClassNameLegacy = "legacy"
)
```

An unspecified RuntimeClassName (`""`) is equivalent to the `legacy` RuntimeClass, though the field
is not defaulted to `legacy` (to leave room for configurable defaults in a future update).
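
To make the flow concrete, here is a rough sketch of a RuntimeClass manifest and a pod referencing it. The API group/version string (`node.k8s.io/v1alpha1`) and the `sandboxed` / `kata-runtime` names are illustrative assumptions, not settled API:

```yaml
apiVersion: node.k8s.io/v1alpha1   # assumed version; final form TBD in API review
kind: RuntimeClass
metadata:
  name: sandboxed                  # hypothetical class name chosen by the cluster admin
spec:
  runtimeHandler: kata-runtime     # must match a handler configured in the CRI implementation
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-pod
spec:
  runtimeClassName: sandboxed      # selects the RuntimeClass above; "" or "legacy" = default
  containers:
  - name: app
    image: nginx
```
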

#### Runtime Handler

The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`:

```proto
message RunPodSandboxRequest {
    // Configuration for creating a PodSandbox.
    PodSandboxConfig config = 1;
    // Named runtime configuration to use for this PodSandbox.
    string RuntimeHandler = 2;
}
```

The RuntimeHandler is provided as a mechanism for CRI implementations to select between different
predetermined configurations. The initial use case is replacing the experimental pod annotations
currently used for selecting a sandboxed runtime by various CRI implementations:

| CRI Runtime | Pod Annotation                                              |
| ----------- | ----------------------------------------------------------- |
| CRI-O       | io.kubernetes.cri-o.TrustedSandbox: "false"                 |
| containerd  | io.kubernetes.cri.untrusted-workload: "true"                |
| frakti      | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" |
| windows     | experimental.windows.kubernetes.io/isolation-type: "hyperv" |

These implementations could stick with this binary scheme ("trusted" and "untrusted"), but the
preferred approach is a non-binary one wherein arbitrary handlers can be configured with a name that
can be matched against the specified RuntimeHandler. For example, containerd might have a
configuration corresponding to a "kata-runtime" handler:

```
[plugins.cri.containerd.kata-runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = "/opt/kata/bin/kata-runtime"
    runtime_root = ""
```

This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection
(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox
types (e.g. `kata-containers` or `gvisor` RuntimeClasses).

### Versioning, Updates, and Rollouts

Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha
implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**,
thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods
must be updated to reference the new RuntimeClass, and comes with the advantage of native support
for rolling updates through the same mechanisms as any other application update. The
`RuntimeClassName` pod field is also immutable post-scheduling.

This conservative approach is preferred since it's much easier to relax constraints in a
backwards-compatible way than to tighten them. We should revisit this decision prior to graduating
RuntimeClass to beta.
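
Under this immutable-spec model, a backwards-incompatible runtime update would be rolled out as a second, differently named RuntimeClass. The names, handler values, and API version below are hypothetical:

```yaml
apiVersion: node.k8s.io/v1alpha1   # assumed version
kind: RuntimeClass
metadata:
  name: kata-v1                    # existing class, left untouched during the rollout
spec:
  runtimeHandler: kata-runtime-v1
---
apiVersion: node.k8s.io/v1alpha1
kind: RuntimeClass
metadata:
  name: kata-v2                    # new class carrying the updated runtime configuration
spec:
  runtimeHandler: kata-runtime-v2
```

Pods (or their controllers) are then updated to reference `kata-v2`, riding the same rolling-update machinery as any other spec change.
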

### Implementation Details

The Kubelet uses an informer to keep a local cache of all RuntimeClass objects. When a new pod is
added, the Kubelet resolves the pod's RuntimeClass against the local RuntimeClass cache. Once
resolved, the RuntimeHandler field is passed to the CRI as part of the
[`RunPodSandboxRequest`][]. At that point, the interpretation of the RuntimeHandler is left to the
CRI implementation, but it should be cached if needed for subsequent calls.

If the RuntimeClass cannot be resolved (e.g. it doesn't exist) at pod creation time, the request
will be rejected in admission (the controller will be detailed in a following update). If the
RuntimeClass cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the
Kubelet will fail the pod. The admission check on a replica recreation will prevent the scheduler
from thrashing. If the `RuntimeHandler` is not recognized by the CRI implementation, then
`RunPodSandbox` will return an error.

[`RunPodSandboxRequest`]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344
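
The resolution step described above can be sketched as a plain Go function. This is illustrative only, not the actual Kubelet code: `runtimeClassCache` stands in for the informer-backed cache, and `resolveRuntimeHandler` is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeClassCache stands in for the Kubelet's informer-backed local cache
// of RuntimeClass objects, mapping class name -> spec.runtimeHandler.
type runtimeClassCache map[string]string

// RuntimeClassNameLegacy mirrors the reserved name from the proposal.
const RuntimeClassNameLegacy = "legacy"

// resolveRuntimeHandler maps a pod's RuntimeClassName to the handler string
// passed to the CRI in RunPodSandboxRequest. An unspecified name ("") and the
// reserved "legacy" class both map to the default handler (""). An unknown
// class is an error: per the proposal, the Kubelet fails the pod.
func resolveRuntimeHandler(cache runtimeClassCache, runtimeClassName string) (string, error) {
	if runtimeClassName == "" || runtimeClassName == RuntimeClassNameLegacy {
		return "", nil // default behavior, no handler specified
	}
	handler, ok := cache[runtimeClassName]
	if !ok {
		return "", errors.New("RuntimeClass not found: " + runtimeClassName)
	}
	return handler, nil
}

func main() {
	cache := runtimeClassCache{"kata-containers": "kata-runtime"}
	handler, err := resolveRuntimeHandler(cache, "kata-containers")
	fmt.Println(handler, err)
}
```

The real implementation would also need to handle cache staleness (the informer may not have synced yet), which this sketch ignores.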

### Risks and Mitigations

**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default
dumping ground for every new feature exposed by the node. For each feature, careful consideration
should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The
[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features.

**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for
PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying
runtime implementation should be extremely limited (generally only around updates & rollouts). To
enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to
restrict a user to a specific RuntimeClass, you must use another policy mechanism such as
PodSecurityPolicy.

**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity
of runtime configuration from most users (aside from the cluster admin or provisioner). However, we
are still side-stepping the issue of precisely defining specific types of runtimes like
"sandboxed", and it is still up for debate whether precisely defining such runtime categories is
even possible. RuntimeClass allows us to decouple this specification from the implementation, but
it is still something I hope we can address in a future iteration through the concept of pre-defined
or "conformant" RuntimeClasses.

**Non-portability.** We are already in a world of non-portability for many features (see [examples
of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can
help address this issue by formally declaring supported features, or by matching the runtime that
supports a given workload automatically. Another issue is that pods need to refer to a RuntimeClass
by name, which may not be defined in every cluster. This is something that can be addressed through
pre-defined runtime classes (see the previous risk), and/or by "fitting" pod requirements to
compatible RuntimeClasses.

## Graduation Criteria

Alpha:

- Everything described in the current proposal
- [CRI validation test][cri-validation]

[cri-validation]: https://github.com/kubernetes-incubator/cri-tools/blob/master/docs/validation.md

Beta:

- Major runtimes support RuntimeClass
- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy
  RuntimeClass
- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary
- The cluster admin can choose which RuntimeClass is the default in a cluster
- Additional requirements TBD

## Implementation History

- 2018-06-11: SIG-Node decision to move forward with proposal
- 2018-06-19: Initial KEP published

## Appendix

### Examples of runtime variation

- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods,
  but those are mutually exclusive, and support of either is not required by the runtime. The
  default configuration is also not well defined.
- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is
  defined by the runtime, and support is not guaranteed.
- Windows containers - isolation features are very OS-specific, and most of the current features
  are limited to Linux. As we build out Windows container support, we'll need to add
  Windows-specific features as well.
- Host namespaces (Network, PID, IPC) may not be supported by virtualization-based runtimes
  (e.g. Kata Containers & gVisor).
- Per-pod and per-container resource overhead varies by runtime.
- Device support (e.g. GPUs) varies wildly by runtime & node.
- Supported volume types vary by node - it remains TBD whether this information belongs in
  RuntimeClass.
- The list of default capabilities is defined in Docker, but not in Kubernetes. Future runtimes may
  have differing defaults, or support only a subset of capabilities.
- `Privileged` mode is not well defined, and thus may have differing implementations.
- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs. Guaranteed
  workloads)