---
kep-number: 14 FIXME(13)
title: Runtime Class
authors:
- "@tallclair"
owning-sig: sig-node
participating-sigs:
- sig-architecture
reviewers:
- TBD
approvers:
- TBD
editor: TBD
creation-date: 2018-06-19
status: provisional
---
# Runtime Class
## Table of Contents
* [Summary](#summary)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [API](#api)
    * [Runtime Handler](#runtime-handler)
  * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts)
  * [Implementation Details](#implementation-details)
  * [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Appendix](#appendix)
  * [Examples of runtime variation](#examples-of-runtime-variation)
## Summary
`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the
control plane. RuntimeClasses are assigned to pods through a `runtimeClassName` field on the
`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node.
## Motivation
There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the
primary motivator for this right now, with both Kata containers and gVisor looking to integrate with
Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also
require support in the future. RuntimeClass provides a way to select between different runtimes
configured in the cluster and surface their properties (both to the cluster & the user).
In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to
the control plane level, including: accounting for runtime overhead, scheduling to nodes that
support the runtime, and surfacing which optional features are supported by different
runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a
cluster-scoped resource tied to the runtime that can help solve these problems in a future update.
[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit
### Goals
- Provide a mechanism for surfacing container runtime properties to the control plane
- Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired
runtime
### Non-Goals
- RuntimeClass is NOT RuntimeComponentConfig.
- RuntimeClass is NOT a general policy mechanism.
- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general
RuntimeClass should not be a cross product of runtime properties and node properties.
The following goals are out-of-scope for the initial implementation, but may be explored in a future
iteration:
- Surfacing support for optional features by runtimes, and surfacing errors caused by
incompatible features & runtimes earlier.
- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the
cluster admin or provider), and are asserted to be an accurate representation of the runtime.
- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster
(different runtime configurations on different nodes) through scheduling primitives like
`NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and
automatic runtime-aware scheduling is out-of-scope.
- Define standardized or conformant runtime classes - although I would like to declare some
predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP.
- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice,
the details of how pod resource overhead will be implemented is out of scope for this KEP.
- Provide a mechanism to dynamically register or provision additional runtimes.
- Requiring specific RuntimeClasses according to policy. This should be addressed by other
cluster-level policy mechanisms, such as PodSecurityPolicy.
- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and
letting the system match an appropriate RuntimeClass, rather than explicitly assigning a
RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a
future iteration.
[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit
### User Stories
- As a cluster operator, I want to provide multiple runtime options to support a wide variety of
  workloads. Examples include native Linux containers, "sandboxed" containers, and Windows
  containers.
- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For example, rolling
  out a runtime update that includes backwards-incompatible changes or previously unsupported
  features.
- As an application developer, I want to select the runtime that best fits my workload.
- As an application developer, I don't want to study the nitty-gritty details of different runtime
implementations, but rather choose from pre-configured classes.
- As an application developer, I want my application to be portable across clusters that use similar
but different variants of a "class" of runtimes.
## Proposal
The initial design includes:
- `RuntimeClass` API resource definition
- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with
- Kubelet implementation for fetching & interpreting the RuntimeClass
- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler).
### API
`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group.
> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired.
> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_
_(This is a simplified declaration; syntactic details will be covered in the API PR review.)_
```go
type RuntimeClass struct {
metav1.TypeMeta
// ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class.
// Namespace should be left blank.
metav1.ObjectMeta
Spec RuntimeClassSpec
}
type RuntimeClassSpec struct {
// RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
// creation. The possible values are specific to a given configuration & CRI implementation.
// The empty string is equivalent to the default behavior.
// +optional
RuntimeHandler string
}
```
A pod selects its runtime by specifying the RuntimeClass by name in the PodSpec. Once the pod is
scheduled, the RuntimeClass cannot be changed.
```go
type PodSpec struct {
...
// RuntimeClassName refers to a RuntimeClass object with the same name,
// which should be used to run this pod.
// +optional
RuntimeClassName string
...
}
```
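For illustration, here is a minimal sketch of the two objects together, using the simplified
declarations above. The `sandboxed` class name and `kata-runtime` handler are hypothetical examples,
not names prescribed by this KEP:
```go
// A cluster admin publishes a RuntimeClass (hypothetical names):
sandboxed := RuntimeClass{
	ObjectMeta: metav1.ObjectMeta{Name: "sandboxed"},
	Spec: RuntimeClassSpec{
		// Must match a handler configured in the node's CRI implementation.
		RuntimeHandler: "kata-runtime",
	},
}

// A pod then selects that class by name (immutable once scheduled):
podSpec := PodSpec{
	RuntimeClassName: sandboxed.Name,
}
```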
The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards
compatible with current Kubernetes. This means that the legacy runtime does not specify any
RuntimeHandler or perform any feature validation (all features are "supported").
```go
const (
// RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy
// RuntimeClass does not specify a runtime handler or perform any
// feature validation.
RuntimeClassNameLegacy = "legacy"
)
```
An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is
not defaulted to `legacy` (to leave room for configurable defaults in a future update).
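The equivalence could be implemented with a small helper along these lines; this is a sketch of the
intended semantics (the function name is hypothetical), not a committed API:
```go
// effectiveRuntimeClassName resolves the pod's effective RuntimeClass:
// an unset field behaves like the reserved "legacy" class, but the stored
// field itself is not defaulted.
func effectiveRuntimeClassName(spec *PodSpec) string {
	if spec.RuntimeClassName == "" {
		return RuntimeClassNameLegacy
	}
	return spec.RuntimeClassName
}
```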
#### Runtime Handler
The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`:
```proto
message RunPodSandboxRequest {
// Configuration for creating a PodSandbox.
PodSandboxConfig config = 1;
// Named runtime configuration to use for this PodSandbox.
    string runtime_handler = 2;
}
```
The RuntimeHandler is provided as a mechanism for CRI implementations to select between different
predetermined configurations. The initial use case is replacing the experimental pod annotations
currently used for selecting a sandboxed runtime by various CRI implementations:
| CRI Runtime | Pod Annotation |
| ------------|-------------------------------------------------------------|
| CRI-O       | io.kubernetes.cri-o.TrustedSandbox: "false"                  |
| containerd | io.kubernetes.cri.untrusted-workload: "true" |
| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" |
| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" |
These implementations could stick with a binary scheme ("trusted" and "untrusted"), but the
preferred approach is a non-binary one wherein arbitrary handlers can be configured with a name that
can be matched against the specified RuntimeHandler. For example, containerd might have a
configuration corresponding to a "kata-runtime" handler:
```toml
[plugins.cri.containerd.kata-runtime]
runtime_type = "io.containerd.runtime.v1.linux"
runtime_engine = "/opt/kata/bin/kata-runtime"
runtime_root = ""
```
This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection
(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox
types (e.g. `kata-containers` or `gvisor` RuntimeClasses).
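To make the mapping concrete, a CRI implementation might resolve the requested handler against its
named configurations roughly as follows. This is a sketch only; the `criService` receiver, the
`runtimes` map, and the `runtimeConfig` type are hypothetical stand-ins, not code from containerd or
CRI-O:
```go
// resolveHandler is a sketch of mapping the RuntimeHandler in a
// RunPodSandboxRequest to one of the runtime configurations the CRI
// implementation was started with (e.g. the TOML entry above).
func (s *criService) resolveHandler(handler string) (runtimeConfig, error) {
	if handler == "" {
		// The empty string selects the implementation's default behavior.
		return s.defaultRuntime, nil
	}
	rt, ok := s.runtimes[handler] // e.g. "kata-runtime"
	if !ok {
		return runtimeConfig{}, fmt.Errorf("unknown runtime handler %q", handler)
	}
	return rt, nil
}
```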
### Versioning, Updates, and Rollouts
Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha
implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**,
thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods
must be updated to reference the new RuntimeClass, and comes with the advantage of native support
for rolling updates through the same mechanisms as any other application update. The
`RuntimeClassName` pod field is also immutable post scheduling.
This conservative approach is preferred since it's much easier to relax constraints in a backwards
compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass
to beta.
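A sketch of what enforcing spec immutability could look like in update validation, assuming
apimachinery's semantic equality helper (the function name is hypothetical):
```go
// validateRuntimeClassUpdate is a sketch of the immutability rule: any
// spec change must instead be published as a new, differently named
// RuntimeClass, which pods then roll over to referencing.
func validateRuntimeClassUpdate(oldRC, newRC *RuntimeClass) error {
	if !apiequality.Semantic.DeepEqual(oldRC.Spec, newRC.Spec) {
		return fmt.Errorf("RuntimeClass %q spec is immutable; create a new RuntimeClass instead", newRC.Name)
	}
	return nil
}
```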
### Implementation Details
The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is
added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once
resolved, the RuntimeHandler field is passed to the CRI as part of the
[`RunPodSandboxRequest`][]. At that point, the interpretation of the RuntimeHandler is left to the
CRI implementation, but it should be cached if needed for subsequent calls.
If the RuntimeClass cannot be resolved (e.g. it doesn't exist) at Pod creation time, the request
will be rejected in admission (the controller will be detailed in a following update). If the
RuntimeClass cannot be resolved by the Kubelet by the time `RunPodSandbox` is called, then the
Kubelet will fail the Pod. The admission check on replica recreation prevents the scheduler from
thrashing. If the `RuntimeHandler` is not recognized by the CRI implementation, then
`RunPodSandbox` will return an error.
[RunPodSandboxRequest]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344
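Putting the Kubelet side together, the resolution path might look roughly like this sketch; the
`runtimeClassManager` type and its lister are hypothetical stand-ins for the informer-backed cache
described above:
```go
// getRuntimeHandler is a sketch of resolving a pod's RuntimeClass against
// the Kubelet's local informer cache before calling RunPodSandbox.
func (m *runtimeClassManager) getRuntimeHandler(pod *v1.Pod) (string, error) {
	name := pod.Spec.RuntimeClassName
	if name == "" || name == RuntimeClassNameLegacy {
		return "", nil // legacy: no handler, default CRI behavior
	}
	rc, err := m.lister.Get(name) // local cache kept fresh by an Informer
	if err != nil {
		// Resolution failure at this point fails the Pod; admission should
		// normally have caught a missing RuntimeClass earlier.
		return "", fmt.Errorf("failed to resolve RuntimeClass %q: %v", name, err)
	}
	return rc.Spec.RuntimeHandler, nil
}
```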
### Risks and Mitigations
**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default
dumping ground for every new feature exposed by the node. For each feature, careful consideration
should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The
[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features.
**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for
PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying
runtime implementation should be extremely limited (generally only around updates & rollouts). To
enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to
restrict a user to a specific RuntimeClass, you must use another policy mechanism such as
PodSecurityPolicy.
**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity
of runtime configuration from most users (aside from the cluster admin or provisioner). However, we
are still side-stepping the issue of precisely defining specific types of runtimes like
"sandboxed", and it is up for debate whether precisely defining such runtime categories is even
possible. RuntimeClass allows us to decouple this specification from the implementation, but it is
still something I hope we can address in a future iteration through the concept of pre-defined or
"conformant" RuntimeClasses.
**Non-portability.** We are already in a world of non-portability for many features (see [examples
of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can help
address this issue by formally declaring supported features, or by automatically matching a runtime
that supports a given workload. Another issue is that pods need to refer to a RuntimeClass by name,
which may not be defined in every cluster. This is something that can be addressed through
pre-defined runtime classes (see the previous risk), and/or by "fitting" pod requirements to
compatible RuntimeClasses.
## Graduation Criteria
Alpha:
- Everything described in the current proposal
- [CRI validation test][cri-validation]
[cri-validation]: https://github.com/kubernetes-incubator/cri-tools/blob/master/docs/validation.md
Beta:
- Major runtimes support RuntimeClass
- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass
- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary.
- The cluster admin can choose which RuntimeClass is the default in a cluster.
- Additional requirements TBD
## Implementation History
- 2018-06-11: SIG-Node decision to move forward with proposal
- 2018-06-19: Initial KEP published.
## Appendix
### Examples of runtime variation
- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods,
but those are mutually exclusive, and support of either is not required by the runtime. The
default configuration is also not well defined.
- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is
defined by the runtime, and support is not guaranteed.
- Windows containers - isolation features are very OS-specific, and most of the current features are
  limited to Linux. As we build out Windows container support, we'll need to add Windows-specific
  features as well.
- Host namespaces (network, PID, IPC) may not be supported by virtualization-based runtimes
(e.g. Kata-containers & gVisor).
- Per-pod and Per-container resource overhead varies by runtime.
- Device support (e.g. GPUs) varies wildly by runtime & nodes.
- Supported volume types varies by node - it remains TBD whether this information belongs in
RuntimeClass.
- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may
have differing defaults, or support a subset of capabilities.
- `Privileged` mode is not well defined, and thus may have differing implementations.
- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed
workloads)