Update EphemeralContainers in apiserver not kubelet

This commit is contained in:
Lee Verberne 2018-03-21 15:54:52 +01:00
parent caa072ead7
commit 946532db82
1 changed files with 85 additions and 139 deletions

View File

@ -1,6 +1,6 @@
# Troubleshoot Running Pods
* Status: Pending
* Status: Pending Implementation
* Version: Alpha
* Implementation Owner: @verb
@ -102,33 +102,22 @@ subsequently be used to reattach and is reported by `kubectl describe`.
### Kubernetes API Changes
There has been much discussion about how this fits best into the Kubernetes API.
The consensus is for an imperative "debug this pod" action that's implemented
mostly in the kubelet. In order to avoid new dependencies in the kubelet, this
will be implemented in the Core API. Three possible implementations follow, and
additional implementations that were evaluated and dismissed are at the end of
this document.
The consensus is for an imperative "debug this pod" action whereby Kubernetes
creates a new, temporary container in a pod on command. In order to avoid new
dependencies in the kubelet, this will be implemented in the Core API. Three
possible implementations follow, and additional implementations that were
evaluated and dismissed are at the end of this document.
All of the proposed solutions implement the user-level concept of a _Debug
Container_ using the API-level concept of an _Ephemeral Container_. The API
doesn't prescribe how an Ephemeral Container is used. It could conceivably see
use other than Debug Containers, but we don't currently have other use cases.
#### Chosen Solution: POST to an Existing Pod
#### Chosen Solution: Subresource to Update PodStatus
We're modifying an existing pod, so this fits as a subresource of the target
pod. We will create a new top-level object that contains a `v1.Container` and
`POST` to the subresource to create the Debug Container. Since this is a `POST`,
we cannot upgrade the connection to streaming if we want to continue supporting
web socket clients.
Rather than using an existing subresource like `/exec` and conditionally stream
based on `PodExecOptions`, we will create a new subresource with consistent
streaming behavior. A new subresource has the added benefit of being able to
entirely hide interface entirely behind a feature flag by conditionally
registering the new subresource.
A `v1.Container` by itself lacks type and object metadata, so we will create a
new type:
An Ephemeral Container is not part of the pod specification as it's not part of
the intended state of the pod, but we describe it using the same primitives as
`PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target:
```
// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
@ -152,26 +141,37 @@ type EphemeralContainer struct {
}
```
All Ephemeral Containers that have been created in a pod are listed in the pod's
status:
```
type PodStatus struct {
...
// List of user-initiated ephemeral containers that have been run in this pod.
// +optional
EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
}
```
To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in
the API. This is accomplished via a new subresource, `/ephemeralcontainers`,
which enforces the append-only semantics and authorization. This is similar to
the `/status` subresource used by the kubelet to modify pod status.
**Note that Ephemeral Containers are not regular containers and should not be
used to build services.** They lack guarantees for resources or execution, and
many of the fields of `v1.Container` will not be allowed for Debug Containers. A
request will be rejected if any field is set other than the following
pod update will fail validation if any field is set other than the following
whitelisted fields: `Name`, `Image`, `Command`, `Args`, `WorkingDir`, `Env`,
`EnvFrom`, `ImagePullPolicy`, `SecurityContext`. `TTY` and `Stdin` are always
enabled for Debug Containers and will be ignored.
The new `/ephemeralcontainers` subresource allows the following:
1. A `POST` of a `EphemeralContainer` to
`/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers` to create an
Ephemeral Container running in pod `$POD_NAME`.
1. Support for stopping an Ephemeral Container **could be supported in the
future** by a `DELETE` of
`/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers/$NAME`.
Once created, it is the responsibility of the client to watch for the
`EphemeralContainer` to appear in the `PodStatus` and then attach to the console
of a debug container using the existing attach endpoint,
Once the pod object is updated, the kubelet worker watching this pod will launch
the Ephemeral Container and update its status. The client is expected to watch
for the creation of the container status and then attach to the console of a
debug container using the existing attach endpoint,
`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
container between its creation and subsequent attach will not be replayed and
can only be viewed using `kubectl log`.
@ -211,7 +211,7 @@ exec`. To kill an Ephemeral Container one would `attach` and exit the process
interactively or create a new Ephemeral Container to send a signal with
`kill(1)` to the original process.
#### Alternative 2: Declarative Configuration
#### Alternative 2: Ephemeral Container Controller
Using subresources is an imperative style API where the client instructs the
kubelet to perform an action, but in general Kubernetes prefers declarative APIs
@ -244,31 +244,16 @@ enforce, and SIG Node strongly prefers to minimize kubelet complexity.
### Ephemeral Container Status
`EphemeralContainer` is included in a new field in `PodStatus`:
```
type PodStatus struct {
...
// List of user-initiated ephemeral containers that have been run in this pod.
// +optional
EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
}
```
The kubelet should be able to construct a complete `PodStatus` with no prior
state using information stored in the container runtime.
`EphemeralContainer.Status` introduces no new data, but the kubelet must also
now populate `EphemeralContainer.Spec` &
`EphemeralContainer.TargetContainerName`.
The kubelet already persists container metadata as CRI
We wish for the kubelet to be able to construct `PodStatus` without relying on
prior state, so we will store `EphemeralContainer.Spec` &
`EphemeralContainer.TargetContainerName` as runtime metadata. The kubelet
currently persists container metadata as CRI
[labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606)
and
[annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613).
The entire v1.Container used in the request will be serialized and stored as a
runtime annotation. The value of `TargetContainerName` will be stored as a
runtime label. Persisting this data in the runtime means it will survive kubelet
runtime label. Persisting this data in the runtime means it survives kubelet
restarts.
At least for the Docker runtime, this is [an intended use of docker
@ -279,72 +264,45 @@ size. Because the container spec may be examined in security sensitive contexts
like admission control, we will conservatively limit the size of the spec to 32K
and add a 32K minimum label length test to runtime qualification.
`EphemeralContainer` is populated by the kubelet in the same way as regular
container statuses. This is sent to the API server and displayed by `kubectl
describe pod`.
`EphemeralContainer.Status` is populated by the kubelet in the same way as
regular container statuses. This is sent to the API server and displayed by
`kubectl describe pod`.
### Creating Debug Containers
1. `kubectl` invokes the new API as described in the preceding section.
1. The API server checks for name collisions with existing running containers
(in both `PodSpec` and `PodStatus.EphemeralContainers`), performs
validation, admission control and proxies the connection to the kubelet's
`/ephemeralContainers/$NS/$POD_NAME` endpoint.
1. Since a name collision could happen in the interval between container
creation and PodStatus being published to the API server, the kubelet
will perform an additional check for name collision.
1. It is permissible to replace an exited container with one of the same
name.
1. The kubelet request handler opens an error channel and signals the pod's
sync worker with `UpdatePodOptions` that include the `EphemeralContainer` in
a new field and a callback in the existing
`UpdatePodOptions.OnCompleteFunc`.
1. The pod sync worker runs the existing `syncPod()` with a new
`SyncPodType` of `SyncPodDebug`.
1. The request handler blocks (with timeout) on receiving an error from
`syncPod()` via the callback. During this time, `syncPod()` starts the
ephemeral container, including fetching an image if necessary, and
publishes a new `PodStatus`.
1. Timeout is configured by the cluster administrator and defaults to 2
minutes.
1. `syncPod()` again checks for container name collision and starts an
ephemeral container via the new `kuberuntime.StartEphemeralContainer()`.
1. `StartEphemeralContainer()` call uses the existing `startContainer()`
method, which gains support for targeting the namespaces of a container
by name.
1. `syncPod()` runs only from a dedicated pod worker, resolving any races
for container creation.
1. `kubectl` constructs `EphemeralContainer.Spec` and
`EphemeralContainer.TargetContainerName` based on command line arguments. It
`PUT`s the modified pod to the pod's `/ephemeralcontainers`.
1. The apiserver discards changes to fields other than
`Pod.Status.EphemeralContainers` and validates the pod update.
1. Update validation fails if existing Ephemeral Containers are removed or
changed, or if the new Ephemeral Container has a non-empty status.
1. Pod validation fails if container spec contains fields disallowed for
Ephemeral Containers, has the same name as a container in the spec, or
has the same name as another running Ephemeral Container. (see below)
1. API resource versioning resolves update races.
1. The kubelet's pod watcher notices the update and triggers a `syncPod()`.
During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
for any Ephemeral Container with an empty status.
1. `StartEphemeralContainer()` uses the existing `startContainer()` method,
which gains support for targeting the namespaces of a container by name.
1. After initial creation, future invocations of `syncPod()` will publish
its ContainerStatus but otherwise ignore the Ephemeral Container. It
will exist for the life of the pod sandbox or it exits and is garbage
collected. In no event will it be restarted.
1. `syncPod()` then finishes a regular sync, publishing an updated PodStatus
(which includes the new `EphemeralContainer`) by its normal, existing means.
The pod worker sends its exit status to the request worker.
1. The request worker receives a (hopefully `nil`) `error` and returns it to
the client.
1. `OnCompleteFunc` is not guaranteed to be called, so if the request
worker times out it will check the pod's `PodStatus` to see if the Debug
Container was started prior to returning an error. If this happens the
`EphemeralContainer.Spec` must be compared to verify it was the same one
as requested.
1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which
includes the new `EphemeralContainer`) by its normal, existing means.
1. The client performs an attach to the debug container's console.
The apiserver detects container name collisions with both containers in the pod
spec and other running Debug Containers by checking
`PodStatus.EphemeralContainers`. In a race to create two Debug Containers with
the same name, the API server will pass both requests and the kubelet will
reject one in the synchronized pod worker.
There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
### Restarting and Reattaching Debug Containers
Debug Containers will never be restarted automatically. It is possible to
replace a Debug Container that has exited by re-using a Debug Container name. It
is an error to attempt to replace a Debug Container that is still running, which
is detected by both the API server and the kubelet.
"restart" a Debug Container by by re-using the name of a Debug Container that
has exited. It is an error to re-use the name of a Debug Container that is still
running, which is detected by API server validation.
One can reattach to a Debug Container using `kubectl attach`. When supported by
a runtime, multiple clients can attach to a single debug container and share the
@ -352,7 +310,7 @@ terminal. This is supported by Docker.
### Killing Debug Containers
Debug containers will not be killed automatically until the pod (specifically,
Debug containers will not be killed automatically unless the pod (specifically,
the pod sandbox) is destroyed. Debug Containers will stop when their command
exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug
Containers will not receive an EOF if their connection is interrupted.
@ -367,23 +325,16 @@ are necessary in the kubelet:
the pod spec.
1. As an exception to the above, `SyncPod()` will kill Debug Containers when
the pod sandbox changes since a lone Debug Container in an abandoned sandbox
is not useful. Debug Containers are not automatically started in the new
is not useful. Debug Containers are not started automatically in the new
sandbox.
1. `convertStatusToAPIStatus()` must sort Debug Containers status into
`EphemeralContainerStatuses` similar to as it does for
`EphemeralContainer.Status` similar to as it does for
`InitContainerStatuses`
1. The kubelet must preserve `ContainerStatus` on debug containers for
reporting.
1. Debug Containers must be excluded from calculation of pod phase and
condition
It's worth noting some things that do not change:
1. `KillPod()` already operates on all running containers returned by the
runtime.
1. Containers created prior to this feature being enabled will have a
`containerType` of `""`. Since this does not match `"EPHEMERAL"` the special
handling of Debug Containers is backwards compatible.
`KillPod()` already operates on all running containers returned by the runtime
and requires no changes
### Security Considerations
@ -392,10 +343,8 @@ Debug Containers have no additional privileges above what is available to any
spec except that it is created on demand.
Admission plugins must be updated to guard `/ephemeralcontainers`. In
particular, they should enforce the same container image policy on the `Image`
parameter as is enforced for regular containers. During the alpha phase we will
additionally support a container image whitelist as a kubelet flag to allow
cluster administrators to easily constrain debug container images.
particular, they should enforce the same container image policy on the
`EphemeralContainer.Spec` parameter as is enforced for regular containers.
### Additional Consideration
@ -421,7 +370,7 @@ cluster administrators to easily constrain debug container images.
#### Goals and Non-Goals for Alpha Release
We're targeting an alpha release in Kubernetes 1.10 that includes the following
We're targeting an alpha release in Kubernetes 1.11 that includes the following
basic functionality:
* Support in the kubelet for creating debug containers in a running pod
@ -429,14 +378,15 @@ basic functionality:
* `kubectl describe pod` will list status of debug containers running in a pod
Functionality will be hidden behind an alpha feature flag and disabled by
default. The following are explicitly out of scope for the alpha release, but
must be resolved prior to beta release:
default.
* There's no guarantee that exited Debug Containers won't be garbage collected
as regular containers, so they may disappear from the list of
`EphemeralContainers`.
* We could improve reliability of `UpdatePodOptions.OnCompleteFunc` by
prioritizing based on `SyncPodType`.
Since the kubelet stores Debug Container metadata as runtime labels, it's lost
when Debug Containers are garbage collected. For the alpha release we will rely
on the apiserver to store the `EphemeralContainer` for garbage collected
containers. The kubelet will preserve any `EphemeralContainer` it doesn't
recognize when updating status. In the event that a `PodStatus` is lost and we
need to regenerate it from scratch, `EphemeralContainers` will only contain
Debug Containers that have not been garbage collected.
#### Kubernetes API Changes
@ -444,8 +394,7 @@ The following changes must be implemented in the API:
1. `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as
described above.
1. The new subresource will be added to the pods API, validation added and
proxied to the kubelet.
1. The new subresource will be added to the pods API.
1. The API server must check for Ephemeral Containers when validating `attach`.
#### kubelet Implementation
@ -462,19 +411,16 @@ following steps:
be identified as a debug container.
1. `kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the
existing `startContainer()`.
1. The kubelet gains a `RunDebugContainer()` method which accepts a
`v1.EphemeralContainer` and triggers a pod sync to create the debug
container. The existing `generateAPIPodStatus()` will be update to also
populate `EphemeralContainers`.
1. The kubelet API gains the new `/ephemeralContainers/` endpoint to create the
Debug Container.
1. `syncPod()` will call `StartEphemeralContainer()` to start the Debug
Container. The existing `generateAPIPodStatus()` will be updated to also
populate `EphemeralContainers.Status`.
#### kubectl changes
In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a
`kubectl alpha` command to contain alpha features. We will add `kubectl alpha
debug` to invoke Debug Containers. `kubectl` does not use feature gates, so
`kubectl alpha debug` will be visible by default in `kubectl` 1.10 and return an
`kubectl alpha debug` will be visible by default in `kubectl` 1.11 and return an
error when used on a cluster with the feature disabled.
`kubectl describe pod` will report the contents of `EphemeralContainers` when