Merge pull request #1269 from verb/pod-troubleshooting-use-container
Use v1.Container in Debug Containers API
This commit is contained in: commit f1d6261f21

# Troubleshoot Running Pods

* Status: Implementing
* Version: Alpha
* Implementation Owner: @verb


Many developers of native Kubernetes applications wish to treat Kubernetes as an
execution platform for custom binaries produced by a build system. These users
can forgo the scripted OS install of traditional Dockerfiles and instead `COPY`
the output of their build system into a container image built `FROM scratch` or
a [distroless container image](https://github.com/GoogleCloudPlatform/distroless).
This confers several advantages:

1. **Minimal images** lower operational burden and reduce attack vectors.
1. **Immutable images** improve correctness and reliability.

A solution to troubleshoot arbitrary container images MUST:

* fetch troubleshooting utilities at debug time rather than at the time of pod
  creation
* be compatible with admission controllers and audit logging
* allow discovery of current debugging status
* support arbitrary runtimes via the CRI (possibly with reduced feature set)
* require no administrative access to the node
* have an excellent user experience (i.e. should be a feature of the platform
  rather than config-time trickery)
* have no _inherent_ side effects to the running container image
* v1.Container must be available for inspection by admission controllers

## Feature Summary

Any new debugging functionality will require training users. We can ease the
transition by building on an existing usage pattern. We will create a new
command, `kubectl debug`, which parallels an existing command, `kubectl exec`.
Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will
be similar but run a _container_ in a _pod_.

A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl
exec`, Debug Containers have status that is reported in `PodStatus` and
displayed by `kubectl describe pod`.

For example, the following command would attach to a newly created container in
a pod:

```
kubectl debug target-pod
```

This creates an interactive shell in a pod which can examine and signal other
processes in the pod. It has access to the same network and IPC as processes in
the pod. When [process namespace sharing](https://features.k8s.io/495) is
enabled, it can access the filesystem of other processes by `/proc/$PID/root`.
Debug Containers can enter arbitrary namespaces of another visible container via
`nsenter` when run with `CAP_SYS_ADMIN`.

_Please see the User Stories section for additional examples and Alternatives
Considered for the considerable list of other solutions we considered._

## Implementation Details

The implementation of `kubectl debug` closely mirrors the implementation of
`kubectl exec`, with most of the complexity implemented in the `kubelet`. How
functionality like this best fits into the Kubernetes API has been contentious.
In order to make progress, we will start with the smallest possible API change,
extending `/exec` to support Debug Containers, and iterate.

From the perspective of the user, there's a new command, `kubectl debug`, that
creates a Debug Container and attaches to its console. We believe a new command
will be less confusing for users than overloading `kubectl exec` with a new
subsequently be used to reattach and is reported by `kubectl describe`.

### Kubernetes API Changes

This will be implemented in the Core API to avoid new dependencies in the
kubelet. The user-level concept of a _Debug Container_ is implemented with the
API-level concept of an _Ephemeral Container_. The API doesn't require an
Ephemeral Container to be used as a Debug Container. It's intended as a general
purpose construct for running a short-lived process in a pod.

#### Pod Changes

Ephemeral Containers are represented in `PodSpec` and `PodStatus`:

```
type PodSpec struct {
  ...
  // List of user-initiated ephemeral containers to run in this pod.
  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
  // +optional
  EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,opt,name=ephemeralContainers"`
  ...
}
```
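
For illustration only, a pod that has had one ephemeral container added might serialize as follows. The pod name, container names, and images here are hypothetical, not part of the proposal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: target-pod
spec:
  containers:
  - name: app
    image: example/app:latest    # hypothetical application image
  ephemeralContainers:
  - spec:
      name: debugger
      image: debian              # utility image chosen at debug time
      stdin: true
      tty: true
```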

Debug Containers cannot be removed via the API and instead the process must
terminate. While not ideal, this parallels existing behavior of `kubectl exec`.
To kill a Debug Container one would `attach` and exit the process interactively
or create a new Debug Container to send a signal with `kill(1)` to the original
process.

#### Alternative 1: Debug Subresource

Rather than extending an existing subresource, we could create a new,
non-streaming `debug` subresource. We would create a new API Object:

```
// DebugContainer describes a container to attach to a running pod for troubleshooting.
type DebugContainer struct {
  metav1.TypeMeta
  metav1.ObjectMeta

  // Name is the name of the Debug Container. Its presence will cause
  // exec to create a Debug Container rather than performing a runtime exec.
  Name string `json:"name,omitempty" ...`

  // Image is an optional container image name that will be used for the Debug
  // Container in the specified Pod with Command as ENTRYPOINT. If omitted a
  // default image will be used.
  Image string `json:"image,omitempty" ...`
}
```

The pod would gain a new `/debug` subresource that allows the following:

1. A `POST` of a `PodDebugContainer` to
   `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create a Debug
   Container named `$NAME` running in pod `$POD_NAME`.
1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop
   the Debug Container `$NAME` in pod `$POD_NAME`.

Once created, a client would attach to the console of a debug container using
the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`.

However, this pattern does not resemble any other current usage of the API, so
we prefer to start with the chosen solution and reevaluate if we discover a
compelling reason.

#### Alternative 2: Declarative Configuration

Using subresources is an imperative style API where the client instructs the
kubelet to perform an action, but in general Kubernetes prefers declarative APIs
where the client declares a state for Kubernetes to enact.

We could implement this in a declarative manner by creating a new
`EphemeralContainer` type:

```
type EphemeralContainer struct {
  metav1.TypeMeta
  metav1.ObjectMeta

  Spec EphemeralContainerSpec
  Status v1.ContainerStatus
}
```

`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields
relevant to Debug Containers:

```
type EphemeralContainerSpec struct {
  // Target is the pod in which to run the EphemeralContainer.
  // Required.
  Target v1.ObjectReference

  Name string
  Image string
  Command []string
  Args []string
  ImagePullPolicy PullPolicy
  SecurityContext *SecurityContext
}
```

A new controller in the kubelet would watch for EphemeralContainers and
create/delete debug containers. `EphemeralContainer.Status` would be updated by
the kubelet at the same time it updates `ContainerStatus` for regular and init
containers. Clients would create a new `EphemeralContainer` object, wait for it
to be started and then attach using the pod's attach subresource and the name of
the `EphemeralContainer`.

Debugging is inherently imperative, however, rather than a state for Kubernetes
to enforce. Once a Debug Container is started it should not be automatically
restarted, for example. This solution imposes additional complexity and
dependencies on the kubelet, and it's not yet clear that the complexity is
justified.

### Debug Container Status

The status of a Debug Container is reported in a new field in `v1.PodStatus`:

```
type PodStatus struct {
  ...
  // Status for any Ephemeral Containers that are running in this pod.
  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
  // +optional
  EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
  ...
}
```
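
As a sketch of what this could look like on the wire, a pod with one running debug container might report status like the following. All names and values here are hypothetical:

```yaml
status:
  ephemeralContainerStatuses:
  - name: debugger
    image: debian
    ready: false
    restartCount: 0
    state:
      running:
        startedAt: "2018-08-01T00:00:00Z"
```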

`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and
`InitContainerStatuses`, but `EphemeralContainers` introduces a new type:

```
// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions
// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag.
type EphemeralContainer struct {
  // Spec describes the Ephemeral Container to be created.
  Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"`

  // If set, the name of the container from PodSpec that this ephemeral container targets.
  // The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container.
  // If not set then the ephemeral container is run in whatever namespaces are shared
  // for the pod.
  // +optional
  TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"`
}
```

`EphemeralContainerStatuses` is populated by the kubelet in the same way as
regular and init container statuses. This is sent to the API server and
displayed by `kubectl describe pod`.


Much of the utility of Ephemeral Containers comes from the ability to run a
container within the PID namespace of another container. `TargetContainerName`
allows targeting a container that doesn't share its PID namespace with the rest
of the pod. We must modify the CRI to enable this functionality (see below).
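
For example (container names hypothetical), an ephemeral container that should join the namespaces of a specific container would carry `targetContainerName` alongside its spec:

```yaml
ephemeralContainers:
- spec:
    name: debugger
    image: debian
  targetContainerName: app   # run in the namespaces of container "app"
```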

##### Alternative Considered: Omitting TargetContainerName

It would be simpler for the API, kubelet and kubectl if `EphemeralContainers`
were a `[]Container`, but as isolated PID namespaces will be the default for
some time, being able to target a container will provide a better user
experience.

#### Updates

Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
of fields which may be updated, and we could extend this to include
`EphemeralContainers`. The ability to add new containers is a large change for
Pod, however, and we'd like to begin conservatively by enforcing the following
best practices:

1. Ephemeral Containers lack guarantees for resources or execution, and they
   will never be automatically restarted. To avoid pods that depend on
   Ephemeral Containers, we allow their addition only in pod updates and
   disallow them during pod create.
1. Some fields of `v1.Container` imply a fundamental role in a pod. We will
   disallow the following fields in Ephemeral Containers: `resources`, `ports`,
   `livenessProbe`, `readinessProbe`, and `lifecycle`.
1. Cluster administrators may want to restrict access to Ephemeral Containers
   independent of other pod updates.

To enforce these restrictions and new permissions, we will introduce a new Pod
subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
via this subresource. `EphemeralContainerStatuses` is updated with everything
else in `Pod.Status` via `/status`.


To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and
`PUT`s the pod to `/ephemeralcontainers`.
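
A hedged sketch of such an update, with hypothetical pod and container names; the request body is the full pod object with the new entry appended:

```
PUT /api/v1/namespaces/default/pods/target-pod/ephemeralcontainers

# ...existing pod spec unchanged, plus:
spec:
  ephemeralContainers:
  - spec:
      name: debugger
      image: debian
      stdin: true
      tty: true
```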

The subresources `attach`, `exec`, `log`, and `portforward` are available for
Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
attach`, `kubectl exec`, `kubectl logs`, and `kubectl port-forward` will work
for Ephemeral Containers.

Once the pod is updated, the kubelet worker watching this pod will launch the
Ephemeral Container and update its status. The client is expected to watch for
the creation of the container status and then attach to the console of a debug
container using the existing attach endpoint,
`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
container occurring between its creation and attach will not be replayed, but it
can be viewed using `kubectl logs`.
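
For instance, assuming a debug container named `debugger` in `target-pod` (both names hypothetical), a client might run:

```
# Attach to the debug container's console
kubectl attach target-pod -c debugger -i -t

# View output emitted before the attach completed
kubectl logs target-pod -c debugger
```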

##### Alternative Considered: Standard Pod Updates

It would simplify initial implementation if we updated the pod spec via the
normal means, and switched to a new update subresource if required at a future
date. It's easier to begin with a too-restrictive policy than a too-permissive
one on which users come to rely, and we expect to be able to remove the
`/ephemeralcontainers` subresource prior to exiting alpha should it prove
unnecessary.

### Container Runtime Interface (CRI) changes

The CRI requires no changes for basic functionality, but it will need to be
updated to support container namespace targeting, as described in the
[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).


### Creating Debug Containers

To create a debug container, kubectl will take the following steps:

1. `kubectl` constructs an `EphemeralContainer` based on command line arguments
   and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod
   to the pod's `/ephemeralcontainers`.
1. The apiserver discards changes other than additions to
   `Pod.Spec.EphemeralContainers` and validates the pod update.
   1. Pod validation fails if the container spec contains fields disallowed for
      Ephemeral Containers or the same name as a container in the spec or
      `EphemeralContainers`.
   1. API resource versioning resolves update races.
1. The kubelet's pod watcher notices the update and triggers a `syncPod()`.
   During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
   for any new Ephemeral Container.
   1. `StartEphemeralContainer()` uses the existing `startContainer()` to
      start the Ephemeral Container.
   1. After initial creation, future invocations of `syncPod()` will publish
      its ContainerStatus but otherwise ignore the Ephemeral Container. It
      will exist for the life of the pod sandbox or until it exits. In no
      event will it be restarted.
1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which
   includes the new `EphemeralContainer`) by its normal, existing means.
1. The client performs an attach to the debug container's console.

There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
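
The name-collision portion of the validation step above can be sketched as follows. This is an illustrative sketch, not the actual apiserver code; the function name and the flattened list of existing container names are hypothetical:

```go
package main

import "fmt"

// validateEphemeralContainerName rejects a new ephemeral container whose name
// collides with a regular, init, or existing ephemeral container in the pod.
// (Hypothetical helper; real validation would walk the typed pod spec.)
func validateEphemeralContainerName(existingNames []string, name string) error {
	for _, n := range existingNames {
		if n == name {
			return fmt.Errorf("container name %q already in use", name)
		}
	}
	return nil
}

func main() {
	existing := []string{"app", "sidecar", "debugger-1"}

	// Reusing a name fails validation.
	fmt.Println(validateEphemeralContainerName(existing, "app"))
	// A fresh name passes.
	fmt.Println(validateEphemeralContainerName(existing, "debugger-2"))
}
```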

### Restarting and Reattaching Debug Containers

Debug Containers will not be restarted.

We want to be more user friendly by allowing re-use of the name of an exited
debug container, but this will be left for a future improvement.

One can reattach to a Debug Container using `kubectl attach`. When supported by
a runtime, multiple clients can attach to a single debug container and share the
terminal. This is supported by Docker.


### Killing Debug Containers

Debug containers will not be killed automatically unless the pod is destroyed.
Debug Containers will stop when their command exits, such as exiting a shell.
Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if
their connection is interrupted.


### Container Lifecycle Changes

Implementing debug requires no changes to the Container Runtime Interface as
it's the same operation as creating a regular container. The following changes
are necessary in the kubelet:

1. `SyncPod()` must not kill any Debug Container even though it is not part of
   the pod spec.
1. As an exception to the above, `SyncPod()` will kill Debug Containers when
   the pod sandbox changes since a lone Debug Container in an abandoned sandbox
   is not useful. Debug Containers are not automatically started in the new
   sandbox.
1. `convertStatusToAPIStatus()` must sort Debug Container status into
   `EphemeralContainerStatuses` as it does for `InitContainerStatuses`.
1. The kubelet must preserve `ContainerStatus` on debug containers for
   reporting.
1. Debug Containers must be excluded from calculation of pod phase and
   condition.

It's worth noting some things that do not change:

1. `KillPod()` already operates on all running containers returned by the
   runtime.
1. Containers created prior to this feature being enabled will have a
   `containerType` of `""`. Since this does not match `"EPHEMERAL"`, the
   special handling of Debug Containers is backwards compatible.

A future improvement to Ephemeral Containers could allow killing Debug
Containers when they're removed from `EphemeralContainers`, but it's not clear
that we want to allow this. Removing an Ephemeral Container spec makes it
unavailable for future authorization decisions (e.g. whether to authorize exec
in a pod that had a privileged Ephemeral Container).


### Security Considerations

Debug Containers have no additional privileges above what is available to any
`v1.Container`. It's the equivalent of configuring a shell container in a pod
spec except that it is created on demand.

Admission plugins must be updated to guard `/ephemeralcontainers`. They should
apply the same container image and security policy as for regular containers.


### Additional Consideration

troubleshooting causes a pod to exceed its resource limit it may be evicted.
1. There's an output stream race inherent to creating then attaching a
   container which causes output generated between the start and attach to go
   to the log rather than the client. This is not specific to Ephemeral
   Containers and exists because Kubernetes has no mechanism to attach a
   container prior to starting it. This larger issue will not be addressed by
   Ephemeral Containers, but Ephemeral Containers would benefit from future
   improvements or work arounds.
1. Ephemeral Containers should not be used to build services, which we've
   attempted to reflect in the API.

||||
|
||||
## Implementation Plan
|
||||
|
||||
### Alpha Release
|
||||
### 1.12: Initial Alpha Release
|
||||
|
||||
#### Goals and Non-Goals for Alpha Release
|
||||
|
||||
We're targeting an alpha release in Kubernetes 1.9 that includes the following
|
||||
We're targeting an alpha release in Kubernetes 1.12 that includes the following
|
||||
basic functionality:
|
||||
|
||||
* Support in the kubelet for creating debug containers in a running pod
|
||||
* A `kubectl debug` command to initiate a debug container
|
||||
* `kubectl describe pod` will list status of debug containers running in a pod
|
||||
1. Approval for basic core API changes to Pod
|
||||
1. Basic support in the kubelet for creating Ephemeral Containers
|
||||
|
||||
Functionality out of scope for 1.12:
|
||||
|
||||
* Killing running Ephemeral Containers by removing them from the Pod Spec.
|
||||
* Updating `pod.Spec.EphemeralContainers` when containers are garbage
|
||||
collected.
|
||||
* `kubectl` commands for creating Ephemeral Containers
|
||||
|
||||
Functionality will be hidden behind an alpha feature flag and disabled by
|
||||
default. The following are explicitly out of scope for the 1.9 alpha release:
|
||||
|
||||
* Exited Debug Containers will be garbage collected as regular containers and
|
||||
may disappear from the list of Debug Container Statuses.
|
||||
* Security Context for the Debug Container is not configurable. It will always
|
||||
be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
|
||||
* Image pull policy for the Debug Container is not configurable. It will
|
||||
always be run with `PullAlways`.
|
||||
|
||||
#### kubelet Implementation
|
||||
|
||||
Debug Containers are implemented in the kubelet's generic runtime manager.
|
||||
Performing this operation with a legacy (non-CRI) runtime will result in a not
|
||||
implemented error. Implementation in the kubelet will be split into the
|
||||
following steps:
|
||||
|
||||
##### Step 1: Container Type
|
||||
|
||||
The first step is to add a feature gate to ensure all changes are off by
|
||||
default. This will be added in the `pkg/features` `DefaultFeatureGate`.
|
||||
|
||||
The runtime manager stores metadata about containers in the runtime via labels
|
||||
(e.g. docker labels). These labels are used to populate the fields of
|
||||
`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug
|
||||
Containers differently in a few situations, we must add a new piece of metadata
|
||||
to distinguish Debug Containers from regular containers.
|
||||
|
||||
`startContainer()` will be updated to write a new label
|
||||
`io.kubernetes.container.type` to the runtime. Existing containers will be
|
||||
started with a type of `REGULAR` or `INIT`. When added in a subsequent step,
|
||||
Debug Containers will start with the type `EPHEMERAL`.
|
||||
|
||||
##### Step 2: Creation and Handling of Debug Containers
|
||||
|
||||
This step adds methods for creating debug containers, but doesn't yet modify the
|
||||
kubelet API. Since the runtime manager discards runtime (e.g. docker) labels
|
||||
after populating `kubecontainer.ContainerStatus`, the label value will be stored
|
||||
in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`.
|
||||
|
||||
The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container`
|
||||
and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented.
|
||||
Currently only the Generic Runtime Manager (i.e. the CRI) implements the
|
||||
`DebugContainerRunner` interface.
|
||||
|
||||
The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to
|
||||
create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug
|
||||
Containers unless the sandbox is restarted.
|
||||
|
||||
##### Step 3: kubelet API changes
|
||||
|
||||
The kubelet exposes the new functionality in its existing `/exec/` endpoint.
|
||||
`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls
|
||||
`RunDebugContainer()`, and performs the attach.
|
||||
|
||||
##### Step 4: Reporting EphemeralContainerStatus

The last major change to the kubelet is to populate
`v1.PodStatus.EphemeralContainerStatuses` based on the
`kubecontainer.ContainerStatus` for the Debug Container.

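Conceptually, only containers recorded with the `EPHEMERAL` type are surfaced in `EphemeralContainerStatuses`. The sketch below uses simplified stand-ins for `kubecontainer.ContainerStatus` and `v1.ContainerStatus`; the field and function names are assumptions for illustration.

```go
// internalStatus stands in for the kubelet's internal container status.
type internalStatus struct {
	Name    string
	Type    string // "REGULAR", "INIT", or "EPHEMERAL"
	Running bool
}

// apiContainerStatus stands in for the status published in the pod API.
type apiContainerStatus struct {
	Name  string
	State string
}

// ephemeralContainerStatuses picks out Debug Containers and converts them to
// the form published in PodStatus.EphemeralContainerStatuses.
func ephemeralContainerStatuses(all []internalStatus) []apiContainerStatus {
	var out []apiContainerStatus
	for _, s := range all {
		if s.Type != "EPHEMERAL" {
			continue // regular and init containers are reported elsewhere
		}
		state := "terminated"
		if s.Running {
			state = "running"
		}
		out = append(out, apiContainerStatus{Name: s.Name, State: state})
	}
	return out
}
```
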
#### Kubernetes API Changes

There are two changes to be made to the Kubernetes API, which will be made
independently:

1. `v1.PodExecOptions` must be extended with new fields.
1. `v1.PodStatus` gains a new field to hold Debug Container statuses.

In all cases, new fields will be prefixed with `Alpha` for the duration of this
feature's alpha status.

#### kubectl changes

In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a
`kubectl alpha` command to contain alpha features. We will add `kubectl alpha
debug` to invoke Debug Containers. `kubectl` does not use feature gates, so
`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an
error when used on a cluster with the feature disabled.

`kubectl describe pod` will report the contents of `EphemeralContainerStatuses`
when not empty, as a non-empty list means the feature is enabled. The field will
be hidden when empty.

## Appendices

container image distribution mechanisms to fetch images when the debug command
is run.

**Respect admission restrictions.** Requests from kubectl are proxied through
the apiserver and so are available to existing
[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/).
Plugins already exist to intercept `exec` and `attach` calls, but extending this
to support `debug` has not yet been scoped.

**Allow introspection of pod state using existing tools**. The list of
`EphemeralContainerStatuses` is never truncated. If a debug container has run in

active debug container.

### Appendix 3: Alternatives Considered

#### Container Spec in PodStatus

Originally there was a desire to keep the pod spec immutable, so we explored
modifying only the pod status. An `EphemeralContainer` would contain a Spec, a
Status and a Target:

```
// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
type EphemeralContainer struct {
	metav1.TypeMeta `json:",inline"`

	// Spec describes the Ephemeral Container to be created.
	Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`

	// Most recently observed status of the container.
	// This data may not be up to date.
	// Populated by the system.
	// Read-only.
	// +optional
	Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`

	// If set, the name of the container from PodSpec that this ephemeral container targets.
	// If not set then the ephemeral container is run in whatever namespaces are shared
	// for the pod.
	TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
}
```

Ephemeral Containers for a pod would be listed in the pod's status:

```
type PodStatus struct {
	...
	// List of user-initiated ephemeral containers that have been run in this pod.
	// +optional
	EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
}
```

To create a new Ephemeral Container, one would append a new `EphemeralContainer`
with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
in the API. Users cannot normally modify the pod status, so we'd create a new
subresource `/ephemeralcontainers` that allows an update of solely
`EphemeralContainers` and enforces append-only semantics.

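The append-only semantics can be sketched as a validation step. `EphemeralContainer` is reduced to its name here, and the function name and reason strings are illustrative assumptions, not actual apiserver code.

```go
// EphemeralContainer is a reduced stand-in for the API type above.
type EphemeralContainer struct {
	Name string
}

// validateAppendOnly returns an empty string when updated only appends new
// entries to old, and a rejection reason otherwise: existing entries may not
// be changed or removed.
func validateAppendOnly(old, updated []EphemeralContainer) string {
	if len(updated) < len(old) {
		return "ephemeral containers may not be removed"
	}
	for i := range old {
		if updated[i] != old[i] {
			return "existing ephemeral containers may not be modified"
		}
	}
	return ""
}
```
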
Since we have a requirement to describe the Ephemeral Container with a
`v1.Container`, this led to a "spec in status" that seemed to violate API best
practices. It was confusing, and it required added complexity in the kubelet to
persist and publish user intent, which is rightfully the job of the apiserver.

#### Extend the Existing Exec API ("exec++")

A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
"executing" container images. The current `/exec` endpoint must implement `GET`
to support streaming for all clients. We don't want to encode a (potentially
large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
with the specific fields required for creating a Debug Container:

```
// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
	...
	// EphemeralContainerName is the name of an ephemeral container in which the
	// command ought to be run. Either both EphemeralContainerName and
	// EphemeralContainerImage fields must be set, or neither.
	EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`

	// EphemeralContainerImage is the image of an ephemeral container in which the command
	// ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
	// fields must be set, or neither.
	EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
}
```

After creating the Ephemeral Container, the kubelet would upgrade the connection
to streaming and perform an attach to the container's console. If disconnected,
the Ephemeral Container could be reattached using the pod's `/attach` endpoint
with `EphemeralContainerName`.

Ephemeral Containers could not be removed via the API; instead, the process
must terminate. While not ideal, this parallels the existing behavior of
`kubectl exec`. To kill an Ephemeral Container one would `attach` and exit the
process interactively, or create a new Ephemeral Container to send a signal
with `kill(1)` to the original process.

Since the user cannot specify the `v1.Container`, this approach sacrifices a
great deal of flexibility. This solution still requires the kubelet to publish a
`Container` spec in the `PodStatus` that can be examined for future admission
decisions and so retains many of the downsides of the Container Spec in
PodStatus approach.

#### Ephemeral Container Controller

Kubernetes prefers declarative APIs where the client declares a state for
Kubernetes to enact. We could implement this in a declarative manner by creating
a new `EphemeralContainer` type:

```
type EphemeralContainer struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   v1.Container
	Status v1.ContainerStatus
}
```

A new controller in the kubelet would watch for EphemeralContainers and
create/delete debug containers. `EphemeralContainer.Status` would be updated by
the kubelet at the same time it updates `ContainerStatus` for regular and init
containers. Clients would create a new `EphemeralContainer` object, wait for it
to be started and then attach using the pod's attach subresource and the name of
the `EphemeralContainer`.

A new controller is a significant amount of complexity to add to the kubelet,
especially considering that the kubelet is already watching for changes to pods.
The kubelet would have to be modified to create containers in a pod from
multiple config sources. SIG Node strongly prefers to minimize kubelet
complexity.

#### Mutable Pod Spec Containers

Rather than adding to the pod API, we could instead make the pod spec mutable so
the client can generate an update adding a container. `SyncPod()` has no issues
adding the container to the pod at that point, but an immutable pod spec has
been a basic assumption and best practice in Kubernetes. Changing this
assumption complicates the requirements of the kubelet state machine. Since the
kubelet was not written with this in mind, we should expect such a change would
create bugs we cannot predict.

#### Image Exec

An earlier version of this proposal suggested simply adding an `Image` parameter
to the exec API. This would run an ephemeral container in the pod namespaces
without adding it to the pod spec or status. This container would exist only as
long as the process it ran. This parallels the current kubectl exec, including
its lack of transparency. We could add constructs to track and report on both
traditional exec processes and exec containers. In the end this failed to meet
our transparency requirements.

#### Attaching Container Type Volume

this simplifies the solution by working within the existing constraints of

If Kubernetes supported the concept of an "inactive" container, we could
configure it as part of a pod and activate it at debug time. In order to avoid
coupling the debug tool versions with those of the running containers, we would
want to ensure the debug image was pulled at debug time. The container could
then be run with a TTY and attached using kubectl.

The downside of this approach is that it requires prior configuration. In
addition to requiring prior consideration, it would increase boilerplate config.

than a feature of the platform.

#### Implicit Empty Volume

Kubernetes could implicitly create an EmptyDir volume for every pod which would
then be available as a target for either the kubelet or a sidecar to extract a
package of binaries.

Users would have to be responsible for hosting a package build and distribution
infrastructure or rely on a public one. The complexity of this solution makes it
undesirable.

#### Standalone Pod in Shared Namespace ("Debug Pod")

Rather than inserting a new container into a pod namespace, Kubernetes could
instead support creating a new pod with container namespaces shared with

useful, the containers in this "Debug Pod" should be run inside the namespaces

(network, pid, etc) of the target pod but remain in a separate resource group
(e.g. cgroup for container-based runtimes).

This would be a rather large change for pod, which is currently treated as an
atomic unit. The Container Runtime Interface has no provisions for sharing
outside of a pod sandbox and would need a refactor. This could be a complicated
change for non-container runtimes (e.g. hypervisor runtimes) which have more
rigid boundaries between pods.

This is pushing the complexity of the solution from the kubelet to the runtimes.
Minimizing change to the Kubernetes API is not worth the increased complexity
for the kubelet and runtimes.

It could also be possible to implement a Debug Pod as a privileged pod that runs
in the host namespace and interacts with the runtime directly to run a new
container in the appropriate namespace. This solution would be runtime-specific
and pushes the complexity of debugging to the user. Additionally, requiring
node-level access to debug a pod does not meet our requirements.

#### Exec from Node

coupling it with container images.

* [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140)
* [CRI Tracking Issue](https://issues.k8s.io/28789)
* [CRI: expose optional runtime features](https://issues.k8s.io/32803)
* [Resource QoS in Kubernetes](resource-qos.md)
* Related Features
  * [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across
    containers in a pod