Merge pull request #1269 from verb/pod-troubleshooting-use-container

Use v1.Container in Debug Containers API
k8s-ci-robot 2018-08-23 17:15:37 -07:00 committed by GitHub
commit f1d6261f21
1 changed file with 329 additions and 364 deletions


# Troubleshoot Running Pods
* Status: Implementing
* Version: Alpha
* Implementation Owner: @verb
Many developers of native Kubernetes applications wish to treat Kubernetes as an
execution platform for custom binaries produced by a build system. These users
can forgo the scripted OS install of traditional Dockerfiles and instead `COPY`
the output of their build system into a container image built `FROM scratch` or
a
[distroless container image](https://github.com/GoogleCloudPlatform/distroless).
This confers several advantages:
1. **Minimal images** lower operational burden and reduce attack vectors.
1. **Immutable images** improve correctness and reliability.
A solution to troubleshoot arbitrary container images MUST:
* fetch troubleshooting utilities at debug time rather than at the time of pod
creation
* be compatible with admission controllers and audit logging
* allow discovery of current debugging status
* support arbitrary runtimes via the CRI (possibly with reduced feature set)
* require no administrative access to the node
* have an excellent user experience (i.e. should be a feature of the platform
rather than config-time trickery)
* have no _inherent_ side effects to the running container image
* `v1.Container` must be available for inspection by admission controllers
## Feature Summary
Any new debugging functionality will require training users. We can ease the
transition by building on an existing usage pattern. We will create a new
command, `kubectl debug`, which parallels an existing command, `kubectl exec`.
Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will
be similar but run a _container_ in a _pod_.
A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl
exec`, Debug Containers have status that is reported in `PodStatus` and
displayed by `kubectl describe pod`.
For example, the following command would attach to a newly created container in
a pod:
```
kubectl debug target-pod
```
This creates an interactive shell in a pod which can examine and signal other
processes in the pod. It has access to the same network and IPC as processes in
the pod. When [process namespace sharing](https://features.k8s.io/495) is
enabled, it can access the filesystem of other processes by `/proc/$PID/root`.
Debug Containers can enter arbitrary namespaces of another visible container via
`nsenter` when run with `CAP_SYS_ADMIN`.
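For instance, with a shared PID namespace, a process in a Debug Container can
read another process's filesystem directly through procfs. A minimal sketch in
Go, where the PID (12) and the path are hypothetical:

```
// Minimal sketch: read a file from another process's filesystem via
// /proc/$PID/root, from inside a container sharing the PID namespace.
package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// PID 12 is a hypothetical process belonging to another container.
	data, err := ioutil.ReadFile("/proc/12/root/etc/os-release")
	if err != nil {
		// Fails if the process is gone or the PID namespace isn't shared.
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%s", data)
}
```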
_Please see the User Stories section for additional examples and Alternatives
Considered for the considerable list of other solutions we considered._
## Implementation Details
From the perspective of the user, there's a new command, `kubectl debug`, that
creates a Debug Container and attaches to its console. We believe a new command
will be less confusing for users than overloading `kubectl exec` with a new
behavior. The name of the Debug Container can subsequently be used to reattach
and is reported by `kubectl describe`.
### Kubernetes API Changes
This will be implemented in the Core API to avoid new dependencies in the
kubelet. The user-level concept of a _Debug Container_ is implemented with the
API-level concept of an _Ephemeral Container_. The API doesn't require an
Ephemeral Container to be used as a Debug Container; it's intended as a
general-purpose construct for running a short-lived process in a pod.
#### Pod Changes
Ephemeral Containers are represented in `PodSpec` and `PodStatus`:
```
type PodSpec struct {
...
// List of user-initiated ephemeral containers to run in this pod.
// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
// +optional
EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,opt,name=ephemeralContainers"`
}
```
### Debug Container Status
The status of a Debug Container is reported in a new field in `v1.PodStatus`:
```
type PodStatus struct {
...
// Status for any Ephemeral Containers that are running in this pod.
// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
// +optional
EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
}
```
This status is only populated for Debug Containers, but there's interest in
tracking status for traditional exec in a similar manner.

`EphemeralContainerStatuses` is populated by the kubelet in the same way as
regular and init container statuses. This is sent to the API server and
displayed by `kubectl describe pod`.
`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and
`InitContainerStatuses`, but `EphemeralContainers` introduces a new type:
```
// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions
// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag.
type EphemeralContainer struct {
// Spec describes the Ephemeral Container to be created.
Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"`
// If set, the name of the container from PodSpec that this ephemeral container targets.
// The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container.
// If not set then the ephemeral container is run in whatever namespaces are shared
// for the pod.
// +optional
TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"`
}
```
Much of the utility of Ephemeral Containers comes from the ability to run a
container within the PID namespace of another container. `TargetContainerName`
allows targeting a container that doesn't share its PID namespace with the rest
of the pod. We must modify the CRI to enable this functionality (see below).
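As an illustration, a Debug Container that targets a specific container might
be constructed as in the following sketch, using the proposed types above; the
`debugger` and `app` names are hypothetical:

```
// Sketch only: an Ephemeral Container that joins the namespaces of the
// "app" container. Field values are illustrative.
ec := EphemeralContainer{
	Spec: Container{
		Name:    "debugger",
		Image:   "debian",
		Command: []string{"sh"},
		Stdin:   true,
		TTY:     true,
	},
	// Run in the namespaces (IPC, PID, etc) of "app" rather than in
	// whatever namespaces are shared for the pod.
	TargetContainerName: "app",
}
```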
##### Alternative Considered: Omitting TargetContainerName
It would be simpler for the API, kubelet, and kubectl if `EphemeralContainers`
were a `[]Container`, but as isolated PID namespaces will be the default for
some time, being able to target a container will provide a better user
experience.
#### Updates
Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
of fields which may be updated, and we could extend this to include
`EphemeralContainers`. The ability to add new containers is a large change for
Pod, however, and we'd like to begin conservatively by enforcing the following
best practices:
1. Ephemeral Containers lack guarantees for resources or execution, and they
will never be automatically restarted. To avoid pods that depend on
Ephemeral Containers, we allow their addition only in pod updates and
disallow them during pod create.
1. Some fields of `v1.Container` imply a fundamental role in a pod. We will
disallow the following fields in Ephemeral Containers: `resources`, `ports`,
`livenessProbe`, `readinessProbe`, and `lifecycle`.
1. Cluster administrators may want to restrict access to Ephemeral Containers
independent of other pod updates.
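To make the second restriction concrete, validation of an Ephemeral Container
might look something like the following sketch; the helper name and error
messages are assumptions, not part of this proposal:

```
// Sketch of the disallowed-field validation described above.
func validateEphemeralContainer(ec EphemeralContainer) error {
	c := ec.Spec
	if c.Resources.Limits != nil || c.Resources.Requests != nil {
		return fmt.Errorf("container %q: resources are disallowed", c.Name)
	}
	if len(c.Ports) > 0 {
		return fmt.Errorf("container %q: ports are disallowed", c.Name)
	}
	if c.LivenessProbe != nil || c.ReadinessProbe != nil || c.Lifecycle != nil {
		return fmt.Errorf("container %q: probes and lifecycle are disallowed", c.Name)
	}
	return nil
}
```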
To enforce these restrictions and new permissions, we will introduce a new Pod
subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
via this subresource. `EphemeralContainerStatuses` is updated with everything
else in `Pod.Status` via `/status`.
To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and
`PUT`s the pod to `/ephemeralcontainers`.
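A client-side sketch of this update using client-go's REST client; the
`clientset` variable, namespace, and pod name are assumptions, and
`Spec.EphemeralContainers` exists only in the proposed API:

```
// Hypothetical sketch: append an Ephemeral Container (ec) and PUT the
// pod to the proposed /ephemeralcontainers subresource.
pod, err := clientset.CoreV1().Pods("default").Get("target-pod", metav1.GetOptions{})
if err != nil {
	return err
}
pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers, ec)
err = clientset.CoreV1().RESTClient().Put().
	Namespace("default").
	Resource("pods").
	Name("target-pod").
	SubResource("ephemeralcontainers").
	Body(pod).
	Do().
	Error()
```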
The subresources `attach`, `exec`, `log`, and `portforward` are available for
Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
attach`, `kubectl exec`, `kubectl logs`, and `kubectl port-forward` will work
for Ephemeral Containers.
Once the pod is updated, the kubelet worker watching this pod will launch the
Ephemeral Container and update its status. The client is expected to watch for
the creation of the container status and then attach to the console of a debug
container using the existing attach endpoint,
`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
container occurring between its creation and attach will not be replayed, but it
can be viewed using `kubectl logs`.
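A sketch of that client-side wait, polling `EphemeralContainerStatuses` until
the new container is running; the interval, timeout, and names are assumptions:

```
// Sketch only: wait for the Debug Container to be reported as running
// before attaching.
err := wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
	pod, err := clientset.CoreV1().Pods("default").Get("target-pod", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, s := range pod.Status.EphemeralContainerStatuses {
		if s.Name == "debugger" && s.State.Running != nil {
			return true, nil // safe to attach now
		}
	}
	return false, nil
})
```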
##### Alternative Considered: Standard Pod Updates
It would simplify initial implementation if we updated the pod spec via the
normal means, and switched to a new update subresource if required at a future
date. It's easier to begin with a too-restrictive policy than a too-permissive
one on which users come to rely, and we expect to be able to remove the
`/ephemeralcontainers` subresource prior to exiting alpha should it prove
unnecessary.
### Container Runtime Interface (CRI) changes
The CRI requires no changes for basic functionality, but it will need to be
updated to support container namespace targeting, as described in the
[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).
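In Go terms, the extended CRI message described by that proposal would look
roughly like the following sketch; the field names are assumptions based on
that proposal:

```
// Sketch of a CRI NamespaceOption that can target another container's
// namespaces. NamespaceMode would gain a TARGET value alongside POD,
// CONTAINER, and NODE.
type NamespaceOption struct {
	Network NamespaceMode
	Pid     NamespaceMode
	Ipc     NamespaceMode
	// TargetId names the container whose namespaces should be joined
	// when a mode of TARGET is used.
	TargetId string
}
```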
### Creating Debug Containers
To create a debug container, kubectl will take the following steps:
1. `kubectl` constructs an `EphemeralContainer` based on command line arguments
   and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod
   to the pod's `/ephemeralcontainers`.
1. The apiserver discards changes other than additions to
   `Pod.Spec.EphemeralContainers` and validates the pod update.
   1. Pod validation fails if the container spec contains fields disallowed
      for Ephemeral Containers or the same name as a container in the spec or
      `EphemeralContainers`.
   1. API resource versioning resolves update races.
1. The kubelet's pod watcher notices the update and triggers a `syncPod()`.
   During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
   for any new Ephemeral Container.
   1. `StartEphemeralContainer()` uses the existing `startContainer()` to
      start the Ephemeral Container.
   1. After initial creation, future invocations of `syncPod()` will publish
      its ContainerStatus but otherwise ignore the Ephemeral Container. It
      will exist for the life of the pod sandbox or until it exits. In no
      event will it be restarted.
1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which
   includes the new `EphemeralContainer`) by its normal, existing means.
1. The client performs an attach to the debug container's console.
There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
### Restarting and Reattaching Debug Containers
Debug Containers will not be restarted. We want to be more user-friendly by
allowing re-use of the name of an exited debug container, but this will be left
for a future improvement.
One can reattach to a Debug Container using `kubectl attach`. When supported by
a runtime, multiple clients can attach to a single debug container and share the
terminal. This is supported by Docker.
### Killing Debug Containers
Debug containers will not be killed automatically unless the pod is destroyed.
Debug Containers will stop when their command exits, such as exiting a shell.
Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if
their connection is interrupted.
### Container Lifecycle Changes
Implementing debug requires no changes to the Container Runtime Interface as
it's the same operation as creating a regular container. The following changes
are necessary in the kubelet:
1. `SyncPod()` must not kill any Debug Container even though it is not part of
the pod spec.
1. As an exception to the above, `SyncPod()` will kill Debug Containers when
the pod sandbox changes, since a lone Debug Container in an abandoned sandbox
is not useful. Debug Containers are not automatically started in the new
sandbox.
1. `convertStatusToAPIStatus()` must sort Debug Container statuses into
`EphemeralContainerStatuses`, as it does for `InitContainerStatuses`.
1. The kubelet must preserve `ContainerStatus` on debug containers for
reporting.
1. Debug Containers must be excluded from calculation of pod phase and
condition.
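A sketch of the special-casing these items imply; the `Type` field and
`"EPHEMERAL"` constant refer to the container type runtime label discussed
below, and all names are assumptions rather than settled implementation:

```
// Sketch only: skip lifecycle management for Debug Containers in SyncPod().
for _, status := range podStatus.ContainerStatuses {
	if status.Type == "EPHEMERAL" {
		if sandboxChanged {
			// A lone Debug Container in an abandoned sandbox is not
			// useful; kill it and don't restart it in the new sandbox.
			killContainer(status)
		}
		continue // never restarted; excluded from pod phase and condition
	}
	// ... existing handling for regular and init containers ...
}
```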
It's worth noting some things that do not change:
1. `KillPod()` already operates on all running containers returned by the
runtime.
1. Containers created prior to this feature being enabled will have a
`containerType` of `""`. Since this does not match `"EPHEMERAL"`, the special
handling of Debug Containers is backwards compatible.
A future improvement to Ephemeral Containers could allow killing Debug
Containers when they're removed from `EphemeralContainers`, but it's not clear
that we want to allow this. Removing an Ephemeral Container spec makes it
unavailable for future authorization decisions (e.g. whether to authorize exec
in a pod that had a privileged Ephemeral Container).
### Security Considerations
Debug Containers have no additional privileges above what is available to any
`v1.Container`. It's the equivalent of configuring a shell container in a pod
spec except that it is created on demand.
Admission plugins must be updated to guard `/ephemeralcontainers`. They should
apply the same container image and security policy as for regular containers.
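As an illustration, an admission plugin guarding the new subresource might
look like this sketch; the `imagePolicy` type and `imageAllowed` helper are
hypothetical:

```
// Hypothetical sketch: enforce image policy for Ephemeral Containers.
func (p *imagePolicy) Validate(a admission.Attributes) error {
	if a.GetResource().Resource != "pods" || a.GetSubresource() != "ephemeralcontainers" {
		return nil
	}
	pod := a.GetObject().(*api.Pod)
	for _, ec := range pod.Spec.EphemeralContainers {
		if !p.imageAllowed(ec.Spec.Image) {
			return admission.NewForbidden(a, fmt.Errorf("image %q is not allowed", ec.Spec.Image))
		}
	}
	return nil
}
```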
### Additional Considerations
1. If troubleshooting causes a pod to exceed its resource limit it may be
   evicted.
1. There's an output stream race inherent to creating then attaching a
container which causes output generated between the start and attach to go
to the log rather than the client. This is not specific to Ephemeral
Containers and exists because Kubernetes has no mechanism to attach a
container prior to starting it. This larger issue will not be addressed by
Ephemeral Containers, but Ephemeral Containers would benefit from future
improvements or workarounds.
1. Ephemeral Containers should not be used to build services, which we've
attempted to reflect in the API.
## Implementation Plan
### 1.12: Initial Alpha Release
#### Goals and Non-Goals for Alpha Release
We're targeting an alpha release in Kubernetes 1.12 that includes the following
basic functionality:
1. Approval for basic core API changes to Pod
1. Basic support in the kubelet for creating Ephemeral Containers
Functionality out of scope for 1.12:
* Killing running Ephemeral Containers by removing them from the Pod Spec.
* Updating `pod.Spec.EphemeralContainers` when containers are garbage
collected.
* `kubectl` commands for creating Ephemeral Containers
Functionality will be hidden behind an alpha feature flag and disabled by
default.
## Appendices
container image distribution mechanisms to fetch images when the debug command
is run.
**Respect admission restrictions.** Requests from kubectl are proxied through
the apiserver and so are available to existing
[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/).
Plugins already exist to intercept `exec` and `attach` calls, but extending this
to support `debug` has not yet been scoped.
**Allow introspection of pod state using existing tools**. The list of
`EphemeralContainerStatuses` is never truncated. If a debug container has run in
### Appendix 3: Alternatives Considered
#### Container Spec in PodStatus
Originally there was a desire to keep the pod spec immutable, so we explored
modifying only the pod status. An `EphemeralContainer` would contain a Spec, a
Status and a Target:
```
// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
type EphemeralContainer struct {
metav1.TypeMeta `json:",inline"`
// Spec describes the Ephemeral Container to be created.
Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
// Most recently observed status of the container.
// This data may not be up to date.
// Populated by the system.
// Read-only.
// +optional
Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
// If set, the name of the container from PodSpec that this ephemeral container targets.
// If not set then the ephemeral container is run in whatever namespaces are shared
// for the pod.
TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
}
```
Ephemeral Containers for a pod would be listed in the pod's status:
```
type PodStatus struct {
...
// List of user-initiated ephemeral containers that have been run in this pod.
// +optional
EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
}
```
To create a new Ephemeral Container, one would append a new `EphemeralContainer`
with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
in the API. Users cannot normally modify the pod status, so we'd create a new
subresource `/ephemeralcontainers` that allows an update of solely
`EphemeralContainers` and enforces append-only semantics.
Since we have a requirement to describe the Ephemeral Container with a
`v1.Container`, this led to a "spec in status" that seemed to violate API best
practices. It was confusing, and it required added complexity in the kubelet to
persist and publish user intent, which is rightfully the job of the apiserver.
#### Extend the Existing Exec API ("exec++")
A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
"executing" container images. The current `/exec` endpoint must implement `GET`
to support streaming for all clients. We don't want to encode a (potentially
large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
with the specific fields required for creating a Debug Container:
```
// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
...
// EphemeralContainerName is the name of an ephemeral container in which the
// command ought to be run. Either both EphemeralContainerName and
// EphemeralContainerImage fields must be set, or neither.
EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
// EphemeralContainerImage is the image of an ephemeral container in which the command
// ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
// fields must be set, or neither.
EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
}
```
After creating the Ephemeral Container, the kubelet would upgrade the connection
to streaming and perform an attach to the container's console. If disconnected,
the Ephemeral Container could be reattached using the pod's `/attach` endpoint
with `EphemeralContainerName`.
Ephemeral Containers could not be removed via the API and instead the process
must terminate. While not ideal, this parallels existing behavior of `kubectl
exec`. To kill an Ephemeral Container one would `attach` and exit the process
interactively or create a new Ephemeral Container to send a signal with
`kill(1)` to the original process.
Since the user cannot specify the `v1.Container`, this approach sacrifices a
great deal of flexibility. This solution still requires the kubelet to publish a
`Container` spec in the `PodStatus` that can be examined for future admission
decisions and so retains many of the downsides of the Container Spec in
PodStatus approach.
#### Ephemeral Container Controller
Kubernetes prefers declarative APIs where the client declares a state for
Kubernetes to enact. We could implement this in a declarative manner by creating
a new `EphemeralContainer` type:
```
type EphemeralContainer struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec v1.Container
Status v1.ContainerStatus
}
```
A new controller in the kubelet would watch for EphemeralContainers and
create/delete debug containers. `EphemeralContainer.Status` would be updated by
the kubelet at the same time it updates `ContainerStatus` for regular and init
containers. Clients would create a new `EphemeralContainer` object, wait for it
to be started and then attach using the pod's attach subresource and the name of
the `EphemeralContainer`.
A new controller is a significant amount of complexity to add to the kubelet,
especially considering that the kubelet is already watching for changes to pods.
The kubelet would have to be modified to create containers in a pod from
multiple config sources. SIG Node strongly prefers to minimize kubelet
complexity.
#### Mutable Pod Spec Containers
Rather than adding to the pod API, we could instead make the pod spec mutable so
the client can generate an update adding a container. `SyncPod()` has no issues
adding the container to the pod at that point, but an immutable pod spec has
been a basic assumption and best practice in Kubernetes. Changing this
assumption complicates the requirements of the kubelet state machine. Since the
kubelet was not written with this in mind, we should expect such a change would
create bugs we cannot predict.
#### Image Exec
An earlier version of this proposal suggested simply adding an `Image` parameter to
the exec API. This would run an ephemeral container in the pod namespaces
without adding it to the pod spec or status. This container would exist only as
long as the process it ran. This parallels the current kubectl exec, including
its lack of transparency. We could add constructs to track and report on both
traditional exec processes and exec containers. In the end this failed to meet our
transparency requirements.
#### Attaching Container Type Volume
#### Inactive container
If Kubernetes supported the concept of an "inactive" container, we could
configure it as part of a pod and activate it at debug time. In order to avoid
coupling the debug tool versions with those of the running containers, we would
want to ensure the debug image was pulled at debug time. The container could
then be run with a TTY and attached using kubectl.
The downside of this approach is that it requires prior configuration. In
addition to requiring prior consideration, it would increase boilerplate config.
#### Implicit Empty Volume
Kubernetes could implicitly create an EmptyDir volume for every pod which would
then be available as a target for either the kubelet or a sidecar to extract a
package of binaries.
Users would have to be responsible for hosting a package build and distribution
infrastructure or rely on a public one. The complexity of this solution makes it
undesirable.
#### Standalone Pod in Shared Namespace ("Debug Pod")
Rather than inserting a new container into a pod namespace, Kubernetes could
instead support creating a new pod with container namespaces shared with
another pod. To be useful, the containers in this "Debug Pod" should be run
inside the namespaces
(network, pid, etc) of the target pod but remain in a separate resource group
(e.g. cgroup for container-based runtimes).
This would be a rather large change for pod, which is currently treated as an
atomic unit. The Container Runtime Interface has no provisions for sharing
outside of a pod sandbox and would need a refactor. This could be a complicated
change for non-container runtimes (e.g. hypervisor runtimes) which have more
rigid boundaries between pods.
This is pushing the complexity of the solution from the kubelet to the runtimes.
Minimizing change to the Kubernetes API is not worth the increased complexity
for the kubelet and runtimes.
It could also be possible to implement a Debug Pod as a privileged pod that runs
in the host namespace and interacts with the runtime directly to run a new
container in the appropriate namespace. This solution would be runtime-specific
and pushes the complexity of debugging to the user. Additionally, requiring
node-level access to debug a pod does not meet our requirements.
#### Exec from Node
## References
* [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140)
* [CRI Tracking Issue](https://issues.k8s.io/28789)
* [CRI: expose optional runtime features](https://issues.k8s.io/32803)
* [Resource QoS in Kubernetes](resource-qos.md)
* Related Features
* [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across
containers in a pod