Redirect shareProcessNamespace proposal to KEP
This commit is contained in:
parent
5d62001346
commit
0cfe22d04a
|
@ -1,319 +1,10 @@
|
|||
# Shared PID Namespace
|
||||
|
||||
* Status: Pending
|
||||
* Version: Alpha
|
||||
* Status: Superseded
|
||||
* Version: N/A
|
||||
* Implementation Owner: [@verb](https://github.com/verb)
|
||||
|
||||
## Motivation
|
||||
The Shared PID Namespace proposal has moved to the
|
||||
[Shared PID Namespace KEP][shared-pid-kep].
|
||||
|
||||
Pods share namespaces where possible, but support for sharing the PID namespace
|
||||
had not been defined due to lack of support in Docker. This created an implicit
|
||||
API on which certain container images now rely. This document proposes adding
|
||||
support for sharing a process namespace between containers in a pod while
|
||||
maintaining backwards compatibility with the existing implicit API.
|
||||
|
||||
## Proposal
|
||||
|
||||
### Goals and Non-Goals
|
||||
|
||||
Goals include:
|
||||
|
||||
* Backwards compatibility with container images expecting `pid == 1` semantics
|
||||
* Per-pod configuration of PID namespace sharing
|
||||
* Ability to change default sharing behavior in `v2.Pod`
|
||||
|
||||
Non-goals include:
|
||||
|
||||
* Creating a general purpose container init solution
|
||||
* Multiple shared PID namespaces per pod
|
||||
* Per-container configuration of PID namespace sharing
|
||||
|
||||
### Summary
|
||||
|
||||
We will add support for configuring pod-shared process namespaces by adding a
|
||||
new boolean field `ShareProcessNamespace` to the pod spec. The default to false
|
||||
means that each container will have a separate process namespace. When set to
|
||||
true, all containers in the pod will share a single process namespace.
|
||||
|
||||
The Container Runtime Interface (CRI) will be updated to support three namespace
|
||||
modes: Container, Pod & Node. The Runtime Manager will translate the pod spec
|
||||
into one of these modes as follows:
|
||||
|
||||
Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode
|
||||
--------------------------- | ------------- | ------------
|
||||
false | false | Container
|
||||
false | true | Node
|
||||
true | false | Pod
|
||||
true | true | *Error*
|
||||
|
||||
If a runtime does not implement a particular PID mode, it must return an error.
|
||||
For reference, Docker will support all three modes when using version >= 1.13.1.
|
||||
|
||||
The shared PID functionality will be hidden behind a new feature gate in both
|
||||
the API server and the kubelet, and the existing `--docker-disable-shared-pid`
|
||||
flag will be removed from the kubelet, subject to [deprecation
|
||||
policy](https://kubernetes.io/docs/reference/deprecation-policy/).
|
||||
|
||||
## User Experience
|
||||
|
||||
### Use Cases
|
||||
|
||||
Sharing a PID namespace between containers in a pod is discussed in
|
||||
[#1615](https://issues.k8s.io/1615) and enables:
|
||||
|
||||
1. signaling between containers, which is useful for side cars (e.g. for
|
||||
signaling a daemon process after rotating logs).
|
||||
1. easier troubleshooting of pods.
|
||||
1. addressing [Docker's zombie
|
||||
problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)
|
||||
by reaping orphaned zombies in the infra container.
|
||||
|
||||
### Behavioral Changes
|
||||
|
||||
Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's
|
||||
a significant departure from the traditional behavior of Docker. This may break
|
||||
container images and development patterns that have come to rely on process
|
||||
isolation. Notably:
|
||||
|
||||
1. **The main container process no longer has PID 1**. It cannot be signalled
|
||||
using `kill 1`, and attempting to do so will instead signal the
|
||||
infrastructure container and potentially restart the pod. Containers
|
||||
shipping an init system like systemd may [require additional
|
||||
flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669).
|
||||
1. **Processes are visible to other containers in the pod**. This includes all
|
||||
information visible in `/proc`, such as passwords as arguments or
|
||||
environment variables, and process signalling. This can be somewhat
|
||||
mitigated by running processes as separate, non-root users.
|
||||
1. **Container filesystems are visible to other containers in the pod through
|
||||
the <code>/proc/$pid/root</code> magic symlink**. This makes debugging
|
||||
easier, but it also means that secrets are protected only by standard
|
||||
filesystem permissions.
|
||||
|
||||
## Implementation
|
||||
|
||||
### Kubernetes API Changes
|
||||
|
||||
`v1.PodSpec` gains a new field named `ShareProcessNamespace`:
|
||||
|
||||
```
|
||||
// PodSpec is a description of a pod.
|
||||
type PodSpec struct {
|
||||
...
|
||||
// Use the host's pid namespace.
|
||||
// Note that HostPID and ShareProcessNamespace cannot both be set.
|
||||
// Optional: Default to false.
|
||||
// +k8s:conversion-gen=false
|
||||
// +optional
|
||||
HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
|
||||
// Share a single process namespace between all of the containers in a pod.
|
||||
// Note that HostPID and ShareProcessNamespace cannot both be set.
|
||||
// Optional: Default to false.
|
||||
// +k8s:conversion-gen=false
|
||||
// +optional
|
||||
ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"`
|
||||
...
|
||||
```
|
||||
|
||||
The field name deviates from that of HostPID in an attempt to [better signal the
|
||||
consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536)
|
||||
of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will
|
||||
cause a validation error.
|
||||
|
||||
### Container Runtime Interface Changes
|
||||
|
||||
Namespace options in the CRI are currently specified for both `PodSandbox` and
|
||||
`Container` creation requests via booleans in `NamespaceOption`:
|
||||
|
||||
```
|
||||
message NamespaceOption {
|
||||
// If set, use the host's network namespace.
|
||||
bool host_network = 1;
|
||||
// If set, use the host's PID namespace.
|
||||
bool host_pid = 2;
|
||||
// If set, use the host's IPC namespace.
|
||||
bool host_ipc = 3;
|
||||
}
|
||||
```
|
||||
|
||||
We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the
|
||||
existing namespace options:
|
||||
|
||||
```
|
||||
enum NamespaceMode {
|
||||
POD = 0;
|
||||
CONTAINER = 1;
|
||||
NODE = 2;
|
||||
}
|
||||
|
||||
// NamespaceOption provides options for Linux namespaces.
|
||||
message NamespaceOption {
|
||||
// Network namespace for this container/sandbox.
|
||||
// Runtimes must support: POD, NODE
|
||||
NamespaceMode network = 1;
|
||||
// PID namespace for this container/sandbox.
|
||||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
|
||||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
|
||||
// Runtimes must support: POD, CONTAINER, NODE
|
||||
NamespaceMode pid = 2;
|
||||
// IPC namespace for this container/sandbox.
|
||||
// Runtimes must support: POD, NODE
|
||||
NamespaceMode ipc = 3;
|
||||
}
|
||||
```
|
||||
|
||||
Note that this breaks backwards compatibility in the CRI, which is still in
|
||||
alpha.
|
||||
|
||||
The protocol default for a namespace is `POD` because that's the default for
|
||||
network and IPC, and we will consider making it the default for PID in `v2.Pod`.
|
||||
The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so
|
||||
that the default behavior of `v1.Pod` does not change.
|
||||
|
||||
This CRI design allows different namespace configuration for each of the
|
||||
containers in the pod and the sandbox, but currently we have no plans to support
|
||||
this in the Kubernetes API. The kubelet will translate namespace booleans from
|
||||
v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all
|
||||
regular and init containers in a pod.
|
||||
|
||||
#### Targeting a Specific Container's Namespace
|
||||
|
||||
Though we don't intend to support this in general pod configuration, there is a
|
||||
use case for mixed process namespaces within a single pod. [Troubleshooting
|
||||
Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral
|
||||
Debug Container in an existing, running pod. In order for this to be useful we
|
||||
want to share, within the pod, a process namespace between the new container
|
||||
performing the debugging and its existing target container.
|
||||
|
||||
This is done with the additional `NamespaceMode` `TARGET` and field `target_id`:
|
||||
|
||||
```
|
||||
enum NamespaceMode {
|
||||
POD = 0;
|
||||
CONTAINER = 1;
|
||||
NODE = 2;
|
||||
TARGET = 3;
|
||||
}
|
||||
|
||||
// NamespaceOption provides options for Linux namespaces.
|
||||
message NamespaceOption {
|
||||
// Network namespace for this container/sandbox.
|
||||
// Runtimes must support: POD, NODE
|
||||
NamespaceMode network = 1;
|
||||
// PID namespace for this container/sandbox.
|
||||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
|
||||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
|
||||
// Runtimes must support: POD, CONTAINER, NODE, TARGET
|
||||
NamespaceMode pid = 2;
|
||||
// IPC namespace for this container/sandbox.
|
||||
// Runtimes must support: POD, NODE
|
||||
NamespaceMode ipc = 3;
|
||||
// Target Container ID for NamespaceMode of TARGET. This container must be in the
|
||||
// same pod as the target container.
|
||||
string target_id = 4;
|
||||
}
|
||||
```
|
||||
|
||||
When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new
|
||||
container in the namespace used by the container ID in `target_id`. If the
|
||||
target container has `NamespaceOption.pid` set to `POD`, then the new container
|
||||
should also use the pod namespace. If the target container has an isolated
|
||||
process namespace, then the new container will join only that container's
|
||||
namespace. Examples are provided for dockershim below.
|
||||
|
||||
There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It
|
||||
exists for the kubelet to run automation or debugging from a container image in
|
||||
the namespace of an existing pod and container. Additionally, we choose to
|
||||
explicitly not support sharing namespaces between different pods. The kubelet
|
||||
must not generate such a reference, and the runtime should not accept it. That
|
||||
is, for pod{Container `A`, Container `B`, Sandbox `S}` and any other unrelated
|
||||
Container `C`:
|
||||
|
||||
valid `target_id` | invalid `target_id`
|
||||
----------------- | -------------------
|
||||
containerID(A) | sandboxID(S)
|
||||
containerID(B) | containerID(C)
|
||||
|
||||
### dockershim Changes
|
||||
|
||||
The Docker runtime implements the pod sandbox as a container running the pause
|
||||
container image. When configured for `POD` namespace sharing, the PID namespace
|
||||
of the sandbox will become the single PID namespace for the pod. This means a
|
||||
namespace of `POD` and `CONTAINER` are equivalent for the sandbox. The mapping
|
||||
of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod`
|
||||
settings provided as reference):
|
||||
|
||||
ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode
|
||||
--------------------- | ------- | ---------------- | ------------------
|
||||
false | false | CONTAINER | *unset*
|
||||
true | false | POD | *unset*
|
||||
false | true | NODE | "host"
|
||||
\- | \- | TARGET | *Error*
|
||||
|
||||
For _containers_, `HostConfig.PidMode` will be set as follows:
|
||||
|
||||
ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode
|
||||
--------------------- | ------- | ------------------ | ------------------
|
||||
false | false | CONTAINER | *unset*
|
||||
true | false | POD | "container:[sandbox-container-id]"
|
||||
false | true | NODE | "host"
|
||||
false | false | TARGET | "container:[target-container-id]"
|
||||
true | false | TARGET | "container:[sandbox-container-id]"
|
||||
false | true | TARGET | "host"
|
||||
|
||||
If the Docker runtime version does not support sharing pid namespaces, a
|
||||
`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return
|
||||
an error.
|
||||
|
||||
### Deprecation of existing kubelet flag
|
||||
|
||||
SIG Node did not anticipate the strong objections to migrating from isolated to
|
||||
shared process namespaces for Docker. The previous (now abandoned) migration
|
||||
plan introduced a kubelet flag to toggle the shared namespace behavior, but
|
||||
objections did not materialize until the flag had moved from experimental to GA.
|
||||
|
||||
The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use
|
||||
of shared process namespaces for the Docker runtime. We will immediately mark it
|
||||
as deprecated, but according to the [deprecation
|
||||
policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must
|
||||
support it for 6 months.
|
||||
|
||||
We must provide a transition path for users setting this kubelet flag to false.
|
||||
Setting this flag asserts a desire to override the default Kubernetes behavior
|
||||
for all pods. Until the flag is removed, the kubelet will honor this assertion
|
||||
by ignoring the value of `ShareProcessNamespace` and logging a warning to the
|
||||
event log.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### Explicit Container/Sandbox ID Targeting
|
||||
|
||||
Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string
|
||||
that explicitly targets a container or sandbox ID:
|
||||
|
||||
```
|
||||
// NamespaceOption provides options for Linux namespaces.
|
||||
message NamespaceOption {
|
||||
...
|
||||
// ID of Sandbox or Container to use for PID namespace, or "host"
|
||||
string pid = 2;
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This removes the need for a separate `TARGET` mode, but a mode enumeration
|
||||
better captures the intent of the option.
|
||||
|
||||
### Defaulting to PID Namespace Sharing
|
||||
|
||||
Other Kubernetes runtimes already share a single PID namespace between
|
||||
containers in a pod. We could easily change the Docker runtime to always share a
|
||||
PID namespace when supported by the installed Docker version, but this would
|
||||
cause problems for container images that assume they will always be PID 1.
|
||||
|
||||
### Migration to Shared-only Namespaces
|
||||
|
||||
Rather than adding support to the API for configuring namespaces we could allow
|
||||
changing the default behavior with pod annotations with the intention of
|
||||
removing support for isolated PID namespaces in v2.Pod. Many members of the
|
||||
community want to use the isolated namespaces as security boundary between
|
||||
containers in a pod, however.
|
||||
[shared-pid-kep]: https://git.k8s.io/enhancements/keps/sig-node/20190920-pod-pid-namespace.md
|
||||
|
|
Loading…
Reference in New Issue