Redirect shareProcessNamespace proposal to KEP
This commit is contained in:
parent
5d62001346
commit
0cfe22d04a
|
@ -1,319 +1,10 @@
|
||||||
# Shared PID Namespace
|
# Shared PID Namespace
|
||||||
|
|
||||||
* Status: Pending
|
* Status: Superseded
|
||||||
* Version: Alpha
|
* Version: N/A
|
||||||
* Implementation Owner: [@verb](https://github.com/verb)
|
* Implementation Owner: [@verb](https://github.com/verb)
|
||||||
|
|
||||||
## Motivation
|
The Shared PID Namespace proposal has moved to the
|
||||||
|
[Shared PID Namespace KEP][shared-pid-kep].
|
||||||
|
|
||||||
Pods share namespaces where possible, but support for sharing the PID namespace
|
[shared-pid-kep]: https://git.k8s.io/enhancements/keps/sig-node/20190920-pod-pid-namespace.md
|
||||||
had not been defined due to lack of support in Docker. This created an implicit
|
|
||||||
API on which certain container images now rely. This document proposes adding
|
|
||||||
support for sharing a process namespace between containers in a pod while
|
|
||||||
maintaining backwards compatibility with the existing implicit API.
|
|
||||||
|
|
||||||
## Proposal
|
|
||||||
|
|
||||||
### Goals and Non-Goals
|
|
||||||
|
|
||||||
Goals include:
|
|
||||||
|
|
||||||
* Backwards compatibility with container images expecting `pid == 1` semantics
|
|
||||||
* Per-pod configuration of PID namespace sharing
|
|
||||||
* Ability to change default sharing behavior in `v2.Pod`
|
|
||||||
|
|
||||||
Non-goals include:
|
|
||||||
|
|
||||||
* Creating a general purpose container init solution
|
|
||||||
* Multiple shared PID namespaces per pod
|
|
||||||
* Per-container configuration of PID namespace sharing
|
|
||||||
|
|
||||||
### Summary
|
|
||||||
|
|
||||||
We will add support for configuring pod-shared process namespaces by adding a
|
|
||||||
new boolean field `ShareProcessNamespace` to the pod spec. The default to false
|
|
||||||
means that each container will have a separate process namespace. When set to
|
|
||||||
true, all containers in the pod will share a single process namespace.
|
|
||||||
|
|
||||||
The Container Runtime Interface (CRI) will be updated to support three namespace
|
|
||||||
modes: Container, Pod & Node. The Runtime Manager will translate the pod spec
|
|
||||||
into one of these modes as follows:
|
|
||||||
|
|
||||||
Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode
|
|
||||||
--------------------------- | ------------- | ------------
|
|
||||||
false | false | Container
|
|
||||||
false | true | Node
|
|
||||||
true | false | Pod
|
|
||||||
true | true | *Error*
|
|
||||||
|
|
||||||
If a runtime does not implement a particular PID mode, it must return an error.
|
|
||||||
For reference, Docker will support all three modes when using version >= 1.13.1.
|
|
||||||
|
|
||||||
The shared PID functionality will be hidden behind a new feature gate in both
|
|
||||||
the API server and the kubelet, and the existing `--docker-disable-shared-pid`
|
|
||||||
flag will be removed from the kubelet, subject to [deprecation
|
|
||||||
policy](https://kubernetes.io/docs/reference/deprecation-policy/).
|
|
||||||
|
|
||||||
## User Experience
|
|
||||||
|
|
||||||
### Use Cases
|
|
||||||
|
|
||||||
Sharing a PID namespace between containers in a pod is discussed in
|
|
||||||
[#1615](https://issues.k8s.io/1615) and enables:
|
|
||||||
|
|
||||||
1. signaling between containers, which is useful for side cars (e.g. for
|
|
||||||
signaling a daemon process after rotating logs).
|
|
||||||
1. easier troubleshooting of pods.
|
|
||||||
1. addressing [Docker's zombie
|
|
||||||
problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)
|
|
||||||
by reaping orphaned zombies in the infra container.
|
|
||||||
|
|
||||||
### Behavioral Changes
|
|
||||||
|
|
||||||
Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's
|
|
||||||
a significant departure from the traditional behavior of Docker. This may break
|
|
||||||
container images and development patterns that have come to rely on process
|
|
||||||
isolation. Notably:
|
|
||||||
|
|
||||||
1. **The main container process no longer has PID 1**. It cannot be signalled
|
|
||||||
using `kill 1`, and attempting to do so will instead signal the
|
|
||||||
infrastructure container and potentially restart the pod. Containers
|
|
||||||
shipping an init system like systemd may [require additional
|
|
||||||
flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669).
|
|
||||||
1. **Processes are visible to other containers in the pod**. This includes all
|
|
||||||
information visible in `/proc`, such as passwords as arguments or
|
|
||||||
environment variables, and process signalling. This can be somewhat
|
|
||||||
mitigated by running processes as separate, non-root users.
|
|
||||||
1. **Container filesystems are visible to other containers in the pod through
|
|
||||||
the <code>/proc/$pid/root</code> magic symlink**. This makes debugging
|
|
||||||
easier, but it also means that secrets are protected only by standard
|
|
||||||
filesystem permissions.
|
|
||||||
|
|
||||||
## Implementation
|
|
||||||
|
|
||||||
### Kubernetes API Changes
|
|
||||||
|
|
||||||
`v1.PodSpec` gains a new field named `ShareProcessNamespace`:
|
|
||||||
|
|
||||||
```
|
|
||||||
// PodSpec is a description of a pod.
|
|
||||||
type PodSpec struct {
|
|
||||||
...
|
|
||||||
// Use the host's pid namespace.
|
|
||||||
// Note that HostPID and ShareProcessNamespace cannot both be set.
|
|
||||||
// Optional: Default to false.
|
|
||||||
// +k8s:conversion-gen=false
|
|
||||||
// +optional
|
|
||||||
HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
|
|
||||||
// Share a single process namespace between all of the containers in a pod.
|
|
||||||
// Note that HostPID and ShareProcessNamespace cannot both be set.
|
|
||||||
// Optional: Default to false.
|
|
||||||
// +k8s:conversion-gen=false
|
|
||||||
// +optional
|
|
||||||
ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"`
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
The field name deviates from that of HostPID in an attempt to [better signal the
|
|
||||||
consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536)
|
|
||||||
of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will
|
|
||||||
cause a validation error.
|
|
||||||
|
|
||||||
### Container Runtime Interface Changes
|
|
||||||
|
|
||||||
Namespace options in the CRI are currently specified for both `PodSandbox` and
|
|
||||||
`Container` creation requests via booleans in `NamespaceOption`:
|
|
||||||
|
|
||||||
```
|
|
||||||
message NamespaceOption {
|
|
||||||
// If set, use the host's network namespace.
|
|
||||||
bool host_network = 1;
|
|
||||||
// If set, use the host's PID namespace.
|
|
||||||
bool host_pid = 2;
|
|
||||||
// If set, use the host's IPC namespace.
|
|
||||||
bool host_ipc = 3;
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the
|
|
||||||
existing namespace options:
|
|
||||||
|
|
||||||
```
|
|
||||||
enum NamespaceMode {
|
|
||||||
POD = 0;
|
|
||||||
CONTAINER = 1;
|
|
||||||
NODE = 2;
|
|
||||||
}
|
|
||||||
|
|
||||||
// NamespaceOption provides options for Linux namespaces.
|
|
||||||
message NamespaceOption {
|
|
||||||
// Network namespace for this container/sandbox.
|
|
||||||
// Runtimes must support: POD, NODE
|
|
||||||
NamespaceMode network = 1;
|
|
||||||
// PID namespace for this container/sandbox.
|
|
||||||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
|
|
||||||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
|
|
||||||
// Runtimes must support: POD, CONTAINER, NODE
|
|
||||||
NamespaceMode pid = 2;
|
|
||||||
// IPC namespace for this container/sandbox.
|
|
||||||
// Runtimes must support: POD, NODE
|
|
||||||
NamespaceMode ipc = 3;
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Note that this breaks backwards compatibility in the CRI, which is still in
|
|
||||||
alpha.
|
|
||||||
|
|
||||||
The protocol default for a namespace is `POD` because that's the default for
|
|
||||||
network and IPC, and we will consider making it the default for PID in `v2.Pod`.
|
|
||||||
The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so
|
|
||||||
that the default behavior of `v1.Pod` does not change.
|
|
||||||
|
|
||||||
This CRI design allows different namespace configuration for each of the
|
|
||||||
containers in the pod and the sandbox, but currently we have no plans to support
|
|
||||||
this in the Kubernetes API. The kubelet will translate namespace booleans from
|
|
||||||
v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all
|
|
||||||
regular and init containers in a pod.
|
|
||||||
|
|
||||||
#### Targeting a Specific Container's Namespace
|
|
||||||
|
|
||||||
Though we don't intend to support this in general pod configuration, there is a
|
|
||||||
use case for mixed process namespaces within a single pod. [Troubleshooting
|
|
||||||
Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral
|
|
||||||
Debug Container in an existing, running pod. In order for this to be useful we
|
|
||||||
want to share, within the pod, a process namespace between the new container
|
|
||||||
performing the debugging and its existing target container.
|
|
||||||
|
|
||||||
This is done with the additional `NamespaceMode` `TARGET` and field `target_id`:
|
|
||||||
|
|
||||||
```
|
|
||||||
enum NamespaceMode {
|
|
||||||
POD = 0;
|
|
||||||
CONTAINER = 1;
|
|
||||||
NODE = 2;
|
|
||||||
TARGET = 3;
|
|
||||||
}
|
|
||||||
|
|
||||||
// NamespaceOption provides options for Linux namespaces.
|
|
||||||
message NamespaceOption {
|
|
||||||
// Network namespace for this container/sandbox.
|
|
||||||
// Runtimes must support: POD, NODE
|
|
||||||
NamespaceMode network = 1;
|
|
||||||
// PID namespace for this container/sandbox.
|
|
||||||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
|
|
||||||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
|
|
||||||
// Runtimes must support: POD, CONTAINER, NODE, TARGET
|
|
||||||
NamespaceMode pid = 2;
|
|
||||||
// IPC namespace for this container/sandbox.
|
|
||||||
// Runtimes must support: POD, NODE
|
|
||||||
NamespaceMode ipc = 3;
|
|
||||||
// Target Container ID for NamespaceMode of TARGET. This container must be in the
|
|
||||||
// same pod as the target container.
|
|
||||||
string target_id = 4;
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new
|
|
||||||
container in the namespace used by the container ID in `target_id`. If the
|
|
||||||
target container has `NamespaceOption.pid` set to `POD`, then the new container
|
|
||||||
should also use the pod namespace. If the target container has an isolated
|
|
||||||
process namespace, then the new container will join only that container's
|
|
||||||
namespace. Examples are provided for dockershim below.
|
|
||||||
|
|
||||||
There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It
|
|
||||||
exists for the kubelet to run automation or debugging from a container image in
|
|
||||||
the namespace of an existing pod and container. Additionally, we choose to
|
|
||||||
explicitly not support sharing namespaces between different pods. The kubelet
|
|
||||||
must not generate such a reference, and the runtime should not accept it. That
|
|
||||||
is, for pod{Container `A`, Container `B`, Sandbox `S}` and any other unrelated
|
|
||||||
Container `C`:
|
|
||||||
|
|
||||||
valid `target_id` | invalid `target_id`
|
|
||||||
----------------- | -------------------
|
|
||||||
containerID(A) | sandboxID(S)
|
|
||||||
containerID(B) | containerID(C)
|
|
||||||
|
|
||||||
### dockershim Changes
|
|
||||||
|
|
||||||
The Docker runtime implements the pod sandbox as a container running the pause
|
|
||||||
container image. When configured for `POD` namespace sharing, the PID namespace
|
|
||||||
of the sandbox will become the single PID namespace for the pod. This means a
|
|
||||||
namespace of `POD` and `CONTAINER` are equivalent for the sandbox. The mapping
|
|
||||||
of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod`
|
|
||||||
settings provided as reference):
|
|
||||||
|
|
||||||
ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode
|
|
||||||
--------------------- | ------- | ---------------- | ------------------
|
|
||||||
false | false | CONTAINER | *unset*
|
|
||||||
true | false | POD | *unset*
|
|
||||||
false | true | NODE | "host"
|
|
||||||
\- | \- | TARGET | *Error*
|
|
||||||
|
|
||||||
For _containers_, `HostConfig.PidMode` will be set as follows:
|
|
||||||
|
|
||||||
ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode
|
|
||||||
--------------------- | ------- | ------------------ | ------------------
|
|
||||||
false | false | CONTAINER | *unset*
|
|
||||||
true | false | POD | "container:[sandbox-container-id]"
|
|
||||||
false | true | NODE | "host"
|
|
||||||
false | false | TARGET | "container:[target-container-id]"
|
|
||||||
true | false | TARGET | "container:[sandbox-container-id]"
|
|
||||||
false | true | TARGET | "host"
|
|
||||||
|
|
||||||
If the Docker runtime version does not support sharing pid namespaces, a
|
|
||||||
`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return
|
|
||||||
an error.
|
|
||||||
|
|
||||||
### Deprecation of existing kubelet flag
|
|
||||||
|
|
||||||
SIG Node did not anticipate the strong objections to migrating from isolated to
|
|
||||||
shared process namespaces for Docker. The previous (now abandoned) migration
|
|
||||||
plan introduced a kubelet flag to toggle the shared namespace behavior, but
|
|
||||||
objections did not materialize until the flag had moved from experimental to GA.
|
|
||||||
|
|
||||||
The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use
|
|
||||||
of shared process namespaces for the Docker runtime. We will immediately mark it
|
|
||||||
as deprecated, but according to the [deprecation
|
|
||||||
policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must
|
|
||||||
support it for 6 months.
|
|
||||||
|
|
||||||
We must provide a transition path for users setting this kubelet flag to false.
|
|
||||||
Setting this flag asserts a desire to override the default Kubernetes behavior
|
|
||||||
for all pods. Until the flag is removed, the kubelet will honor this assertion
|
|
||||||
by ignoring the value of `ShareProcessNamespace` and logging a warning to the
|
|
||||||
event log.
|
|
||||||
|
|
||||||
## Alternatives Considered
|
|
||||||
|
|
||||||
### Explicit Container/Sandbox ID Targeting
|
|
||||||
|
|
||||||
Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string
|
|
||||||
that explicitly targets a container or sandbox ID:
|
|
||||||
|
|
||||||
```
|
|
||||||
// NamespaceOption provides options for Linux namespaces.
|
|
||||||
message NamespaceOption {
|
|
||||||
...
|
|
||||||
// ID of Sandbox or Container to use for PID namespace, or "host"
|
|
||||||
string pid = 2;
|
|
||||||
...
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
This removes the need for a separate `TARGET` mode, but a mode enumeration
|
|
||||||
better captures the intent of the option.
|
|
||||||
|
|
||||||
### Defaulting to PID Namespace Sharing
|
|
||||||
|
|
||||||
Other Kubernetes runtimes already share a single PID namespace between
|
|
||||||
containers in a pod. We could easily change the Docker runtime to always share a
|
|
||||||
PID namespace when supported by the installed Docker version, but this would
|
|
||||||
cause problems for container images that assume they will always be PID 1.
|
|
||||||
|
|
||||||
### Migration to Shared-only Namespaces
|
|
||||||
|
|
||||||
Rather than adding support to the API for configuring namespaces we could allow
|
|
||||||
changing the default behavior with pod annotations with the intention of
|
|
||||||
removing support for isolated PID namespaces in v2.Pod. Many members of the
|
|
||||||
community want to use the isolated namespaces as security boundary between
|
|
||||||
containers in a pod, however.
|
|
||||||
|
|
Loading…
Reference in New Issue