Require shared PID namespace in CRI & plan rollout

Lee Verberne 2017-01-18 17:27:53 -08:00
parent d3b09aa70d
commit d4789e1112
3 changed files with 79 additions and 71 deletions


@@ -86,7 +86,7 @@ container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*
A container in a PodSandbox maps to an application in the Pod Spec. For Linux
-containers, they are expected to share at least network and IPC namespaces,
+containers, they are expected to share at least network, PID and IPC namespaces,
with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).


@@ -1,70 +0,0 @@
# Shared PID Namespace for the Docker Runtime
Pods share many namespaces, but the ability to share a PID namespace was not
supported by Docker until version 1.12. SIG Node approved a change to the
default behavior contingent on a brief rollout plan, which is this document.
Please refer to [#1615](https://issues.k8s.io/1615) for full technical details.
## Motivation
Sharing a PID namespace is discussed in [#1615](https://issues.k8s.io/1615),
and enables:
1. signaling between containers, which is useful for side cars (e.g. for
signaling a daemon process after rotating logs).
2. easier troubleshooting of pods.
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
infra container.
## Goals and Non-Goals
Goals include:
- Changing default behavior in the Kubernetes Docker runtime
Non-goals include:
- Creating an init solution that works for all runtimes
- Supporting isolated PID namespace indefinitely
- Addressing the larger issue of requiring shared namespaces in all runtimes
Kubernetes does not currently specify how runtimes must support a PID namespace,
but many runtimes (e.g. cri-o & rkt) already support a shared namespace. This
rolls out support for Docker.
## Rollout Plan
Sharing the PID namespace changes an implicit behavior of the Docker runtime
whereby the command run by the container image is always PID 1. This is a side
effect of isolated namespaces rather than intentional behavior, but users may
have built upon this assumption so we should change the default behavior over
the course of multiple releases. (The following release numbers are earliest
possible releases and may change based on implementation and community
feedback.)
1. Release 1.6: Enable the shared PID namespace for pods annotated with
`docker.kubernetes.io/shared-pid: true` (i.e. opt-in) when running with
Docker >= 1.12. Pods with this annotation will fail to start with older
Docker versions rather than failing to meet a user's expectation.
2. Release 1.7: Enable the shared PID namespace for pods unless annotated
with `docker.kubernetes.io/shared-pid: false` (i.e. opt-out) when running
with Docker >= 1.12.
3. Release 1.8: Remove the annotation. All pods receive a shared PID
namespace when running with Docker >= 1.12.
With each step we will add a release note that clearly describes the change.
After each release we will poll kubernetes-users to determine what, if any,
applications were impacted by this change. If we discover a use case which
cannot be accommodated by a shared PID namespace, we will abort step 3 and
instead formalize a shared-pid field into the pod spec.
## Alternatives Considered
Changing this behavior over the course of 6 months is a bit conservative. We
could instead change the behavior in 2 releases by omitting the first step, but
the opt-in phase allows users to test the change with fewer surprises.
[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/


@@ -0,0 +1,78 @@
# Shared PID Namespace
Pods share namespaces where possible, but a requirement for sharing the PID
namespace has not been defined due to lack of support in Docker. Docker began
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt,
cri-o, hyper) have already implemented a shared PID namespace.
This proposal defines a shared PID namespace as a requirement of the Container
Runtime Interface and links its rollout in Docker to that of the CRI.
## Motivation
Sharing a PID namespace is discussed in [#1615](https://issues.k8s.io/1615),
and enables:
1. signaling between containers, which is useful for sidecars (e.g. for
signaling a daemon process after rotating logs).
2. easier troubleshooting of pods.
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
infra container.
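
To make the first item concrete, here is a minimal sketch of how a log-rotation sidecar might signal a daemon in another container once both share a PID namespace. It is an illustration only: the helper names `find_pids_by_comm` and `signal_by_comm` are hypothetical, and it assumes the sidecar can read `/proc` and has permission to signal the target process.

```python
import os
import signal


def find_pids_by_comm(name, proc_root="/proc"):
    """Return the PIDs visible under proc_root whose command name equals name.

    With a shared PID namespace, /proc in one container also lists processes
    started by the other containers in the pod. Note that the kernel truncates
    the value in /proc/<pid>/comm to 15 characters.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the process exited while we were scanning
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)


def signal_by_comm(name, sig=signal.SIGHUP, proc_root="/proc"):
    """Send sig to every process named name and return the PIDs signalled."""
    signalled = []
    for pid in find_pids_by_comm(name, proc_root):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # the process exited between the scan and the signal
        signalled.append(pid)
    return signalled
```

A sidecar could call `signal_by_comm("nginx")` after rotating logs, where `nginx` stands in for whatever daemon the pod runs. With isolated PID namespaces the scan would find nothing, because the daemon would not be visible from the sidecar.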
## Goals and Non-Goals
Goals include:
- Changing default behavior in the Docker runtime as implemented by the CRI
- Making Docker behavior compatible with the other Kubernetes runtimes
Non-goals include:
- Creating an init solution that works for all runtimes
- Supporting isolated PID namespace indefinitely
## Modification to the Docker Runtime
We will modify the Docker implementation of the CRI to use a shared PID
namespace when running with a version of Docker >= 1.12. The legacy
`dockertools` implementation will not be changed.
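
The version gate described above can be sketched as follows. This is not the CRI implementation: the function names are hypothetical, and the sketch assumes the shared namespace is requested through Docker's `container:<id>` PID mode (the value accepted by `docker run --pid`), with the PodSandbox infra container as the namespace source.

```python
import re

# Assumption taken from the text: shared PID namespaces require Docker >= 1.12.
MIN_SHARED_PID_VERSION = (1, 12)


def parse_docker_version(version):
    """Return (major, minor) from a version string such as '1.12.6' or '1.13.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized Docker version: %r" % version)
    return (int(match.group(1)), int(match.group(2)))


def pid_mode_for_container(docker_version, sandbox_container_id):
    """Return the PID mode to request for an application container.

    'container:<id>' joins the PID namespace of the sandbox (infra) container.
    The empty string leaves Docker's default, an isolated PID namespace.
    """
    if parse_docker_version(docker_version) >= MIN_SHARED_PID_VERSION:
        return "container:" + sandbox_container_id
    return ""
```

For example, `pid_mode_for_container("1.11.2", "abc123")` returns an empty string, so pods on older Docker versions keep their existing behavior.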
Linking this change to the CRI lets Kubernetes users who want to test such
changes test both changes together. Users who do not want to test them will be
insulated, because Kubernetes will not recommend Docker >= 1.12 until after the
switch to the CRI.
Other changes that must be made to support this change:
1. Ensure all containers restart if the infra container responsible for the
PodSandbox dies. (Note: With Docker 1.12, if the source of the PID namespace
dies, all containers sharing that namespace are killed as well.)
2. Modify the Infra container used by the Docker runtime to reap orphaned
zombies ([#36853](https://pr.k8s.io/36853)).
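
The reaping in the second item relies on the kernel reparenting orphaned processes to PID 1 of the PID namespace, which in a shared namespace is the infra container. The sketch below shows the core non-blocking reap loop. It is not the change made in #36853, and the function name `reap_zombies` is hypothetical.

```python
import os


def reap_zombies():
    """Reap every child that has already exited, without blocking.

    Returns the list of reaped PIDs. When run as PID 1 of a PID namespace,
    the children of this process include orphans reparented by the kernel.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An init-style process would typically call a function like this whenever it receives `SIGCHLD`, so zombies are collected as soon as they appear instead of accumulating for the lifetime of the pod.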
## Rollout Plan
SIG Node is planning to switch to the CRI as a default in 1.6, at which point
users with Docker >= 1.12 will be able to test shared PID namespaces. Switching
back to isolated PID namespaces will require disabling the CRI.
At some point, say 1.7, SIG Node will remove support for disabling the CRI.
After this point users must roll back to a previous version of Kubernetes or
Docker to achieve PID namespace isolation. This is acceptable because:
* No one has been able to identify a concrete use case requiring isolated PID
namespaces.
* The lack of use cases means we can't justify the complexity required to make
PID namespace type configurable.
* Users will already be looking for issues due to the major version upgrade and
prepared for a rollback to the previous release.
Alternatively, we could create a flag in the kubelet to disable shared PID
namespace, but this wouldn't be especially useful to users of a hosted
Kubernetes cluster.
[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/