# Pod Safety, Consistency Guarantees, and Storage Implications

@smarterclayton @bprashanth

October 2016

## Proposal and Motivation

A pod represents the finite execution of one or more related processes on the
cluster. To ensure that higher-level controllers can safely build consistent
systems on top of pods, the exact guarantees around a pod's lifecycle on the
cluster must be clarified, and it must be possible for higher-order
controllers and application authors to correctly reason about the lifetime of
those processes and their access to cluster resources in a distributed
computing environment.

To run most clustered software on Kubernetes, it must be possible to
guarantee **at most one** running instance of a particular pet pod at any
time on the cluster. This allows the controller to prevent multiple
processes from accessing shared cluster resources while each believes it is
the same entity. When a node containing a pet is partitioned, the Pet Set
must remain consistent (no new entity will be spawned) but may become
unavailable (the cluster no longer has a sufficient number of members). The
Pet Set guarantee must be strong enough for an administrator to reason about
the state of the cluster by observing the Kubernetes API.

In order to reconcile partitions, an actor (human or automated) must decide
when the partition is unrecoverable. The actor may be informed of the failure
in an unambiguous way (e.g. the node was destroyed by a meteor), allowing for
certainty that the processes on that node are terminated, and thus may
resolve the partition by deleting the node and the pods on the node.
Alternatively, the actor may take steps to ensure the partitioned node cannot
return to the cluster or access shared resources - this is known as
**fencing** and is a well understood domain.

This proposal covers the changes necessary to ensure:

* Pet Sets can ensure **at most one** semantics for each individual pet
* Other system components such as the node and namespace controllers can
  safely perform their responsibilities without violating that guarantee
* An administrator or higher-level controller can signal that a node
  partition is permanent, allowing the Pet Set controller to proceed
* A fencing controller can take corrective action automatically to heal
  partitions

We will accomplish this by:

* Clarifying which components are allowed to force delete pods (as opposed
  to merely requesting termination)
* Ensuring system components can observe partitioned pods and nodes
  correctly
* Defining how a fencing controller could safely interoperate with
  partitioned nodes and pods to heal partitions
* Describing how shared storage components without innate safety
  guarantees can be safely shared on the cluster


### Current Guarantees for Pod lifecycle

The existing pod model provides the following guarantees:

* A pod is executed on exactly one node
* A pod has the following lifecycle phases:
  * Creation
  * Scheduling
  * Execution
    * Init containers
    * Application containers
  * Termination
  * Deletion
* A pod can only move through its phases in order, and may not return
  to an earlier phase
* A user may specify an interval on the pod called the **termination
  grace period** that defines the minimum amount of time the pod will
  have to complete the termination phase, and all components will honor
  this interval
* Once a pod begins termination, its termination grace period can only
  be shortened, not lengthened

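As a concrete illustration of where the grace period lives, the Go sketch
below builds a pod object with an explicit termination grace period using the
Kubernetes API types (the pod name and image are placeholders):

```go
// Sketch only: constructs a pod with a 30 second termination grace
// period; every component that stops this pod must honor the interval.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	grace := int64(30) // minimum seconds the pod has to terminate cleanly
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "web-0"},
		Spec: v1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []v1.Container{
				{Name: "app", Image: "nginx"},
			},
		},
	}
	fmt.Println(pod.Name, *pod.Spec.TerminationGracePeriodSeconds)
}
```
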
Pod termination is divided into the following steps:

* A component requests the termination of the pod by issuing a DELETE
  to the pod resource with an optional **grace period**
  * If no grace period is provided, the default from the pod is leveraged
* When the kubelet observes the deletion, it starts a timer equal to the
  grace period and performs the following actions:
  * Executes the pre-stop hook, if specified, waiting up to **grace period**
    seconds before continuing
  * Sends the termination signal to the container runtime (SIGTERM or the
    container image's STOPSIGNAL on Docker)
  * Waits 2 seconds, or the remaining grace period, whichever is longer
  * Sends the force termination signal to the container runtime (SIGKILL)
* Once the kubelet observes the container is fully terminated, it issues
  a status update to the REST API for the pod indicating termination, then
  issues a DELETE with grace period = 0

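The timing above can be summarized in a short Go sketch. This is an
illustrative simulation, not kubelet source; the hook and signal calls are
stubbed placeholders:

```go
// Illustrative simulation of the kubelet's termination timing; the
// hook and signal functions are stand-ins, not real kubelet APIs.
package main

import (
	"fmt"
	"time"
)

// The kubelet always waits at least this long between SIGTERM and SIGKILL.
const minKillDelay = 2 * time.Second

func execPreStopHook(limit time.Duration) { fmt.Println("pre-stop hook, up to", limit) }
func sendTERM()                           { fmt.Println("SIGTERM (or the image's STOPSIGNAL)") }
func sendKILL()                           { fmt.Println("SIGKILL") }

func terminatePod(gracePeriod time.Duration) {
	deadline := time.Now().Add(gracePeriod)

	// Run the pre-stop hook, waiting up to the full grace period.
	execPreStopHook(gracePeriod)

	// Signal the container runtime to begin shutdown.
	sendTERM()

	// Wait the remaining grace period, but never less than 2 seconds.
	wait := time.Until(deadline)
	if wait < minKillDelay {
		wait = minKillDelay
	}
	time.Sleep(wait)

	// Force-kill anything still running, then report termination and
	// issue the final DELETE with grace period = 0 (not shown).
	sendKILL()
}

func main() { terminatePod(30 * time.Second) }
```
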
If the kubelet crashes during the termination process, it will restart the
termination process from the beginning (the grace period is reset). This
ensures that a process is always given **at least** its grace period to
terminate cleanly.

A user may re-issue a DELETE to the pod resource specifying a shorter grace
period, but never a longer one.

Deleting a pod with grace period 0 is called **force deletion** and will
update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
remove the pod from etcd. Because all communication is asynchronous, force
deleting a pod means that the pod's processes may continue to run for an
arbitrary amount of time. If a higher-level component like the StatefulSet
controller treats the existence of the pod API object as a strongly
consistent entity, deleting the pod in this fashion will violate the
at-most-one guarantee we wish to offer for pet sets.


### Guarantees provided by replica sets and replication controllers

ReplicaSets and ReplicationControllers both attempt to **preserve
availability** of their constituent pods over ensuring at-most-one semantics
for any given pod. A replica set at scale 1 will immediately create a new pod
when it observes an old pod has begun graceful deletion, and as a result at
many points in the lifetime of a replica set there will be two copies of a
pod's processes running concurrently. Only access to exclusive resources like
storage can prevent that simultaneous execution.

Deployments, being based on replica sets, can offer no stronger guarantee.


### Concurrent access guarantees for shared storage

A persistent volume that references a strongly consistent storage backend
like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
API to prevent corruption of the data due to simultaneous access by multiple
clients. However, many storage technologies commonly deployed in the
enterprise offer no such consistency guarantee, or much weaker variants, and
rely on complex systems to control which clients may access the storage.

If a PV is assigned an iSCSI, Fibre Channel, or NFS mount point and that PV
is used by two pods on different nodes simultaneously, concurrent access may
result in corruption, even if the PV or PVC is identified as "read write
once". PVC consumers must ensure these volume types are *never* referenced
from multiple pods without some external synchronization. As described above,
it is not safe to use persistent volumes that lack RWO enforcement with a
replica set or deployment, even at scale 1.


## Proposed changes

### Avoid multiple instances of pods

To ensure that the Pet Set controller can safely use pods and ensure at most
one pod instance is running on the cluster at any time for a given pod name,
it must be possible to make pod deletion strongly consistent. A short client
sketch illustrating the resulting deletion semantics follows the list below.

To do that, we will:

* Give the Kubelet sole responsibility for normal deletion of pods -
  only the Kubelet in the course of normal operation should ever remove a
  pod from etcd (only the Kubelet should force delete)
  * The kubelet must not delete the pod until all processes are confirmed
    terminated
  * The kubelet SHOULD ensure all consumed resources on the node are freed
    before deleting the pod
* Application owners must be free to force delete pods, but they *must*
  understand the implications of doing so, and all client UIs must be able
  to communicate those implications
  * Force deleting a pod may cause data loss (two instances of the same
    pod process may be running at the same time)
* All existing controllers in the system must be limited to signaling pod
  termination (starting graceful deletion), and are not allowed to force
  delete a pod
  * The node controller will no longer be allowed to force delete pods -
    it may only signal deletion by beginning (but not completing) a
    graceful deletion
  * The GC controller may not force delete pods
  * The namespace controller used to force delete pods, but no longer
    does so. This means a node partition can block namespace deletion
    indefinitely.
  * The pod GC controller may continue to force delete pods on nodes that
    no longer exist if we treat node deletion as confirming permanent
    partition. If we do not, the pod GC controller must not force delete
    pods.
* It must be possible for an administrator to effectively resolve partitions
  manually to allow namespace deletion
* Deleting a node from etcd should be seen as a signal to the cluster that
  the node is permanently partitioned. We must audit existing components
  to verify this is the case.
  * The PodGC controller has primary responsibility for this - it already
    owns the responsibility to delete pods on nodes that do not exist, and
    so is allowed to force delete pods on nodes that do not exist
  * The PodGC controller must therefore always be running, and will be
    changed to run unconditionally in a >=1.5 cluster

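The sketch referenced above contrasts signaling termination with force
deletion using client-go (assuming a recent client-go; kubeconfig loading and
error handling are simplified, and the pod name is a placeholder):

```go
// Sketch: graceful deletion signals termination and leaves removal to
// the kubelet; force deletion removes the API object immediately.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(config).CoreV1().Pods("default")

	// Signal termination: the pod enters graceful deletion, and only the
	// kubelet removes it once all processes are confirmed terminated.
	thirty := int64(30)
	_ = pods.Delete(context.TODO(), "web-0", metav1.DeleteOptions{
		GracePeriodSeconds: &thirty,
	})

	// Force delete: the object disappears at once, but the processes may
	// keep running, so the at-most-one guarantee is forfeited.
	zero := int64(0)
	_ = pods.Delete(context.TODO(), "web-0", metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```
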
In the above scheme, force deleting a pod releases the lock on that pod and
allows higher-level components to proceed to create a replacement.

It has been requested that force deletion be restricted to privileged users.
But that would prevent application owners from resolving partitions even when
they understand the consequences of force deletion, and not all application
owners will be privileged users. For example, a user may be running a 3-node
etcd cluster in a pet set. If pet 2 becomes partitioned, the user can
instruct etcd to remove pet 2 from the cluster (via direct etcd membership
calls), and because a quorum exists pets 0 and 1 can safely accept that
action. The user can then force delete pet 2, and the pet set controller will
be able to recreate that pet on another node and have it join the cluster
safely (pets 0 and 1 constitute a quorum for membership change).

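For illustration, the membership step in that example could be performed with
the etcd v3 client; the endpoints and member ID below are placeholders:

```go
// Sketch: remove the partitioned member from the etcd cluster before
// force deleting its pod; members 0 and 1 can commit this change.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://pet-0:2379", "http://pet-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Placeholder ID for pet 2; in practice it comes from MemberList.
	const pet2ID = uint64(0xD00D)
	if _, err := cli.MemberRemove(context.TODO(), pet2ID); err != nil {
		panic(err)
	}
}
```
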
This proposal does not alter the behavior of finalizers - instead, it makes
finalizers unnecessary for common application cases (because the cluster only
deletes pods when it is safe).


### Fencing

The changes above allow Pet Sets to ensure at-most-one pod, but provide no
recourse for the automatic resolution of cluster partitions during normal
operation. For that, we propose a **fencing controller** which exists above
the current controller plane and is capable of detecting and automatically
resolving partitions. The fencing controller is an agent empowered to make
the same sorts of decisions a human administrator would make to resolve
partitions, and to take corresponding steps to prevent a dead machine from
coming back to life automatically.

Fencing controllers most benefit services that are not innately replicated,
by reducing the time it takes to detect the failure of a node or process,
isolate that node or process so it can neither initiate nor receive
communication with clients, and then spawn a replacement process. It is
expected that many StatefulSets of size 1 would prefer to be fenced, given
that most real-world applications of size 1 have no alternative for HA
except reducing mean time to recovery.

While the methods and algorithms may vary, the basic pattern would be:

1. Detect a partitioned pod or node via the Kubernetes API or via external
   means.
2. Decide whether the partition justifies fencing based on priority, policy,
   or service availability requirements.
3. Fence the node or any connected storage using appropriate mechanisms.

For this proposal we only describe the general shape of detection and how
existing Kubernetes components can be leveraged for policy, while the exact
implementation and mechanisms for fencing are left to a future proposal. A
future fencing controller would be able to leverage a number of systems,
including but not limited to:

* Cloud control plane APIs such as machine force shutdown
* Additional agents running on each host to force kill processes or trigger
  reboots
* Agents integrated with or communicating with the hypervisors running hosts
  to stop VMs
* Hardware IPMI interfaces to reboot a host
* Rack-level power units to power cycle a blade
* Network routers, backplane switches, software defined networks, or system
  firewalls
* Storage server APIs to block client access

to appropriately limit the ability of the partitioned system to impact the
cluster. Fencing agents today use many of these mechanisms to allow the
system to make progress in the event of failure. The key contribution of
Kubernetes is to define a strongly consistent pattern whereby fencing agents
can be plugged in.

To allow users, clients, and automated systems like the fencing controller to
observe partitions, we propose adding a responsibility to the node controller
(or any future controller that attempts to detect partitions): the node
controller should add a condition to pods it has begun terminating because
their node failed to heartbeat, indicating that the cause of the deletion was
a node partition.

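A minimal sketch of recording that condition, assuming a `NodeDown` condition
type (an illustrative name, not an existing API constant) and a recent
client-go:

```go
// Sketch: a partition-detecting controller records on the pod that its
// node stopped heartbeating, so users and a fencing controller can react.
package fencing

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// MarkNodeDown appends the hypothetical NodeDown condition and writes it
// back through the pod status endpoint.
func MarkNodeDown(ctx context.Context, client kubernetes.Interface, pod *v1.Pod) error {
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodConditionType("NodeDown"), // hypothetical type
		Status:             v1.ConditionTrue,
		Reason:             "NodeUnresponsive",
		Message:            "the pod's node has failed to heartbeat",
		LastTransitionTime: metav1.Now(),
	})
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```
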
It may be desirable for users to be able to request fencing when they suspect
a component is malfunctioning. That is outside the scope of this proposal,
but it would allow administrators to take an action that is safer than force
deletion, and decide at the end whether to force delete.

How the fencing controller decides to fence is left undefined, but it is
likely it could use a combination of pod forgiveness (as a signal of how much
disruption a pod author is likely to accept) and the pod disruption budget
(as a measurement of the amount of disruption already undergone) to gauge how
much latency between failure and fencing the application is willing to
tolerate. Likewise, it can use its own understanding of the latency of the
various failure detectors - the node controller, any hypothetical information
it gathers from service proxies or node peers, any heartbeat agents in the
system - to describe an upper bound on reaction time.


### Storage Consistency

To ensure that shared storage without implicit locking can be made safe for
RWO access, the Kubernetes storage subsystem should leverage the strong
consistency available through the API server and prevent concurrent use of
some types of persistent volumes. By leveraging existing concepts, we can
allow the scheduler and the kubelet to enforce a guarantee that an RWO volume
is in use on at most one node at a time.

In order to properly support region- and zone-specific storage, Kubernetes
adds node selector restrictions, derived from the persistent volume, to pods.
Expanding this concept to volume types that have no external metadata to read
(NFS, iSCSI) may result in adding a label selector to PVs that defines the
nodes the storage is allowed to run on (this is a common requirement for
iSCSI, FibreChannel, or NFS clusters).

Because all nodes in a Kubernetes cluster possess a special node name label,
it would be possible for a controller to observe the scheduling decision of a
pod using an unsafe volume and "attach" that volume to the node, and likewise
to observe the deletion of the pod and "detach" the volume from the node. The
node would then require that these unsafe volumes be "attached" before
allowing pod execution. Attach and detach may be recorded on the PVC or PV as
a new field or materialized via the selection labels.

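A sketch of how such a compare-and-swap "attach" could look if the attachment
were materialized as an annotation on the PV (the annotation key and helper
are hypothetical); optimistic concurrency on `resourceVersion` turns the
update into a CAS:

```go
// Sketch: claim a PV for a single node; a concurrent writer causes a
// 409 Conflict on Update instead of a lost update.
package attach

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const attachedNodeKey = "example.io/attached-node" // hypothetical key

// Attach records that the volume is in use by one node.
func Attach(ctx context.Context, client kubernetes.Interface, pvName, node string) error {
	pv, err := client.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if owner := pv.Annotations[attachedNodeKey]; owner != "" && owner != node {
		return fmt.Errorf("pv %s is still attached to node %s", pvName, owner)
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[attachedNodeKey] = node
	// The update carries the resourceVersion read above, so it only
	// succeeds if no one else modified the PV in the meantime.
	_, err = client.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
	return err
}
```
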
Possible sequence of operations:

1. The cluster administrator creates an RWO iSCSI persistent volume,
   available only to nodes with the label selector `storagecluster=iscsi-1`
2. A user requests an RWO volume and is bound to the iSCSI volume
3. The user creates a pod referencing the PVC
4. The scheduler observes that the pod must schedule on nodes with
   `storagecluster=iscsi-1` (alternatively this could be enforced in
   admission) and binds it to node `A`
5. The kubelet on node `A` observes that the pod references a PVC specifying
   RWO, which requires "attach" to be successful
6. The attach/detach controller observes that a pod has been bound with a PVC
   that requires "attach", and attempts to execute a compare-and-swap update
   on the PVC/PV attaching it to node `A` and pod 1
7. The kubelet observes the attach of the PVC/PV and executes the pod
8. The user terminates the pod
9. The user creates a new pod that references the PVC
10. The scheduler binds this new pod to node `B`, which also has
    `storagecluster=iscsi-1`
11. The kubelet on node `B` observes the new pod, but sees that the PVC/PV is
    bound to node `A` and so must wait for detach
12. The kubelet on node `A` completes the deletion of pod 1
13. The attach/detach controller observes that the first pod has been deleted
    and that the previous attach of the volume to pod 1 is no longer valid -
    it performs a CAS update on the PVC/PV clearing its attach state
14. The attach/detach controller observes that the second pod has been
    scheduled and attaches it to node `B` and pod 2
15. The kubelet on node `B` observes the attach and allows the pod to execute

If a partition occurred after step 11, the attach controller would block
waiting for the pod to be deleted, and prevent node `B` from launching the
second pod. The fencing controller, upon observing the partition, could
signal the iSCSI servers to firewall node `A`. Once that firewall is in
place, the fencing controller could break the PVC/PV attach to node `A`,
allowing steps 13 onwards to continue.


### User interface changes

Clients today may assume that force deletions are safe. We must audit clients
to identify this behavior and improve their messages. For instance,
`kubectl delete --grace-period=0` could print a warning and require
`--confirm`:

```
$ kubectl delete pod foo --grace-period=0
warning: Force deleting a pod does not wait for the pod to terminate, meaning
your containers will be stopped asynchronously. Pass --confirm to continue.
```

Likewise, attached volumes would require new semantics to allow the
attachment to be broken.

Clients should also communicate partitioned state more clearly - changing the
status column of a pod list to contain the condition indicating NodeDown
would help users understand what actions they could take.


## Backwards compatibility

On upgrade, pet sets will not be "safe" until the above behavior is
implemented. All other behaviors should remain as-is.


## Testing

All of the above implementations propose to ensure pods can be treated as
components of a strongly consistent cluster. Since formal proofs of
correctness are unlikely in the foreseeable future, Kubernetes must
empirically demonstrate the correctness of the proposed systems. Automated
testing of the mentioned components should be designed to expose ordering and
consistency flaws in the presence of:

* Master-node partitions
* Node-node partitions
* Master-etcd partitions
* Concurrent controller execution
* Kubelet failures
* Controller failures

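A skeleton for such fault-injection tests might look like the Go test below;
the injection and assertion helpers are assumed to exist in a cluster test
harness and are only indicated in comments:

```go
// Skeleton only: each subtest would inject one fault class and assert
// that at most one instance of each pet ran at any point in time.
package e2e

import "testing"

func TestAtMostOnePetUnderFaults(t *testing.T) {
	faults := []string{
		"master-node partition",
		"node-node partition",
		"master-etcd partition",
		"concurrent controller execution",
		"kubelet failure",
		"controller failure",
	}
	for _, fault := range faults {
		t.Run(fault, func(t *testing.T) {
			// inject(fault)             // hypothetical harness call
			// assertAtMostOneInstance() // hypothetical harness call
			t.Skip("requires a cluster fault-injection harness")
		})
	}
}
```
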
A test suite that can perform these tests in combination with real-world pet
sets would be desirable, although possibly non-blocking for this proposal.


## Documentation

We should document the lifecycle guarantees provided by the cluster in a
clear and unambiguous way for end users.


## Deferred issues

* Live migration continues to be unsupported on Kubernetes for the
  foreseeable future, and no additional changes will be made to this proposal
  to account for that feature.


## Open Questions

* Should node deletion be treated as "node was down and all processes
  terminated"?
  * Pro: it's a convenient signal that we use in other places today
  * Con: the kubelet recreates its Node object, so if a node is partitioned
    and the admin deletes the node, when the partition is healed the node
    would be recreated, and the processes are *definitely* not terminated
    * This implies we must alter the pod GC controller to only signal
      graceful deletion, and only to flag pods on nodes that don't exist as
      partitioned, rather than force deleting them
  * Decision: YES - captured above