# Pod Safety, Consistency Guarantees, and Storage Implications

@smarterclayton @bprashanth

October 2016

## Proposal and Motivation

A pod represents the finite execution of one or more related processes on the
cluster. To ensure that higher-level controllers can safely build consistent
systems on top of pods, the exact guarantees around a pod's lifecycle on the
cluster must be clarified, and it must be possible for higher-order
controllers and application authors to correctly reason about the lifetime of
those processes and their access to cluster resources in a distributed
computing environment.

To run most clustered software on Kubernetes, it must be possible to
guarantee **at most one** running instance of a particular pet pod at any
time on the cluster. This allows the controller to prevent multiple
processes from accessing shared cluster resources while each believes it is
the same entity. When a node containing a pet is partitioned, the Pet Set
must remain consistent (no new entity will be spawned) but may become
unavailable (the cluster no longer has a sufficient number of members). The
Pet Set guarantee must be strong enough for an administrator to reason about
the state of the cluster by observing the Kubernetes API.

In order to reconcile partitions, an actor (human or automated) must decide
when the partition is unrecoverable. The actor may be informed of the failure
in an unambiguous way (e.g. the node was destroyed by a meteor), allowing for
certainty that the processes on that node are terminated, and thus may
resolve the partition by deleting the node and the pods on the node.
Alternatively, the actor may take steps to ensure the partitioned node cannot
return to the cluster or access shared resources - this is known as
**fencing** and is a well understood domain.

This proposal covers the changes necessary to ensure:

* Pet Sets can ensure **at most one** semantics for each individual pet
* Other system components such as the node and namespace controllers can
  safely perform their responsibilities without violating that guarantee
* An administrator or higher-level controller can signal that a node
  partition is permanent, allowing the Pet Set controller to proceed
* A fencing controller can take corrective action automatically to heal
  partitions

We will accomplish this by:

* Clarifying which components are allowed to force delete pods (as opposed
  to merely requesting termination)
* Ensuring system components can observe partitioned pods and nodes
  correctly
* Defining how a fencing controller could safely interoperate with
  partitioned nodes and pods to heal partitions
* Describing how shared storage components without innate safety
  guarantees can be safely shared on the cluster


### Current Guarantees for Pod lifecycle

The existing pod model provides the following guarantees:

* A pod is executed on exactly one node
* A pod has the following lifecycle phases:
  * Creation
  * Scheduling
  * Execution
    * Init containers
    * Application containers
  * Termination
  * Deletion
* A pod can only move through its phases in order, and may not return
  to an earlier phase
* A user may specify an interval on the pod called the **termination
  grace period** that defines the minimum amount of time the pod will
  have to complete the termination phase, and all components will honor
  this interval
* Once a pod begins termination, its termination grace period can only
  be shortened, not lengthened

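As a concrete illustration of where the grace period lives, the Go sketch
below builds a pod object with an explicit termination grace period using the
Kubernetes API types (the pod name and image are placeholders):

```go
// Sketch only: constructs a pod with a 30 second termination grace
// period; every component that stops this pod must honor the interval.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	grace := int64(30) // minimum seconds the pod has to terminate cleanly
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "web-0"},
		Spec: v1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []v1.Container{
				{Name: "app", Image: "nginx"},
			},
		},
	}
	fmt.Println(pod.Name, *pod.Spec.TerminationGracePeriodSeconds)
}
```
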
Pod termination is divided into the following steps:

* A component requests the termination of the pod by issuing a DELETE
  to the pod resource with an optional **grace period**
  * If no grace period is provided, the default from the pod is leveraged
* When the kubelet observes the deletion, it starts a timer equal to the
  grace period and performs the following actions:
  * Executes the pre-stop hook, if specified, waiting up to **grace period**
    seconds before continuing
  * Sends the termination signal to the container runtime (SIGTERM or the
    container image's STOPSIGNAL on Docker)
  * Waits 2 seconds, or the remaining grace period, whichever is longer
  * Sends the force termination signal to the container runtime (SIGKILL)
* Once the kubelet observes the container is fully terminated, it issues
  a status update to the REST API for the pod indicating termination, then
  issues a DELETE with grace period = 0

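The timing above can be summarized in a short Go sketch. This is an
illustrative simulation, not kubelet source; the hook and signal calls are
stubbed placeholders:

```go
// Illustrative simulation of the kubelet's termination timing; the
// hook and signal functions are stand-ins, not real kubelet APIs.
package main

import (
	"fmt"
	"time"
)

// The kubelet always waits at least this long between SIGTERM and SIGKILL.
const minKillDelay = 2 * time.Second

func execPreStopHook(limit time.Duration) { fmt.Println("pre-stop hook, up to", limit) }
func sendTERM()                           { fmt.Println("SIGTERM (or the image's STOPSIGNAL)") }
func sendKILL()                           { fmt.Println("SIGKILL") }

func terminatePod(gracePeriod time.Duration) {
	deadline := time.Now().Add(gracePeriod)

	// Run the pre-stop hook, waiting up to the full grace period.
	execPreStopHook(gracePeriod)

	// Signal the container runtime to begin shutdown.
	sendTERM()

	// Wait the remaining grace period, but never less than 2 seconds.
	wait := time.Until(deadline)
	if wait < minKillDelay {
		wait = minKillDelay
	}
	time.Sleep(wait)

	// Force-kill anything still running, then report termination and
	// issue the final DELETE with grace period = 0 (not shown).
	sendKILL()
}

func main() { terminatePod(30 * time.Second) }
```
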
If the kubelet crashes during the termination process, it will restart the
termination process from the beginning (the grace period is reset). This
ensures that a process is always given **at least** its grace period to
terminate cleanly.

A user may re-issue a DELETE to the pod resource specifying a shorter grace
period, but never a longer one.

Deleting a pod with grace period 0 is called **force deletion** and will
update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
remove the pod from etcd. Because all communication is asynchronous, force
deleting a pod means that the pod's processes may continue to run for an
arbitrary amount of time. If a higher-level component like the StatefulSet
controller treats the existence of the pod API object as a strongly
consistent entity, deleting the pod in this fashion will violate the
at-most-one guarantee we wish to offer for pet sets.


### Guarantees provided by replica sets and replication controllers

ReplicaSets and ReplicationControllers both attempt to **preserve
availability** of their constituent pods over ensuring at-most-one semantics
for any given pod. A replica set at scale 1 will immediately create a new pod
when it observes an old pod has begun graceful deletion, and as a result at
many points in the lifetime of a replica set there will be two copies of a
pod's processes running concurrently. Only access to exclusive resources like
storage can prevent that simultaneous execution.

Deployments, being based on replica sets, can offer no stronger guarantee.


### Concurrent access guarantees for shared storage

A persistent volume that references a strongly consistent storage backend
like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
API to prevent corruption of the data due to simultaneous access by multiple
clients. However, many storage technologies commonly deployed in the
enterprise offer no such consistency guarantee, or much weaker variants, and
rely on complex systems to control which clients may access the storage.

If a PV is assigned an iSCSI, Fibre Channel, or NFS mount point and that PV
is used by two pods on different nodes simultaneously, concurrent access may
result in corruption, even if the PV or PVC is identified as "read write
once". PVC consumers must ensure these volume types are *never* referenced
from multiple pods without some external synchronization. As described above,
it is not safe to use persistent volumes that lack RWO enforcement with a
replica set or deployment, even at scale 1.


## Proposed changes

### Avoid multiple instances of pods

To ensure that the Pet Set controller can safely use pods and ensure at most
one pod instance is running on the cluster at any time for a given pod name,
it must be possible to make pod deletion strongly consistent. A short client
sketch illustrating the resulting deletion semantics follows the list below.

To do that, we will:

* Give the Kubelet sole responsibility for normal deletion of pods -
  only the Kubelet in the course of normal operation should ever remove a
  pod from etcd (only the Kubelet should force delete)
  * The kubelet must not delete the pod until all processes are confirmed
    terminated
  * The kubelet SHOULD ensure all consumed resources on the node are freed
    before deleting the pod
* Application owners must be free to force delete pods, but they *must*
  understand the implications of doing so, and all client UIs must be able
  to communicate those implications
  * Force deleting a pod may cause data loss (two instances of the same
    pod process may be running at the same time)
* All existing controllers in the system must be limited to signaling pod
  termination (starting graceful deletion), and are not allowed to force
  delete a pod
  * The node controller will no longer be allowed to force delete pods -
    it may only signal deletion by beginning (but not completing) a
    graceful deletion
  * The GC controller may not force delete pods
  * The namespace controller used to force delete pods, but no longer
    does so. This means a node partition can block namespace deletion
    indefinitely.
  * The pod GC controller may continue to force delete pods on nodes that
    no longer exist if we treat node deletion as confirming permanent
    partition. If we do not, the pod GC controller must not force delete
    pods.
* It must be possible for an administrator to effectively resolve partitions
  manually to allow namespace deletion
* Deleting a node from etcd should be seen as a signal to the cluster that
  the node is permanently partitioned. We must audit existing components
  to verify this is the case.
  * The PodGC controller has primary responsibility for this - it already
    owns the responsibility to delete pods on nodes that do not exist, and
    so is allowed to force delete pods on nodes that do not exist
  * The PodGC controller must therefore always be running, and will be
    changed to run unconditionally in a >=1.5 cluster

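The sketch referenced above contrasts signaling termination with force
deletion using client-go (assuming a recent client-go; kubeconfig loading and
error handling are simplified, and the pod name is a placeholder):

```go
// Sketch: graceful deletion signals termination and leaves removal to
// the kubelet; force deletion removes the API object immediately.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(config).CoreV1().Pods("default")

	// Signal termination: the pod enters graceful deletion, and only the
	// kubelet removes it once all processes are confirmed terminated.
	thirty := int64(30)
	_ = pods.Delete(context.TODO(), "web-0", metav1.DeleteOptions{
		GracePeriodSeconds: &thirty,
	})

	// Force delete: the object disappears at once, but the processes may
	// keep running, so the at-most-one guarantee is forfeited.
	zero := int64(0)
	_ = pods.Delete(context.TODO(), "web-0", metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```
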
In the above scheme, force deleting a pod releases the lock on that pod and
allows higher-level components to proceed to create a replacement.

It has been requested that force deletion be restricted to privileged users.
But that would prevent application owners from resolving partitions even when
they understand the consequences of force deletion, and not all application
owners will be privileged users. For example, a user may be running a 3-node
etcd cluster in a pet set. If pet 2 becomes partitioned, the user can
instruct etcd to remove pet 2 from the cluster (via direct etcd membership
calls), and because a quorum exists pets 0 and 1 can safely accept that
action. The user can then force delete pet 2, and the pet set controller will
be able to recreate that pet on another node and have it join the cluster
safely (pets 0 and 1 constitute a quorum for membership change).

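For illustration, the membership step in that example could be performed with
the etcd v3 client; the endpoints and member ID below are placeholders:

```go
// Sketch: remove the partitioned member from the etcd cluster before
// force deleting its pod; members 0 and 1 can commit this change.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://pet-0:2379", "http://pet-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Placeholder ID for pet 2; in practice it comes from MemberList.
	const pet2ID = uint64(0xD00D)
	if _, err := cli.MemberRemove(context.TODO(), pet2ID); err != nil {
		panic(err)
	}
}
```
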
This proposal does not alter the behavior of finalizers - instead, it makes
finalizers unnecessary for common application cases (because the cluster only
deletes pods when it is safe).


### Fencing

The changes above allow Pet Sets to ensure at-most-one pod, but provide no
recourse for the automatic resolution of cluster partitions during normal
operation. For that, we propose a **fencing controller** which exists above
the current controller plane and is capable of detecting and automatically
resolving partitions. The fencing controller is an agent empowered to make
the same sorts of decisions a human administrator would make to resolve
partitions, and to take corresponding steps to prevent a dead machine from
coming back to life automatically.

Fencing controllers most benefit services that are not innately replicated,
by reducing the time it takes to detect the failure of a node or process,
isolate that node or process so it can neither initiate nor receive
communication with clients, and then spawn a replacement process. It is
expected that many StatefulSets of size 1 would prefer to be fenced, given
that most real-world applications of size 1 have no alternative for HA
except reducing mean time to recovery.

While the methods and algorithms may vary, the basic pattern would be:

1. Detect a partitioned pod or node via the Kubernetes API or via external
   means.
2. Decide whether the partition justifies fencing based on priority, policy,
   or service availability requirements.
3. Fence the node or any connected storage using appropriate mechanisms.

For this proposal we only describe the general shape of detection and how
existing Kubernetes components can be leveraged for policy, while the exact
implementation and mechanisms for fencing are left to a future proposal. A
future fencing controller would be able to leverage a number of systems,
including but not limited to:

* Cloud control plane APIs such as machine force shutdown
* Additional agents running on each host to force kill processes or trigger
  reboots
* Agents integrated with or communicating with the hypervisors running hosts
  to stop VMs
* Hardware IPMI interfaces to reboot a host
* Rack-level power units to power cycle a blade
* Network routers, backplane switches, software defined networks, or system
  firewalls
* Storage server APIs to block client access

to appropriately limit the ability of the partitioned system to impact the
cluster. Fencing agents today use many of these mechanisms to allow the
system to make progress in the event of failure. The key contribution of
Kubernetes is to define a strongly consistent pattern whereby fencing agents
can be plugged in.

To allow users, clients, and automated systems like the fencing controller to
observe partitions, we propose adding a responsibility to the node controller
(or any future controller that attempts to detect partitions): the node
controller should add a condition to pods it has begun terminating because
their node failed to heartbeat, indicating that the cause of the deletion was
a node partition.

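A minimal sketch of recording that condition, assuming a `NodeDown` condition
type (an illustrative name, not an existing API constant) and a recent
client-go:

```go
// Sketch: a partition-detecting controller records on the pod that its
// node stopped heartbeating, so users and a fencing controller can react.
package fencing

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// MarkNodeDown appends the hypothetical NodeDown condition and writes it
// back through the pod status endpoint.
func MarkNodeDown(ctx context.Context, client kubernetes.Interface, pod *v1.Pod) error {
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodConditionType("NodeDown"), // hypothetical type
		Status:             v1.ConditionTrue,
		Reason:             "NodeUnresponsive",
		Message:            "the pod's node has failed to heartbeat",
		LastTransitionTime: metav1.Now(),
	})
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```
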
It may be desirable for users to be able to request fencing when they suspect
a component is malfunctioning. That is outside the scope of this proposal,
but it would allow administrators to take an action that is safer than force
deletion, and decide at the end whether to force delete.

How the fencing controller decides to fence is left undefined, but it is
likely it could use a combination of pod forgiveness (as a signal of how much
disruption a pod author is likely to accept) and the pod disruption budget
(as a measurement of the amount of disruption already undergone) to gauge how
much latency between failure and fencing the application is willing to
tolerate. Likewise, it can use its own understanding of the latency of the
various failure detectors - the node controller, any hypothetical information
it gathers from service proxies or node peers, any heartbeat agents in the
system - to describe an upper bound on reaction time.


### Storage Consistency

To ensure that shared storage without implicit locking can be made safe for
RWO access, the Kubernetes storage subsystem should leverage the strong
consistency available through the API server and prevent concurrent use of
some types of persistent volumes. By leveraging existing concepts, we can
allow the scheduler and the kubelet to enforce a guarantee that an RWO volume
is in use on at most one node at a time.

In order to properly support region- and zone-specific storage, Kubernetes
adds node selector restrictions, derived from the persistent volume, to pods.
Expanding this concept to volume types that have no external metadata to read
(NFS, iSCSI) may result in adding a label selector to PVs that defines the
nodes the storage is allowed to run on (this is a common requirement for
iSCSI, FibreChannel, or NFS clusters).

Because all nodes in a Kubernetes cluster possess a special node name label,
it would be possible for a controller to observe the scheduling decision of a
pod using an unsafe volume and "attach" that volume to the node, and likewise
to observe the deletion of the pod and "detach" the volume from the node. The
node would then require that these unsafe volumes be "attached" before
allowing pod execution. Attach and detach may be recorded on the PVC or PV as
a new field or materialized via the selection labels.

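A sketch of how such a compare-and-swap "attach" could look if the attachment
were materialized as an annotation on the PV (the annotation key and helper
are hypothetical); optimistic concurrency on `resourceVersion` turns the
update into a CAS:

```go
// Sketch: claim a PV for a single node; a concurrent writer causes a
// 409 Conflict on Update instead of a lost update.
package attach

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const attachedNodeKey = "example.io/attached-node" // hypothetical key

// Attach records that the volume is in use by one node.
func Attach(ctx context.Context, client kubernetes.Interface, pvName, node string) error {
	pv, err := client.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if owner := pv.Annotations[attachedNodeKey]; owner != "" && owner != node {
		return fmt.Errorf("pv %s is still attached to node %s", pvName, owner)
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[attachedNodeKey] = node
	// The update carries the resourceVersion read above, so it only
	// succeeds if no one else modified the PV in the meantime.
	_, err = client.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
	return err
}
```
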
Possible sequence of operations:

1. The cluster administrator creates an RWO iSCSI persistent volume,
   available only to nodes with the label selector `storagecluster=iscsi-1`
2. A user requests an RWO volume and is bound to the iSCSI volume
3. The user creates a pod referencing the PVC
4. The scheduler observes that the pod must schedule on nodes with
   `storagecluster=iscsi-1` (alternatively this could be enforced in
   admission) and binds it to node `A`
5. The kubelet on node `A` observes that the pod references a PVC specifying
   RWO, which requires "attach" to be successful
6. The attach/detach controller observes that a pod has been bound with a PVC
   that requires "attach", and attempts to execute a compare-and-swap update
   on the PVC/PV attaching it to node `A` and pod 1
7. The kubelet observes the attach of the PVC/PV and executes the pod
8. The user terminates the pod
9. The user creates a new pod that references the PVC
10. The scheduler binds this new pod to node `B`, which also has
    `storagecluster=iscsi-1`
11. The kubelet on node `B` observes the new pod, but sees that the PVC/PV is
    bound to node `A` and so must wait for detach
12. The kubelet on node `A` completes the deletion of pod 1
13. The attach/detach controller observes that the first pod has been deleted
    and that the previous attach of the volume to pod 1 is no longer valid -
    it performs a CAS update on the PVC/PV clearing its attach state
14. The attach/detach controller observes that the second pod has been
    scheduled and attaches it to node `B` and pod 2
15. The kubelet on node `B` observes the attach and allows the pod to execute

If a partition occurred after step 11, the attach controller would block
waiting for the pod to be deleted, and prevent node `B` from launching the
second pod. The fencing controller, upon observing the partition, could
signal the iSCSI servers to firewall node `A`. Once that firewall is in
place, the fencing controller could break the PVC/PV attach to node `A`,
allowing steps 13 onwards to continue.


### User interface changes

Clients today may assume that force deletions are safe. We must audit clients
to identify this behavior and improve their messages. For instance,
`kubectl delete --grace-period=0` could print a warning and require
`--confirm`:

```
$ kubectl delete pod foo --grace-period=0
warning: Force deleting a pod does not wait for the pod to terminate, meaning
your containers will be stopped asynchronously. Pass --confirm to continue.
```

Likewise, attached volumes would require new semantics to allow the
attachment to be broken.

Clients should also communicate partitioned state more clearly - changing the
status column of a pod list to contain the condition indicating NodeDown
would help users understand what actions they could take.


## Backwards compatibility

On upgrade, pet sets will not be "safe" until the above behavior is
implemented. All other behaviors should remain as-is.


## Testing

All of the above implementations propose to ensure pods can be treated as
components of a strongly consistent cluster. Since formal proofs of
correctness are unlikely in the foreseeable future, Kubernetes must
empirically demonstrate the correctness of the proposed systems. Automated
testing of the mentioned components should be designed to expose ordering and
consistency flaws in the presence of:

* Master-node partitions
* Node-node partitions
* Master-etcd partitions
* Concurrent controller execution
* Kubelet failures
* Controller failures

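A skeleton for such fault-injection tests might look like the Go test below;
the injection and assertion helpers are assumed to exist in a cluster test
harness and are only indicated in comments:

```go
// Skeleton only: each subtest would inject one fault class and assert
// that at most one instance of each pet ran at any point in time.
package e2e

import "testing"

func TestAtMostOnePetUnderFaults(t *testing.T) {
	faults := []string{
		"master-node partition",
		"node-node partition",
		"master-etcd partition",
		"concurrent controller execution",
		"kubelet failure",
		"controller failure",
	}
	for _, fault := range faults {
		t.Run(fault, func(t *testing.T) {
			// inject(fault)             // hypothetical harness call
			// assertAtMostOneInstance() // hypothetical harness call
			t.Skip("requires a cluster fault-injection harness")
		})
	}
}
```
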
A test suite that can perform these tests in combination with real-world pet
sets would be desirable, although possibly non-blocking for this proposal.


## Documentation

We should document the lifecycle guarantees provided by the cluster in a
clear and unambiguous way for end users.


## Deferred issues

* Live migration continues to be unsupported on Kubernetes for the
  foreseeable future, and no additional changes will be made to this proposal
  to account for that feature.


## Open Questions

* Should node deletion be treated as "node was down and all processes
  terminated"?
  * Pro: it's a convenient signal that we use in other places today
  * Con: the kubelet recreates its Node object, so if a node is partitioned
    and the admin deletes the node, when the partition is healed the node
    would be recreated, and the processes are *definitely* not terminated
    * This implies we must alter the pod GC controller to only signal
      graceful deletion, and only to flag pods on nodes that don't exist as
      partitioned, rather than force deleting them
  * Decision: YES - captured above