Initial proposal for controller history in support of DaemonSet update
and StatefulSet update features.
parent f929dac304
commit 3bc2e74ec3

@ -0,0 +1,462 @@

# Controller History

**Author**: kow3ns@

**Status**: Proposal

## Abstract
In Kubernetes, in order to update and roll back the configuration and binary
images of controller-managed Pods, users mutate DaemonSet, StatefulSet,
and Deployment Objects, and the corresponding controllers attempt to transition
the current state of the system to the new declared target state.

To facilitate update and rollback for these controllers, and to provide a
primitive that third-party controllers can build on, we propose a mechanism
that allows controllers to manage a bounded history of revisions to the declared
target state of their generated Objects.

## Affected Components

1. API Machinery
1. API Server
1. Kubectl
1. Controllers that utilize the feature

## Requirements

1. History is a collection of points in time, and each point in time must be
represented by its own Object. While it is tempting to aggregate all of an
Object's history into a single container Object, experience with Borg and Mesos
has taught us that this inevitably leads to exhausting the single Object size
limit of the system's storage backend.
1. We must be able to select the Objects that contain point-in-time snapshots
of versions of an Object to reconstruct the Object's history.
1. History respects causality. The Object type used to store point-in-time
snapshots must be strictly ordered with respect to creation. CreationTimestamp
should not be used, as this is susceptible to clock skew.
1. History must not be revisionist. Once an Object corresponding to a version
of a controller's target state is created, it cannot be mutated.
1. Controller history requires only current events. Storing an exhaustive
history of all revisions to all controllers is out of scope for our purposes,
and it can be solved by applying a version control system to manifests. Internal
revision history must only store revisions to the controller's target state that
correspond to live Objects and (potentially) a small, configurable number of
prior revisions.
1. History is scale invariant. A revision to a controller is a modification
that changes the specification of the Objects it generates. Changing the
cardinality of those Objects is a scaling operation and does not constitute a
revision.

## Terminology
The following terminology is used throughout the rest of this proposal. We
make its meaning explicit here.

- The specification type of a controller is the type that contains the
specification for the Objects generated by the controller.
  - For example, the specification types for the ReplicaSet, DaemonSet,
  and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec,
  and StatefulSetSpec respectively.
- The generated type(s) for a controller is/are the type of the Object(s)
generated by the controller.
  - Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet
  controllers.
  - PersistentVolumeClaim is also a generated type for the StatefulSet
  controller.
- The current state of a controller is the union of the states of its generated
Objects along with its status.
  - For ReplicaSet, DaemonSet, and StatefulSet, the current state of the
  corresponding controllers can be derived from the Pods they contain and the
  ReplicaSetStatus, DaemonSetStatus, and StatefulSetStatus objects
  respectively.
- For all specification type Objects for controllers, the target state is the
set of fields in the Object that determine the state to which the controller
attempts to evolve the system.
  - This may not necessarily be all fields of the Object.
  - For example, for the StatefulSet controller `.Spec.Template`,
  `.Spec.Replicas`, and `.Spec.VolumeClaims` determine the target state. The
  controller "wants" to create `.Spec.Replicas` Pods generated from
  `.Spec.Template` and `.Spec.VolumeClaims`.
- The target Object state is the subset of the target state necessary to create
Objects of the generated type(s).
  - To make this concrete, for the StatefulSet controller `.Spec.Template`
  and `.Spec.VolumeClaims` are the target Object state. This is enough
  information for the controller to generate Pods and corresponding PVCs.
- If a version of the target Object state was used to generate an Object that
has not yet been deleted, we refer to the version, and any snapshots of the
version, as live.

## API Objects

Kubernetes controllers already persist their current and target states to the
API Server. In order to maintain a history of revisions to specification type
Objects, we only need to persist snapshots of the target Object states
contained in the specification type when they are revised.

One approach would be to have, for every specification type, a
corresponding History type. For example, we could introduce a StatefulSetHistory
object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims.
The StatefulSet controller could use this object to store point-in-time
snapshots of versions of StatefulSetSpecs. However, this requires that we
introduce a new History Kind for all current and future controllers. It has the
benefit of type safety, but, for this benefit, we trade generality.

Another approach would be to use PodTemplate objects. This mechanism provides
the desired generality, but it only provides for the recording of versions of
PodTemplateSpecs (e.g., for StatefulSet, we cannot use PodTemplates to
record revisions to PersistentVolumeClaims). Also, it introduces the potential
for overlapping histories for two Objects of different Kinds, with the same
`.Name` in the same Namespace. Lastly, it constrains the PodTemplate Kind from
evolving to fulfill its original intention.

We propose an approach analogous to the one taken by the
[Mesos](http://mesos.apache.org/) community. Mesos frameworks, which are in some
ways like Kubernetes controllers, are responsible for checkpointing,
persisting, and recovering their own state. This problem is so common that
Mesos provides a ["State Abstraction"](https://github.com/apache/mesos/blob/master/include/mesos/state/state.hpp)
that allows frameworks to persist their state in either ZooKeeper or the
Mesos Replicated Log (a Multi-Paxos based state machine used by the Mesos
Masters). This State Abstraction is a mutable, durable dictionary where keys
and values are opaque strings. As controllers only need the capability to
persist an immutable point-in-time snapshot of target Object states to
implement a revision history, we propose to use the ControllerRevision object
for this purpose.

``` golang
// ControllerRevision implements an immutable snapshot of state data. Clients
// are responsible for serializing and deserializing the objects that contain
// their internal state.
// Once a ControllerRevision has been successfully created, it cannot be updated.
// The API Server will fail validation of all requests that attempt to mutate
// the Data field. ControllerRevisions may, however, be deleted.
type ControllerRevision struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta
	// Data contains the serialized state.
	Data runtime.RawExtension
	// Revision indicates the revision of the state represented by Data.
	Revision int64
}
```

## API Server
The API Server must support the creation and deletion of ControllerRevision
objects. As we have no mechanism for declarative immutability, the API Server
must fail any update request that updates the `.Data` field of a
ControllerRevision Object.

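As a minimal sketch of the update check described above, the following Go program rejects any update whose `.Data` differs from the stored value. The `ControllerRevision` struct here is a simplified, hypothetical stand-in for the proposed type (`Data` is a plain byte slice rather than a `runtime.RawExtension`), and `ValidateUpdate` and `ErrImmutableData` are illustrative names, not part of any existing API.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// ControllerRevision is a simplified, hypothetical stand-in for the proposed
// API type. Data holds the serialized target Object state.
type ControllerRevision struct {
	Name     string
	Data     []byte
	Revision int64
}

// ErrImmutableData is returned when an update attempts to change Data.
var ErrImmutableData = errors.New("ControllerRevision.Data is immutable")

// ValidateUpdate fails any update whose Data differs from the stored Data.
// Other fields are deliberately left unchecked in this sketch.
func ValidateUpdate(stored, updated ControllerRevision) error {
	if !bytes.Equal(stored.Data, updated.Data) {
		return ErrImmutableData
	}
	return nil
}

func main() {
	stored := ControllerRevision{Name: "web-1", Data: []byte(`{"image":"v1"}`), Revision: 1}
	mutated := stored
	mutated.Data = []byte(`{"image":"v2"}`)

	fmt.Println("unchanged data:", ValidateUpdate(stored, stored))
	fmt.Println("mutated data:  ", ValidateUpdate(stored, mutated))
}
```

Comparing raw bytes is the strictest possible test; it is appropriate here because the requirement is immutability of the stored snapshot, not semantic equivalence.
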
## Controllers
This section is presented as a generalization of how an arbitrary controller
can use ControllerRevision to persist a history of revisions to its
specification type Objects. The technique is applicable, without loss of
generality, to the existing Kubernetes controllers that have Pod as a generated
type.

When a controller detects a revision to the target Object state of a
specification type Object it will do the following.

1. The controller will [create a snapshot](#version-snapshot-creation) of the
current target Object state.
1. The controller will [reconstruct the history](#history-reconstruction) of
revisions to the Object's target Object state.
1. The controller will test the current target Object state for
[equivalence](#version-equivalence) with all other versions in the Object's
revision history.
   - If the current version is semantically equivalent to its immediate
   predecessor, no update to the Object's target state has been performed.
   - If the current version is equivalent to a version prior to its immediate
   predecessor, this indicates a rollback.
   - If the current version is not equivalent to any prior version, this
   indicates an update or a roll forward.
   - Controllers should use their status objects for bookkeeping with respect
   to current and prior revisions.
1. The controller will
[reconcile its generated Objects](#target-object-state-reconciliation)
with the new target Object state.
1. The controller will [maintain the length of its history](#history-maintenance)
to be no greater than the configured limit.

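The equivalence test in step 3 above can be sketched as a small classification function. This is only an illustration under stated assumptions: versions are represented as opaque strings, `history` holds the prior versions ordered oldest first (excluding the current one), and `Classify`, `Change`, and the `equal` callback are hypothetical names.

```go
package main

import "fmt"

// Change describes how the current target Object state relates to history.
type Change string

const (
	NoChange Change = "no change"
	Rollback Change = "rollback"
	Update   Change = "update"
)

// Classify compares current against history (oldest first), using equal as
// the semantic equivalence test. A match with the immediate predecessor means
// nothing changed, a match with any earlier version is a rollback, and no
// match at all is an update or roll forward.
func Classify(history []string, current string, equal func(a, b string) bool) Change {
	last := len(history) - 1
	for i := last; i >= 0; i-- {
		if equal(history[i], current) {
			if i == last {
				return NoChange
			}
			return Rollback
		}
	}
	return Update
}

func main() {
	equal := func(a, b string) bool { return a == b }
	history := []string{"image:v1", "image:v2"}

	fmt.Println(Classify(history, "image:v2", equal))
	fmt.Println(Classify(history, "image:v1", equal))
	fmt.Println(Classify(history, "image:v3", equal))
}
```
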
### Version Snapshot Creation
To take a snapshot of the target Object state contained in a specification type
Object, a controller will do the following.

1. The controller will serialize all of the Object's target Object state and
store the serialized representation in the ControllerRevision's `.Data`.
1. The controller will store a unique, monotonically increasing
[revision number](#revision-number-selection) in the Revision field.
1. The controller will compute the [hash](#hashing) of the
ControllerRevision's `.Data`.
1. The controller will attach a label to the ControllerRevision so that it is
selectable with a low probability of overlap.
   - ControllerRefs will be used as the authoritative test for ownership.
   - The specification type Object's `.Selector` should be used where
   applicable.
   - Alternatively, a Kind-unique label may be set to the `.Name` of the
   specification type Object.
1. The controller will add a ControllerRef indicating the specification type
Object as the owner of the ControllerRevision in the ControllerRevision's
`.OwnerReferences`.
1. The controller will use the hash from above, along with a user-identifiable
prefix, to [generate a unique `.Name`](#unique-name-generation) for the
ControllerRevision.
   - The controller should, where possible, use the `.Name` of the
   specification type Object.
1. The controller will persist the ControllerRevision via the API Server.
   - Note that, in practice, creation occurs concurrently with
   [collision resolution](#collision-resolution).

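The snapshot steps above might look roughly like the following sketch. Everything here is an assumption made for illustration: the simplified `ControllerRevision`, `OwnerRef`, and `TargetState` types, JSON as the serialization format, FNV-1a as the hash, and the `example.com/...` label keys. Persisting the result and resolving name collisions are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// OwnerRef and ControllerRevision are simplified, hypothetical stand-ins for
// the real API types.
type OwnerRef struct {
	Kind, Name string
	Controller bool
}

type ControllerRevision struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []OwnerRef
	Data            []byte
	Revision        int64
}

// TargetState is a hypothetical target Object state with a single field.
type TargetState struct {
	Image string `json:"image"`
}

// NewSnapshot serializes the state, hashes the serialized data, and fills in
// the revision number, selection labels, ControllerRef, and generated name.
func NewSnapshot(ownerKind, ownerName string, state TargetState, revision int64) (ControllerRevision, error) {
	data, err := json.Marshal(state)
	if err != nil {
		return ControllerRevision{}, err
	}
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error.
	return ControllerRevision{
		Name: fmt.Sprintf("%s-%08x", ownerName, h.Sum32()),
		Labels: map[string]string{ // hypothetical label keys
			"example.com/owner-kind": ownerKind,
			"example.com/owner-name": ownerName,
		},
		OwnerReferences: []OwnerRef{{Kind: ownerKind, Name: ownerName, Controller: true}},
		Data:            data,
		Revision:        revision,
	}, nil
}

func main() {
	rev, err := NewSnapshot("StatefulSet", "web", TargetState{Image: "nginx:1.13"}, 1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("name=%s revision=%d data=%s\n", rev.Name, rev.Revision, rev.Data)
}
```

Because the name is derived from the serialized data, two snapshots of the same state produce the same name regardless of their revision numbers.
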
### Revision Number Selection
We propose two methods for selecting the `.Revision` used to order a
specification type Object's revision history.

1. Set the `.Revision` field to the `.Generation` field.
   - This approach has the benefit of leveraging the existing monotonically
   increasing sequence generated by the `.Generation` field.
   - The downside of this approach is that history will not survive the
   destruction of an Object.
1. Use an approach analogous to Deployment.
   1. Reconstruct the Object's revision history.
   1. If the history is empty, use a `.Revision` of `0`.
   1. If the history is not empty, set the `.Revision` to a value greater than
   the maximum value of all previous `.Revision` values.

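The second method can be sketched in a few lines. `NextRevision` is a hypothetical helper that takes the `.Revision` values found in the reconstructed history; choosing "maximum plus one" is one way to satisfy "a value greater than the maximum".

```go
package main

import "fmt"

// NextRevision returns 0 for an empty history and otherwise one more than the
// largest revision number seen so far. The input need not be sorted.
func NextRevision(history []int64) int64 {
	if len(history) == 0 {
		return 0
	}
	highest := history[0]
	for _, r := range history[1:] {
		if r > highest {
			highest = r
		}
	}
	return highest + 1
}

func main() {
	fmt.Println(NextRevision(nil))
	fmt.Println(NextRevision([]int64{0, 3, 1}))
}
```
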
### History Reconstruction
To reconstruct the history of a specification type Object, a controller will do
the following.

1. Select all ControllerRevision Objects labeled as described
[above](#version-snapshot-creation).
1. Filter out any ControllerRevisions that do not have a ControllerRef in their
`.OwnerReferences` indicating ownership by the Object.
1. Sort the ControllerRevisions by the `.Revision` field.
1. This produces a strictly ordered set of ControllerRevisions that comprises
the ordered revision history of the specification type Object.

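A minimal sketch of the filter-and-sort steps follows. It assumes the label selection has already happened and that ownership is identified by a UID carried in a simplified, hypothetical `OwnerRef`; `ReconstructHistory` is an illustrative name.

```go
package main

import (
	"fmt"
	"sort"
)

// OwnerRef and ControllerRevision are simplified, hypothetical stand-ins.
type OwnerRef struct {
	UID        string
	Controller bool
}

type ControllerRevision struct {
	Name            string
	OwnerReferences []OwnerRef
	Revision        int64
}

// ownedBy reports whether rev has a ControllerRef pointing at ownerUID.
func ownedBy(rev ControllerRevision, ownerUID string) bool {
	for _, ref := range rev.OwnerReferences {
		if ref.Controller && ref.UID == ownerUID {
			return true
		}
	}
	return false
}

// ReconstructHistory drops revisions not owned by ownerUID and returns the
// rest ordered by ascending Revision.
func ReconstructHistory(selected []ControllerRevision, ownerUID string) []ControllerRevision {
	var history []ControllerRevision
	for _, rev := range selected {
		if ownedBy(rev, ownerUID) {
			history = append(history, rev)
		}
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].Revision < history[j].Revision
	})
	return history
}

func main() {
	mine := []OwnerRef{{UID: "uid-a", Controller: true}}
	other := []OwnerRef{{UID: "uid-b", Controller: true}}
	selected := []ControllerRevision{
		{Name: "web-3", OwnerReferences: mine, Revision: 3},
		{Name: "web-x", OwnerReferences: other, Revision: 2},
		{Name: "web-1", OwnerReferences: mine, Revision: 1},
	}
	for _, rev := range ReconstructHistory(selected, "uid-a") {
		fmt.Println(rev.Revision, rev.Name)
	}
}
```
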
### History Maintenance
Controllers should be configured, either globally or on a per specification type
Object basis, to have a `RevisionHistoryLimit`. This field will indicate the
number of non-live revisions the controller should maintain in its history
for each specification type Object. Every time a controller observes a
specification type Object it will do the following.

1. The controller will
[reconstruct the Object's revision history](#history-reconstruction).
   - Note that the process of reconstructing the Object's history filters out
   any ControllerRevisions not owned by the Object.
1. The controller will filter out any ControllerRevisions that represent a live
version.
1. If the number of remaining ControllerRevisions is greater than the configured
`RevisionHistoryLimit`, the controller will delete them, in ascending order of
their `.Revision`, until the number of remaining ControllerRevisions is equal
to the `RevisionHistoryLimit`.

This ensures that the number of recorded, non-live revisions is less than or
equal to the configured `RevisionHistoryLimit`.

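The pruning rule can be illustrated with the sketch below. It assumes the history is already sorted by ascending `.Revision`, that liveness is supplied as a set of revision names, and that the lowest revision numbers are deleted first; `RevisionsToDelete` and the simplified `ControllerRevision` type are hypothetical.

```go
package main

import "fmt"

// ControllerRevision is a simplified, hypothetical stand-in.
type ControllerRevision struct {
	Name     string
	Revision int64
}

// RevisionsToDelete takes a history sorted by ascending Revision, skips live
// revisions, and returns the names of the lowest-numbered non-live revisions
// that must be deleted so that at most limit non-live revisions remain.
func RevisionsToDelete(history []ControllerRevision, live map[string]bool, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	var nonLive []string
	for _, rev := range history {
		if !live[rev.Name] {
			nonLive = append(nonLive, rev.Name)
		}
	}
	excess := len(nonLive) - limit
	if excess <= 0 {
		return nil
	}
	return nonLive[:excess]
}

func main() {
	history := []ControllerRevision{
		{"web-0", 0}, {"web-1", 1}, {"web-2", 2}, {"web-3", 3}, {"web-4", 4},
	}
	live := map[string]bool{"web-1": true, "web-4": true}
	fmt.Println(RevisionsToDelete(history, live, 1))
}
```

Live revisions never count against the limit, so the total number of stored ControllerRevisions may exceed `RevisionHistoryLimit` while an update is in progress.
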
### Version Tracking
Controllers must track the version of the target Object state that corresponds
to their generated Objects. This information is necessary to determine which
versions are live, and to track which Objects need to be updated during a
target state update or rollback. We propose two methods that controllers may
use to track live versions and their association with generated Objects.

1. The most straightforward method is labeling. In this method the generated
Objects are labeled with the `.Name` of the ControllerRevision object that
corresponds to the version of the target Object state that was used to generate
them. As we have taken care to ensure the uniqueness of the `.Name`s of the
ControllerRevisions, this approach is reasonable.
   - A revision is considered to be live while any generated Object labeled
   with its `.Name` is live.
   - This method has the benefit of providing visibility, via the label, to
   users with respect to the historical provenance of a generated Object.
   - The primary drawback is the lack of support for using garbage collection
   to ensure that only non-live version snapshots are collected.
1. Controllers may also use the `OwnerReferences` field of the
ControllerRevision to record all Objects that are generated from the target
Object state version represented by the ControllerRevision as its owners.
   - A revision is considered to be live while any generated Object that owns
   it is live.
   - This method allows for the implementation of generic garbage collection.
   - The primary drawback with this method is that the bookkeeping is complex,
   and deciding if a generated Object corresponds to a particular revision
   will require testing each Object for membership in the `OwnerReferences`
   of all ControllerRevisions.

Note that, since we are labeling the generated Objects to indicate their
provenance with respect to the version of the controller's target Object state,
we are susceptible to downstream mutations by other controllers changing the
controller's product. The best we can do is guarantee that our product meets
the specification at the time of creation. If a third party mutates the product
downstream (as long as it does so in a consistent and intentional way), we
don't want to recall it and make it conform to the original specification. This
would cause the controllers to "fight" indefinitely.

At the cost of the complexity of implementing both labeling and ownership,
controllers may use a combination of both approaches to mitigate the
deficiencies of each.

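For the labeling method, determining the set of live revisions reduces to collecting label values from the generated Objects that still exist. The sketch below assumes a hypothetical label key (`RevisionLabel`) and a simplified `Pod` type; the proposal does not fix either.

```go
package main

import (
	"fmt"
	"sort"
)

// RevisionLabel is a hypothetical label key holding the .Name of the
// ControllerRevision a generated Object was created from.
const RevisionLabel = "example.com/controller-revision-name"

// Pod is a simplified, hypothetical stand-in for a generated Object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// LiveRevisions returns the set of ControllerRevision names referenced by the
// labels of the given (not yet deleted) generated Objects. Objects without
// the label contribute nothing.
func LiveRevisions(pods []Pod) map[string]bool {
	live := make(map[string]bool)
	for _, pod := range pods {
		if name, ok := pod.Labels[RevisionLabel]; ok {
			live[name] = true
		}
	}
	return live
}

func main() {
	pods := []Pod{
		{Name: "web-0", Labels: map[string]string{RevisionLabel: "web-5f3a"}},
		{Name: "web-1", Labels: map[string]string{RevisionLabel: "web-9c1e"}},
		{Name: "web-2", Labels: map[string]string{RevisionLabel: "web-9c1e"}},
		{Name: "web-3"}, // no tracking label
	}
	var names []string
	for name := range LiveRevisions(pods) {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```
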
### Version Equivalence
When the target Object state of a specification type Object is revised, we wish
to minimize the number of mutations to generated Objects as the controller seeks
to conform the system to its target state. That is, if a generated Object
already conforms to the revised target Object state, it is imperative that we
do not mutate it.

Failure to implement this correctly could result in the simultaneous rolling
restart of every Pod in every StatefulSet and DaemonSet in the system when
additions are made to PodTemplateSpec during a master upgrade. It is therefore
necessary to determine if the current target Object state is equivalent to a
prior version.

Since we [track the version](#version-tracking) of generated Objects, this
reduces to deciding if the version of the target Object state associated with
the generated Object is equivalent to the current target Object state.
Even though [hashing](#hashing) is used to generate the `.Name` of the
ControllerRevisions used to encapsulate versions of the target Object state, as
we do not require cryptographically strong collision resistance, and given we
use a [collision resolution](#collision-resolution) technique, we can't use the
[generated names](#unique-name-generation) of ControllerRevisions to decide
equality.

We propose that two ControllerRevisions can be considered equal if their
`.Data` is equivalent, but that it is not sufficient to compare the serialized
representation of their `.Data`. Consider that the addition of new fields
to the Objects that represent the target Object state may cause the serialized
representation of those Objects to be unequal even when they are semantically
equivalent.

The controller should deserialize the values of the ControllerRevisions
representing their target Object state and perform a deep, semantic equality
test. Here all differences that do not constitute a mutation to the target
Object state are disregarded during the equivalence test.

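To illustrate why comparing serialized bytes is not sufficient, the sketch below compares two JSON snapshots of a hypothetical `TargetState` type. The second snapshot was written after a (hypothetical) `Priority` field was added, so its bytes differ even though the state is semantically the same. JSON and `reflect.DeepEqual` are assumptions of this sketch; a real controller would use the serialization and equality helpers appropriate to its types.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"reflect"
)

// TargetState is a hypothetical target Object state. Priority stands in for a
// field added in a later release; its zero value is the default.
type TargetState struct {
	Image    string `json:"image"`
	Priority int    `json:"priority"`
}

// Equivalent deserializes both snapshots into the current type and compares
// the resulting values rather than the raw bytes.
func Equivalent(a, b []byte) (bool, error) {
	var sa, sb TargetState
	if err := json.Unmarshal(a, &sa); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &sb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(sa, sb), nil
}

func main() {
	before := []byte(`{"image":"nginx:1.13"}`)             // written before the field existed
	after := []byte(`{"image":"nginx:1.13","priority":0}`) // written after the upgrade

	same, err := Equivalent(before, after)
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes equal:        ", bytes.Equal(before, after))
	fmt.Println("semantically equal: ", same)
}
```
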
### Target Object State Reconciliation
There are three ways for a controller to reconcile a generated Object with the
declared target Object state.

1. If the target Object state is [equivalent](#version-equivalence) to the
target Object state associated with the generated Object, the controller will
update the associated [version tracking information](#version-tracking).
1. If the Object can be updated in place to reconcile its state with the
current target Object state, a controller may update the Object in place
provided that the associated version tracking information is updated as well.
1. Otherwise, the controller must destroy the Object and recreate it from the
current target Object state.

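The three cases can be read as a simple decision, sketched below. `ChooseAction` and the `Action` values are hypothetical names, and the two boolean inputs are assumed to come from the equivalence test and from controller-specific knowledge of which fields can be updated in place.

```go
package main

import "fmt"

// Action is the reconciliation step chosen for one generated Object.
type Action string

const (
	UpdateTracking Action = "update version tracking only"
	UpdateInPlace  Action = "update in place and update version tracking"
	Recreate       Action = "destroy and recreate"
)

// ChooseAction applies the three cases in order: an equivalent Object only
// needs its tracking information refreshed, an Object that supports in-place
// update is patched, and anything else is destroyed and recreated.
func ChooseAction(equivalent, updatableInPlace bool) Action {
	switch {
	case equivalent:
		return UpdateTracking
	case updatableInPlace:
		return UpdateInPlace
	default:
		return Recreate
	}
}

func main() {
	fmt.Println(ChooseAction(true, false))
	fmt.Println(ChooseAction(false, true))
	fmt.Println(ChooseAction(false, false))
}
```
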
### Kubernetes Upgrades
During the upgrade process from a version of Kubernetes that does not support
controller history to a version that does, controllers that implement
history-based update mechanisms may find that they have specification type
Objects with no history and with generated Objects. For instance, a StatefulSet
may exist with several Pods and no history. We defer requirements for handling
history initialization to the individual proposals pertaining to those
controllers' update mechanisms. However, implementors should take note of the
following.

1. If the history of an Object is not initialized, controllers should
continue to (re)create generated Objects based on the current target Object
state.
1. The history should be initialized on the first mutation to the specification
type Object for which the history will be generated.
1. After the history has been initialized, any generated Objects that have no
indication of the revision from which they were generated may be treated as if
they have a nil revision. That is, without respect to the method of
[version tracking](#version-tracking) used, the generated Objects may be
treated as if they have a version that corresponds to no revision, and the
controller may proceed to
[reconcile their state](#target-object-state-reconciliation) as appropriate to
the internal implementation.

## Kubectl

Modifications to kubectl to leverage controller history are an optional
extension. Users can trigger rolling updates and rollbacks by modifying their
manifests and using `kubectl apply`. Controllers will be able to detect
revisions to their target Object state and perform
[reconciliation](#target-object-state-reconciliation) as necessary.

### Viewing History

Users can view a controller's revision history with the following command.

```bash
> kubectl rollout history
```

To view the details of the revision indicated by `<revision>`, users can use
the following command.

```bash
> kubectl rollout history --revision <revision>
```

### Rollback

For future work, `kubectl rollout undo` can be implemented in the general case
as an extension of the [above](#viewing-history).

```bash
> kubectl rollout undo
```

Here `kubectl rollout undo` simply uses strategic merge patch to apply the state
contained at a particular revision.

## Tests

1. Controllers can create a ControllerRevision containing a revision of their
target Object state.
1. Controllers can reconstruct their revision history.
1. Controllers can't update a ControllerRevision's `.Data`.
1. Controllers can delete a ControllerRevision to maintain their history with
respect to the configured `RevisionHistoryLimit`.

## Appendix

### Hashing
We will require a CRHF (collision resistant hash function), but, as we expect
no adversaries, such a function need not be resistant to pre-image and
second pre-image attacks.
As the property of interest is primarily collision resistance, and as we
provide a method of [collision resolution](#collision-resolution), both
cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2),
and non-cryptographic functions, such as Fowler-Noll-Vo (FNV), are applicable.

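Both families are available in the Go standard library. The sketch below hashes the same serialized data with 32-bit FNV-1a and with SHA-256; the helper names `fnvHash` and `shaHash` and the hex encoding are illustrative choices, not requirements of this proposal.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/fnv"
)

// fnvHash returns the 32-bit FNV-1a hash of data as 8 hex characters.
func fnvHash(data []byte) string {
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error.
	return fmt.Sprintf("%08x", h.Sum32())
}

// shaHash returns the SHA-256 (a SHA-2 family function) digest of data as 64
// hex characters.
func shaHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	data := []byte(`{"image":"nginx:1.13"}`)
	fmt.Println("FNV-1a: ", fnvHash(data))
	fmt.Println("SHA-256:", shaHash(data))
}
```

The shorter FNV output leaves more room for a readable prefix in the generated `.Name`, at the cost of a higher collision probability, which is why a collision resolution scheme is needed.
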
### Collision Resolution
As the function selected for hashing may not be cryptographically strong and may
produce collisions, we need a method for collision resolution. To demonstrate
its feasibility, we construct such a scheme here. However, this proposal does
not mandate its use.

Given a hash function with output size `HashSize` defined
as `func H(s string) [HashSize]byte`, in order to resolve collisions we
define a new function `func H'(s string, n int) [HashSize]byte` where `H'`
returns the result of invoking `H` on the concatenation of `s` with the string
value of `n`. We define a third function
`func H''(s string, exists func (string) bool)(int,[HashSize]byte)`. `H''`
will start with `n := 0` and compute `s' := H'(s,n)`, incrementing `n` when
`exists(s')` returns true, until `exists(s')` returns false. After this it will
return `n,s'`.

For our purposes, the implementation of the `exists` function will attempt to
create a ControllerRevision, via the API Server, with a `.Name` produced by
[unique name generation](#unique-name-generation). If creation fails due to a
conflict, the name is already taken and the method returns true, so `H''` tries
the next value of `n`. If creation succeeds, the method returns false.

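The scheme above can be sketched directly in Go. Since `'` is not valid in Go identifiers, `H'` and `H''` are written as `HPrime` and `HDoublePrime`. The choice of 32-bit FNV-1a for `H`, the hex encoding passed to `exists`, and the in-memory set standing in for the API Server are all assumptions of this sketch.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"strconv"
)

// HashSize is the output size of H in bytes (4 for 32-bit FNV-1a).
const HashSize = 4

// H hashes s with FNV-1a.
func H(s string) [HashSize]byte {
	h := fnv.New32a()
	h.Write([]byte(s))
	var out [HashSize]byte
	copy(out[:], h.Sum(nil))
	return out
}

// HPrime hashes the concatenation of s and the string value of n.
func HPrime(s string, n int) [HashSize]byte {
	return H(s + strconv.Itoa(n))
}

// Key is the string form of a hash that is handed to exists.
func Key(sum [HashSize]byte) string {
	return hex.EncodeToString(sum[:])
}

// HDoublePrime increments n until exists reports that the candidate is free,
// then returns n together with the winning hash.
func HDoublePrime(s string, exists func(string) bool) (int, [HashSize]byte) {
	for n := 0; ; n++ {
		sum := HPrime(s, n)
		if !exists(Key(sum)) {
			return n, sum
		}
	}
}

func main() {
	// taken simulates names that already exist in the API Server.
	taken := map[string]bool{Key(HPrime("serialized-state", 0)): true}
	exists := func(k string) bool { return taken[k] }

	n, sum := HDoublePrime("serialized-state", exists)
	fmt.Println(n, Key(sum))
}
```
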
### Unique Name Generation
We can use our [hash function](#hashing) and
[collision resolution](#collision-resolution) scheme to generate a system
wide unique identifier for an Object based on a deterministic non-unique prefix
and a serialized representation of the Object. Kubernetes Objects' `.Name`
fields must conform to a DNS subdomain. Therefore, the total length of the
unique identifier must not exceed 255, and in practice 253, characters. We can
generate a unique identifier that meets this constraint by selecting a hash
function such that the output length is equal to `253-len(prefix)` and applying
our [hash](#hashing) function and [collision resolution](#collision-resolution)
scheme to the serialized representation of the Object's data. The unique hash
and integer can be combined to produce a unique suffix for the Object's `.Name`.

1. We must also ensure that the unique name does not contain any bad words.
1. We may also wish to spend additional characters to prettify the generated
name for readability.

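A minimal sketch of name generation under the 253 character limit follows. Note that it deviates from the text above in one respect: rather than sizing the hash output to `253-len(prefix)`, it uses a fixed-length FNV-1a hash and truncates the prefix when necessary. `GenerateName` and the `prefix-hash-n` layout are hypothetical, and the sketch does not check for bad words or that a truncated prefix still ends in a valid DNS character.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// maxNameLen is the practical length limit for a DNS subdomain name.
const maxNameLen = 253

// GenerateName combines a prefix, the hash of the serialized data, and the
// collision counter n (assumed non-negative) into "prefix-hash-n". The prefix
// is truncated if the result would exceed maxNameLen.
func GenerateName(prefix string, data []byte, n int) string {
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error.
	suffix := fmt.Sprintf("-%08x-%d", h.Sum32(), n)
	if room := maxNameLen - len(suffix); len(prefix) > room {
		prefix = prefix[:room]
	}
	return prefix + suffix
}

func main() {
	data := []byte(`{"image":"nginx:1.13"}`)
	fmt.Println(GenerateName("web", data, 0))
	fmt.Println(len(GenerateName(strings.Repeat("a", 300), data, 0)))
}
```
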