Initial proposal for controller history in support of DaemonSet update and StatefulSet update features.
Kenneth Owens 2017-05-02 13:52:16 -07:00
parent f929dac304
commit 3bc2e74ec3
1 changed files with 462 additions and 0 deletions

@@ -0,0 +1,462 @@
# Controller History
**Author**: kow3ns@
**Status**: Proposal
## Abstract
In Kubernetes, in order to update and rollback the configuration and binary
images of controller managed Pods, users mutate DaemonSet, StatefulSet,
and Deployment Objects, and the corresponding controllers attempt to transition
the current state of the system to the new declared target state.
To facilitate update and rollback for these controllers, and to provide a
primitive that third party controllers can build on, we propose a mechanism
that allows controllers to manage a bounded history of revisions to the declared
target state of their generated Objects.
## Affected Components
1. API Machinery
1. API Server
1. Kubectl
1. Controllers that utilize the feature
## Requirements
1. History is a collection of points in time, and each point in time must be
represented by its own Object. While it is tempting to aggregate all of an
Object's history into a single container Object, experience with Borg and Mesos
has taught us that this inevitably leads to exhausting the single Object size
limit of the system's storage backend.
1. We must be able to select the Objects that contain point in time snapshots
of versions of an Object to reconstruct the Object's history.
1. History respects causality. The Object type used to store point in time
snapshots must be strictly ordered with respect to creation. CreationTimestamp
should not be used, as this is susceptible to clock skew.
1. History must not be revisionist. Once an Object corresponding to a version
of a controller's target state is created, it cannot be mutated.
1. Controller history requires only current events. Storing an exhaustive
history of all revisions to all controllers is out of scope for our purposes,
and it can be solved by applying a version control system to manifests. Internal
revision history must only store revisions to the controller's target state that
correspond to live Objects and (potentially) a small, configurable number of
prior revisions.
1. History is scale invariant. A revision to a controller is a modification
that changes the specification of the Objects it generates. Changing the
cardinality of those Objects is a scaling operation and does not constitute a
revision.
## Terminology
The following terminology is used throughout the rest of this proposal. We
make its meaning explicit here.
- The specification type of a controller is the type that contains the
specification for the Objects generated by the controller.
- For example, the specification types for the ReplicaSet, DaemonSet,
and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec,
and StatefulSetSpec respectively.
- The generated type(s) for a controller is/are the type of the Object(s)
generated by the controller.
- Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet
controllers.
- PersistentVolumeClaim is also a generated type for the StatefulSet
controller.
- The current state of a controller is the union of the states of its generated
Objects along with its status.
- For ReplicaSet, DaemonSet, and StatefulSet, the current state of the
corresponding controllers can be derived from Pods they contain and the
ReplicaSetStatus, DaemonSetStatus, and StatefulSetStatus objects
respectively.
- For all specification type Objects for controllers, the target state is the
set of fields in the Object that determine the state to which the controller
attempts to evolve the system.
- This may not necessarily be all fields of the Object.
- For example, for the StatefulSet controller `.Spec.Template`,
`.Spec.Replicas`, and `.Spec.VolumeClaims` determine the target state. The
controller "wants" to create `.Spec.Replicas` Pods generated from
`.Spec.Template` and `.Spec.VolumeClaims`.
- The target Object state is the subset of the target state necessary to create
Objects of the generated type(s).
- To make this concrete, for the StatefulSet controller `.Spec.Template`
and `.Spec.VolumeClaims` are the target Object state. This is enough
information for the controller to generate Pods and corresponding PVCs.
- If a version of the target Object state was used to generate an Object that
has not yet been deleted, we refer to the version, and any snapshots of the
version, as live.
## API Objects
Kubernetes controllers already persist their current and target states to the
API Server. In order to maintain a history of revisions to specification type
Objects, we only need to persist snapshots of the target Object states
contained in the specification type when they are revised.
One approach would be to, for every specification type, have a
corresponding History type. For example, we could introduce a StatefulSetHistory
object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims.
The StatefulSet controller could use this object to store point in time
snapshots of versions of StatefulSetSpecs. However, this requires that we
introduce a new History Kind for all current and future controllers. It has the
benefit of type safety, but, for this benefit, we trade generality.
Another approach would be to use PodTemplate objects. This mechanism provides
the desired generality, but it only provides for the recording of versions of
PodTemplateSpecs (e.g., for StatefulSet, we cannot use PodTemplates to
record revisions to PersistentVolumeClaims). Also, it introduces the potential
for overlapping histories for two Objects of different Kinds, with the same
`.Name` in the same Namespace. Lastly, it constrains the PodTemplate Kind from
evolving to fulfill its original intention.
We propose an approach that has analogs with the approach taken by the
[Mesos](http://mesos.apache.org/) community. Mesos frameworks, which are in some
ways like Kubernetes controllers, are responsible for checkpointing,
persisting, and recovering their own state. This problem is so common that
Mesos provides a ["State Abstraction"](https://github.com/apache/mesos/blob/master/include/mesos/state/state.hpp)
that allows frameworks to persist their state in either ZooKeeper or the
Mesos Replicated Log (a Multi-Paxos based state machine used by the Mesos
Masters). This State Abstraction is a mutable, durable dictionary where keys
and values are opaque strings. As controllers only need the capability to
persist an immutable point in time snapshot of target Object states to
implement a revision history, we propose to use the ControllerRevision object
for this purpose.
``` golang
// ControllerRevision implements an immutable snapshot of state data. Clients
// are responsible for serializing and deserializing the objects that contain
// their internal state.
// Once a ControllerRevision has been successfully created, it cannot be updated.
// The API Server will fail validation of all requests that attempt to mutate
// the Data field. ControllerRevisions may, however, be deleted.
type ControllerRevision struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta
	// Data contains the serialized state.
	Data runtime.RawExtension
	// Revision indicates the revision of the state represented by Data.
	Revision int64
}
```
## API Server
The API Server must support the creation and deletion of ControllerRevision
objects. As we have no mechanism for declarative immutability, the API server
must fail any update request that updates the `.Data` field of a
ControllerRevision Object.
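As a minimal sketch of the update check described above, the following uses a simplified, hypothetical `ControllerRevision` stand-in (raw bytes instead of `runtime.RawExtension`) and a hypothetical `ValidateUpdate` helper. It is not the API Server's actual validation code, only an illustration of rejecting updates that change `.Data`.

```go
package main

import (
	"bytes"
	"errors"
)

// ControllerRevision is a simplified, hypothetical stand-in for the proposed
// API type. Data is held as raw bytes rather than runtime.RawExtension.
type ControllerRevision struct {
	Name     string
	Data     []byte
	Revision int64
}

// ErrImmutableData is a hypothetical error returned when an update attempts
// to change the Data of an existing ControllerRevision.
var ErrImmutableData = errors.New("ControllerRevision .Data is immutable")

// ValidateUpdate fails any update whose Data differs from the stored Data.
// Updates that leave Data untouched are accepted.
func ValidateUpdate(old, updated *ControllerRevision) error {
	if !bytes.Equal(old.Data, updated.Data) {
		return ErrImmutableData
	}
	return nil
}
```

Under these assumptions, an update that only touches metadata passes, while any byte-level change to `.Data` is rejected.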
## Controllers
This section is presented as a generalization of how an arbitrary controller
can use ControllerRevision to persist a history of revisions to its
specification type Objects. The technique is applicable, without loss of
generality, to the existing Kubernetes controllers that have Pod as a generated
type.
When a controller detects a revision to the target Object state of a
specification type Object it will do the following.
1. The controller will [create a snapshot](#version-snapshot-creation) of the
current target Object state.
1. The controller will [reconstruct the history](#history-reconstruction) of
revisions to the Object's target Object state.
1. The controller will test the current target Object state for
[equivalence](#version-equivalence) with all other versions in the Object's
revision history.
- If the current version is semantically equivalent to its immediate
predecessor, no update to the Object's target state has been performed.
- If the current version is equivalent to a version prior to its immediate
predecessor, this indicates a rollback.
- If the current version is not equivalent to any prior version, this
indicates an update or a roll forward.
- Controllers should use their status objects for bookkeeping with respect
to current and prior revisions.
1. The controller will
[reconcile its generated Objects](#target-object-state-reconciliation)
with the new target Object state.
1. The controller will [maintain the length of its history](#history-maintenance)
to be less than the configured limit.
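The equivalence test in the steps above can be sketched as a small classification function. The names `Change`, `Classify`, and the string-based history are hypothetical simplifications; a real controller would compare deserialized target Object states rather than strings.

```go
package main

// Change is a hypothetical classification of a detected revision.
type Change int

const (
	NoChange Change = iota // equivalent to the immediate predecessor
	Rollback               // equivalent to an older version in the history
	Update                 // not equivalent to any prior version
)

// Classify compares the current target Object state against a history that is
// ordered from oldest to newest. The equivalent function is supplied by the
// caller and stands in for a semantic equality test.
func Classify(history []string, current string, equivalent func(a, b string) bool) Change {
	if len(history) == 0 {
		return Update
	}
	last := len(history) - 1
	if equivalent(history[last], current) {
		return NoChange
	}
	for i := last - 1; i >= 0; i-- {
		if equivalent(history[i], current) {
			return Rollback
		}
	}
	return Update
}
```

An empty history is treated here as an update, which is an assumption of the sketch rather than something the proposal specifies.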
### Version Snapshot Creation
To take a snapshot of the target Object state contained in a specification type
Object, a controller will do the following.
1. The controller will serialize all of the Object's target Object state and store
the serialized representation in the ControllerRevision's `.Data`.
1. The controller will store a unique, monotonically increasing
[revision number](#revision-number-selection) in the Revision field.
1. The controller will compute the [hash](#hashing) of the
ControllerRevision's `.Data`.
1. The controller will attach a label to the ControllerRevision so that it is
selectable with a low probability of overlap.
- ControllerRefs will be used as the authoritative test for ownership.
- The specification type Object's `.Selector` should be used where
applicable.
- Alternatively, a Kind unique label may be set to the `.Name` of the
specification type Object.
1. The controller will add a ControllerRef indicating the specification type
Object as the owner of the ControllerRevision in the ControllerRevision's
`.OwnerReferences`.
1. The controller will use the hash from above, along with a user identifiable
prefix, to [generate a unique `.Name`](#unique-name-generation) for the
ControllerRevision.
- The controller should, where possible, use the `.Name` of the
specification type Object.
1. The controller will persist the ControllerRevision via the API Server.
- Note that, in practice, creation occurs concurrently with
[collision resolution](#collision-resolution).
### Revision Number Selection
We propose two methods for selecting the `.Revision` used to order a
specification type Object's revision history.
1. Set the `.Revision` field to the `.Generation` field.
- This approach has the benefit of leveraging the existing monotonically
increasing sequence generated by `.Generation` field.
- The downside of this approach is that history will not survive the
destruction of an Object.
1. Use an approach analogous to Deployment.
1. Reconstruct the Object's revision history.
1. If the history is empty, use a `.Revision` of `0`.
1. If the history is not empty, set the `.Revision` to a value greater than
the maximum of all previous `.Revision` values.
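The second, Deployment-like method can be sketched as follows. The `ControllerRevision` struct here is a simplified, hypothetical stand-in, and choosing exactly `max + 1` is one possible reading of "a value greater than the maximum".

```go
package main

// ControllerRevision is a simplified, hypothetical stand-in that carries only
// the fields needed to pick the next revision number.
type ControllerRevision struct {
	Name     string
	Revision int64
}

// NextRevision returns 0 for an empty history and otherwise one more than the
// largest Revision found in the history. The history need not be sorted.
func NextRevision(history []ControllerRevision) int64 {
	if len(history) == 0 {
		return 0
	}
	highest := history[0].Revision
	for _, rev := range history[1:] {
		if rev.Revision > highest {
			highest = rev.Revision
		}
	}
	return highest + 1
}
```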
### History Reconstruction
To reconstruct the history of a specification type Object, a controller will do
the following.
1. Select all ControllerRevision Objects labeled as described
[above](#version-snapshot-creation).
1. Filter out any ControllerRevisions that do not have a ControllerRef in their
`.OwnerReferences` indicating ownership by the Object.
1. Sort the ControllerRevisions by the `.Revision` field.
1. This produces a strictly ordered set of ControllerRevisions that comprises
the ordered revision history of the specification type Object.
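The filter-and-sort steps above can be sketched as below. Label selection is assumed to have happened already; `OwnerReference` and `ControllerRevision` are simplified, hypothetical stand-ins for the real API types, and ownership is matched on a UID string.

```go
package main

import "sort"

// OwnerReference is a simplified, hypothetical stand-in for an owner reference.
type OwnerReference struct {
	UID        string
	Controller bool
}

// ControllerRevision is a simplified, hypothetical stand-in for the proposed type.
type ControllerRevision struct {
	Name            string
	Revision        int64
	OwnerReferences []OwnerReference
}

// isControlledBy reports whether rev has a ControllerRef pointing at ownerUID.
func isControlledBy(rev ControllerRevision, ownerUID string) bool {
	for _, ref := range rev.OwnerReferences {
		if ref.Controller && ref.UID == ownerUID {
			return true
		}
	}
	return false
}

// ReconstructHistory takes revisions that were already selected by label,
// drops those not owned by ownerUID, and sorts the rest by ascending Revision.
func ReconstructHistory(selected []ControllerRevision, ownerUID string) []ControllerRevision {
	history := make([]ControllerRevision, 0, len(selected))
	for _, rev := range selected {
		if isControlledBy(rev, ownerUID) {
			history = append(history, rev)
		}
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].Revision < history[j].Revision
	})
	return history
}
```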
### History Maintenance
Controllers should be configured, either globally or on a per specification type
Object basis, to have a `RevisionHistoryLimit`. This field will indicate the
number of non-live revisions the controller should maintain in its history
for each specification type Object. Every time a controller observes a
specification type Object it will do the following.
1. The controller will
[reconstruct the Object's revision history](#history-reconstruction).
- Note that the process of reconstructing the Object's history filters any
ControllerRevisions not owned by the Object.
1. The controller will filter out any ControllerRevisions that represent a live
version.
1. If the number of remaining ControllerRevisions is greater than the configured
`RevisionHistoryLimit`, the controller will delete them, in ascending order of
their `.Revision`, until the number of remaining ControllerRevisions is equal
to the `RevisionHistoryLimit`.
This ensures that the number of recorded, non-live revisions is less than or
equal to the configured `RevisionHistoryLimit`.
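A minimal sketch of the truncation step follows. It assumes that deletion proceeds from the lowest `.Revision` upward, that live revisions are identified by name in a caller-supplied set, and that `ControllerRevision` is a simplified, hypothetical stand-in. The function only selects what to delete; issuing the deletions is left to the caller.

```go
package main

import "sort"

// ControllerRevision is a simplified, hypothetical stand-in for the proposed type.
type ControllerRevision struct {
	Name     string
	Revision int64
}

// RevisionsToDelete returns the non-live revisions that must be deleted so
// that at most limit non-live revisions remain. Revisions with the lowest
// Revision numbers are selected first. A negative limit is treated as zero.
func RevisionsToDelete(history []ControllerRevision, live map[string]bool, limit int) []ControllerRevision {
	if limit < 0 {
		limit = 0
	}
	nonLive := make([]ControllerRevision, 0, len(history))
	for _, rev := range history {
		if !live[rev.Name] {
			nonLive = append(nonLive, rev)
		}
	}
	sort.Slice(nonLive, func(i, j int) bool {
		return nonLive[i].Revision < nonLive[j].Revision
	})
	excess := len(nonLive) - limit
	if excess <= 0 {
		return nil
	}
	return nonLive[:excess]
}
```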
### Version Tracking
Controllers must track the version of the target Object state that corresponds
to their generated Objects. This information is necessary to determine which
versions are live, and to track which Objects need to be updated during a
target state update or rollback. We propose two methods that controllers may
use to track live versions and their association with generated Objects.
1. The most straightforward method is labeling. In this method the generated
Objects are labeled with the `.Name` of the ControllerRevision object that
corresponds to the version of the target Object state that was used to generate
them. As we have taken care to ensure the uniqueness of the `.Names` of the
ControllerRevisions, this approach is reasonable.
- A revision is considered to be live while any generated Object labeled
with its `.Name` is live.
- This method has the benefit of providing visibility, via the label, to
users with respect to the historical provenance of a generated Object.
- The primary drawback is the lack of support for using garbage collection
to ensure that only non-live version snapshots are collected.
1. Controllers may also use the `OwnerReferences` field of the
ControllerRevision to record all Objects that are generated from the target
Object state version represented by the ControllerRevision as its owners.
- A revision is considered to be live while any generated Object that owns
it is live.
- This method allows for the implementation of generic garbage collection.
- The primary drawback with this method is that the bookkeeping is complex,
and deciding if a generated Object corresponds to a particular revision
will require testing each Object for membership in the `OwnerReferences`
of all ControllerRevisions.
Note that, since we are labeling the generated Objects to indicate their
provenance with respect to the version of the controller's target Object state,
we are susceptible to downstream mutations by other controllers changing the
controller's product. The best we can do is guarantee that our product meets
the specification at the time of creation. If a third party mutates the product
downstream (as long as it does so in a consistent and intentional way), we
don't want to recall it and make it conform to the original specification. This
would cause the controllers to "fight" indefinitely.
At the cost of the complexity of implementing both labeling and ownership,
controllers may use a combination of both approaches to mitigate the
deficiencies of each.
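The labeling method can be sketched as follows. The label key `controller-revision-name` and the `GeneratedObject` type are hypothetical; the proposal does not fix a label key.

```go
package main

// RevisionLabel is a hypothetical label key under which a generated Object
// records the .Name of the ControllerRevision it was created from.
const RevisionLabel = "controller-revision-name"

// GeneratedObject is a simplified, hypothetical stand-in for a generated
// Object such as a Pod.
type GeneratedObject struct {
	Name   string
	Labels map[string]string
}

// LiveRevisions returns the set of ControllerRevision names referenced by at
// least one existing generated Object. Objects without the label are skipped.
func LiveRevisions(objects []GeneratedObject) map[string]bool {
	live := make(map[string]bool)
	for _, obj := range objects {
		if name, ok := obj.Labels[RevisionLabel]; ok && name != "" {
			live[name] = true
		}
	}
	return live
}
```

The resulting set can be used to decide which snapshots are live when the history is truncated.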
### Version Equivalence
When the target Object state of a specification type Object is revised, we wish
to minimize the number of mutations to generated Objects as the controller seeks
to conform the system to its target state. That is, if a generated Object
already conforms to the revised target Object state, it is imperative that we
do not mutate it.
Failure to implement this correctly could result in the simultaneous rolling
restart of every Pod in every StatefulSet and DaemonSet in the system when
additions are made to PodTemplateSpec during a master upgrade. It is therefore
necessary to determine if the current target Object state is equivalent to a
prior version.
Since we [track the version](#version-tracking) of generated Objects, this
reduces to deciding if the version of the target Object state associated with
the generated Object is equivalent to the current target Object state.
Even though [hashing](#hashing) is used to generate the `.Name` of the
ControllerRevisions used to encapsulate versions of the target Object state, as
we do not require cryptographically strong collision resistance, and given we
use a [collision resolution](#collision-resolution) technique, we can't use the
[generated names](#unique-name-generation) of ControllerRevisions to decide
equality.
We propose that two ControllerRevisions can be considered equal if their
`.Data` is equivalent, but that it is not sufficient to compare the serialized
representation of their `.Data`. Consider that the addition of new fields
to the Objects that represent the target Object state may cause the serialized
representation of those Objects to be unequal even when they are semantically
equivalent.
The controller should deserialize the values of the ControllerRevisions
representing their target Object state and perform a deep, semantic equality
test. Here all differences that do not constitute a mutation to the target
Object state are disregarded during the equivalence test.
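As a rough illustration of comparing deserialized values rather than serialized bytes, the sketch below assumes JSON-encoded `.Data` and a hypothetical `targetObjectState` type with two fields. It uses `reflect.DeepEqual`, which is stricter than a true semantic comparison (for example, it distinguishes a nil map from an empty one), so a real controller would likely need a richer equality test.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// targetObjectState is a hypothetical, heavily reduced target Object state.
// Fields unknown to this type are ignored when decoding.
type targetObjectState struct {
	Image  string            `json:"image"`
	Labels map[string]string `json:"labels,omitempty"`
}

// EquivalentData decodes both serialized snapshots and compares the decoded
// values. Differences in key order or in fields this type does not know about
// do not affect the result.
func EquivalentData(a, b []byte) (bool, error) {
	var stateA, stateB targetObjectState
	if err := json.Unmarshal(a, &stateA); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &stateB); err != nil {
		return false, err
	}
	return reflect.DeepEqual(stateA, stateB), nil
}
```

Two snapshots whose bytes differ only in key order or in an unrecognized extra field compare as equivalent here, while a changed `image` does not.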
### Target Object State Reconciliation
There are three ways for a controller to reconcile a generated Object with the
declared target Object state.
1. If the target Object state is [equivalent](#version-equivalence) to the
target Object state associated with the generated Object, the controller will
update the associated [version tracking information](#version-tracking).
1. If the Object can be updated in place to reconcile its state with the
current target Object state, a controller may update the Object in place
provided that the associated version tracking information is updated as well.
1. Otherwise, the controller must destroy the Object and recreate it from the
current target Object state.
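The three cases above can be expressed as a small decision function. The `Action` type and `ChooseAction` are hypothetical names, and the two boolean inputs stand in for checks a real controller would compute itself.

```go
package main

// Action is a hypothetical enumeration of the reconciliation outcomes.
type Action int

const (
	UpdateTrackingOnly Action = iota // states are equivalent; only refresh version tracking
	UpdateInPlace                    // mutate the generated Object and refresh version tracking
	DestroyAndRecreate               // delete the Object and recreate it from the target state
)

// ChooseAction picks a reconciliation action. Equivalence is checked first so
// that an Object that already conforms is never mutated or recreated.
func ChooseAction(equivalent, canUpdateInPlace bool) Action {
	switch {
	case equivalent:
		return UpdateTrackingOnly
	case canUpdateInPlace:
		return UpdateInPlace
	default:
		return DestroyAndRecreate
	}
}
```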
### Kubernetes Upgrades
During the upgrade process from a version of Kubernetes that does not support
controller history to a version that does, controllers that implement history
based update mechanisms may find that they have specification type Objects with
no history and with generated Objects. For instance, a StatefulSet may exist
with several Pods and no history. We defer requirements for handling history
initialization to the individual proposals pertaining to those controllers'
update mechanisms. However, implementors should take note of the following.
1. If the history of an Object is not initialized, controllers should
continue to (re)create generated Objects based on the current target Object
state.
1. The history should be initialized on the first mutation to the specification
type Object for which the history will be generated.
1. After the history has been initialized, any generated Objects that have no
indication of the revision from which they were generated may be treated as if
they have a nil revision. That is, without respect to the method of
[version tracking](#version-tracking) used, the generated Objects may be
treated as if they have a version that corresponds to no revision, and the
controller may proceed to
[reconcile their state](#target-object-state-reconciliation) as appropriate to
the internal implementation.
## Kubectl
Modifications to kubectl to leverage controller history are an optional
extension. Users can trigger rolling updates and rollbacks by modifying their
manifests and using `kubectl apply`. Controllers will be able to detect
revisions to their target Object state and perform
[reconciliation](#target-object-state-reconciliation) as necessary.
### Viewing History
Users can view a controller's revision history with the following command.
```bash
> kubectl rollout history
```
To view the details of the revision indicated by `<revision>`, users can use
the following command.
```bash
> kubectl rollout history --revision <revision>
```
### Rollback
For future work, `kubectl rollout undo` can be implemented in the general case
as an extension of the [above](#viewing-history).
```bash
> kubectl rollout undo
```
Here `kubectl rollout undo` simply uses strategic merge patch to apply the state
contained at a particular revision.
## Tests
1. Controllers can create a ControllerRevision containing a revision of their
target Object state.
1. Controllers can reconstruct their revision history.
1. Controllers can't update a ControllerRevision's `.Data`.
1. Controllers can delete a ControllerRevision to maintain their history with
respect to the configured `RevisionHistoryLimit`.
## Appendix
### Hashing
We will require a CRHF (collision resistant hash function), but, as we expect
no adversaries, such a function need not be resistant to pre-image and
second pre-image attacks.
As the property of interest is primarily collision resistance, and as we
provide a method of [collision resolution](#collision-resolution), both
cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2),
and non-cryptographic functions, such as Fowler-Noll-Vo (FNV) are applicable.
### Collision Resolution
As the function selected for hashing may not be cryptographically strong and may
produce collisions, we need a method for collision resolution. To demonstrate
its feasibility, we construct such a scheme here. However, this proposal does
not mandate its use.
Given a hash function with output size `HashSize` defined
as `func H(s string) [HashSize]byte`, in order to resolve collisions we
define a new function `func H'(s string, n int) [HashSize]byte` where `H'`
returns the result of invoking `H` on the concatenation of `s` with the string
value of `n`. We define a third function
`func H''(s string, exists func (string) bool)(int,[HashSize]byte)`. `H''`
will start with `n := 0` and compute `s' := H'(s,n)`, incrementing `n` when
`exists(s')` returns true, until `exists(s')` returns false. After this it will
return `n,s'`.
For our purposes, the implementation of the `exists` function will attempt to
create a ControllerRevision with the candidate `.Name` via the API Server using
[unique name generation](#unique-name-generation). If creation fails due to a
conflict, the name is already taken and the function returns true; if creation
succeeds, it returns false.
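The scheme can be sketched with 32-bit FNV-1a from Go's standard library. For brevity the sketch returns `uint32` instead of `[HashSize]byte`, and `exists` is an arbitrary caller-supplied predicate over hash values; both are simplifications of the signatures given above. `HPrime` and `HDoublePrime` are illustrative names for `H'` and `H''`.

```go
package main

import (
	"hash/fnv"
	"strconv"
)

// H hashes s with 32-bit FNV-1a.
func H(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error.
	return h.Sum32()
}

// HPrime hashes the concatenation of s and the decimal string value of n.
func HPrime(s string, n int) uint32 {
	return H(s + strconv.Itoa(n))
}

// HDoublePrime starts at n = 0 and increments n until exists reports that the
// candidate hash is not yet taken. It returns the final n and the candidate.
// It does not terminate if exists always returns true.
func HDoublePrime(s string, exists func(uint32) bool) (int, uint32) {
	n := 0
	for {
		candidate := HPrime(s, n)
		if !exists(candidate) {
			return n, candidate
		}
		n++
	}
}
```

A caller might back `exists` with an in-memory set in tests and with a create-and-check-for-conflict call in a real controller.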
### Unique Name Generation
We can use our [hash function](#hashing) and
[collision resolution](#collision-resolution) scheme to generate a system
wide unique identifier for an Object based on a deterministic non-unique prefix
and a serialized representation of the Object. Kubernetes Objects' `.Name`
fields must conform to a DNS subdomain. Therefore, the total length of the
unique identifier must not exceed 255, and in practice 253, characters. We can
generate a unique identifier that meets this constraint by selecting a hash
function such that the output length is equal to `253-len(prefix)` and applying
our [hash](#hashing) function and [collision-resolution](#collision-resolution)
scheme to the serialized representation of the Object's data. The unique hash
and integer can be combined to produce a unique suffix for the Object's `.Name`.
1. We must also ensure that the unique name does not contain any bad words.
1. We may also wish to spend additional characters to prettify the generated
name for readability.
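One way to combine a prefix, a hash, and the collision counter into a name is sketched below. The `prefix-hash-n` layout, the 8-digit hexadecimal encoding of a 32-bit hash, and the `UniqueName` helper are all assumptions of this sketch; the proposal does not prescribe a format. Bad-word filtering and prettification are not shown.

```go
package main

import "fmt"

// maxNameLength is the practical length limit for a DNS subdomain name.
const maxNameLength = 253

// UniqueName joins prefix, a 32-bit hash rendered as 8 lowercase hex digits,
// and the collision counter n. It returns an error if the result would exceed
// maxNameLength. It does not validate that prefix is itself DNS-safe.
func UniqueName(prefix string, hash uint32, n int) (string, error) {
	name := fmt.Sprintf("%s-%08x-%d", prefix, hash, n)
	if len(name) > maxNameLength {
		return "", fmt.Errorf("generated name has %d characters, limit is %d", len(name), maxNameLength)
	}
	return name, nil
}
```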