diff --git a/contributors/design-proposals/controller_history.md b/contributors/design-proposals/controller_history.md
new file mode 100644
index 000000000..fbf89b702
--- /dev/null
+++ b/contributors/design-proposals/controller_history.md
@@ -0,0 +1,462 @@
+# Controller History
+
+**Author**: kow3ns@
+
+**Status**: Proposal
+
+## Abstract
+In Kubernetes, in order to update and roll back the configuration and binary
+images of controller-managed Pods, users mutate DaemonSet, StatefulSet,
+and Deployment Objects, and the corresponding controllers attempt to transition
+the current state of the system to the new declared target state.
+
+To facilitate update and rollback for these controllers, and to provide a
+primitive that third party controllers can build on, we propose a mechanism
+that allows controllers to manage a bounded history of revisions to the declared
+target state of their generated Objects.
+
+## Affected Components
+
+1. API Machinery
+1. API Server
+1. Kubectl
+1. Controllers that utilize the feature
+
+## Requirements
+
+1. History is a collection of points in time, and each point in time must be
+represented by its own Object. While it is tempting to aggregate all of an
+Object's history into a single container Object, experience with Borg and Mesos
+has taught us that this inevitably leads to exhausting the single Object size
+limit of the system's storage backend.
+1. We must be able to select the Objects that contain point in time snapshots
+of versions of an Object to reconstruct the Object's history.
+1. History respects causality. The Object type used to store point in time
+snapshots must be strictly ordered with respect to creation. CreationTimestamp
+should not be used, as this is susceptible to clock skew.
+1. History must not be revisionist. Once an Object corresponding to a version
+of a controller's target state is created, it cannot be mutated.
+1. Controller history requires only current events.
+Storing an exhaustive
+history of all revisions to all controllers is out of scope for our purposes,
+and it can be solved by applying a version control system to manifests. Internal
+revision history must only store revisions to the controller's target state that
+correspond to live Objects and (potentially) a small, configurable number of
+prior revisions.
+1. History is scale invariant. A revision to a controller is a modification
+that changes the specification of the Objects it generates. Changing the
+cardinality of those Objects is a scaling operation and does not constitute a
+revision.
+
+## Terminology
+The following terminology is used throughout the rest of this proposal. We
+make its meaning explicit here.
+- The specification type of a controller is the type that contains the
+specification for the Objects generated by the controller.
+  - For example, the specification types for the ReplicaSet, DaemonSet,
+    and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec,
+    and StatefulSetSpec respectively.
+- The generated type(s) for a controller is/are the type of the Object(s)
+generated by the controller.
+  - Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet
+    controllers.
+  - PersistentVolumeClaim is also a generated type for the StatefulSet
+    controller.
+- The current state of a controller is the union of the states of its generated
+Objects along with its status.
+  - For ReplicaSet, DaemonSet, and StatefulSet, the current state of the
+    corresponding controllers can be derived from the Pods they contain and the
+    ReplicaSetStatus, DaemonSetStatus, and StatefulSetStatus objects
+    respectively.
+- For all specification type Objects for controllers, the target state is the
+set of fields in the Object that determine the state to which the controller
+attempts to evolve the system.
+  - This may not necessarily be all fields of the Object.
+  - For example, for the StatefulSet controller `.Spec.Template`,
+    `.Spec.Replicas`, and `.Spec.VolumeClaims` determine the target state. The
+    controller "wants" to create `.Spec.Replicas` Pods generated from
+    `.Spec.Template` and `.Spec.VolumeClaims`.
+- The target Object state is the subset of the target state necessary to create
+Objects of the generated type(s).
+  - To make this concrete, for the StatefulSet controller `.Spec.Template`
+    and `.Spec.VolumeClaims` are the target Object state. This is enough
+    information for the controller to generate Pods and corresponding PVCs.
+- If a version of the target Object state was used to generate an Object that
+has not yet been deleted, we refer to the version, and any snapshots of the
+version, as live.
+
+## API Objects
+
+Kubernetes controllers already persist their current and target states to the
+API Server. In order to maintain a history of revisions to specification type
+Objects, we only need to persist snapshots of the target Object states
+contained in the specification type when they are revised.
+
+One approach would be to, for every specification type, have a
+corresponding History type. For example, we could introduce a StatefulSetHistory
+object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims.
+The StatefulSet controller could use this object to store point in time
+snapshots of versions of StatefulSetSpecs. However, this requires that we
+introduce a new History Kind for all current and future controllers. It has the
+benefit of type safety, but, for this benefit, we trade generality.
+
+Another approach would be to use PodTemplate objects. This mechanism provides
+the desired generality, but it only provides for the recording of versions of
+PodTemplateSpecs (e.g., for StatefulSet, we cannot use PodTemplates to
+record revisions to PersistentVolumeClaims).
+Also, it introduces the potential
+for overlapping histories for two Objects of different Kinds, with the same
+`.Name` in the same Namespace. Lastly, it constrains the PodTemplate Kind from
+evolving to fulfill its original intention.
+
+We propose an approach that is analogous to the approach taken by the
+[Mesos](http://mesos.apache.org/) community. Mesos frameworks, which are in some
+ways like Kubernetes controllers, are responsible for checkpointing,
+persisting, and recovering their own state. This problem is so common that
+Mesos provides a ["State Abstraction"](https://github.com/apache/mesos/blob/master/include/mesos/state/state.hpp)
+that allows frameworks to persist their state in either ZooKeeper or the
+Mesos Replicated Log (a Multi-Paxos based state machine used by the Mesos
+Masters). This State Abstraction is a mutable, durable dictionary where keys
+and values are opaque strings. As controllers only need the capability to
+persist an immutable point in time snapshot of target Object states to
+implement a revision history, we propose to use the ControllerRevision object
+for this purpose.
+
+``` golang
+// ControllerRevision implements an immutable snapshot of state data. Clients
+// are responsible for serializing and deserializing the objects that contain
+// their internal state.
+// Once a ControllerRevision has been successfully created, it cannot be updated.
+// The API Server will fail validation of all requests that attempt to mutate
+// the Data field. ControllerRevisions may, however, be deleted.
+type ControllerRevision struct {
+	metav1.TypeMeta
+	// +optional
+	metav1.ObjectMeta
+	// Data contains the serialized state.
+	Data runtime.RawExtension
+	// Revision indicates the revision of the state represented by Data.
+	Revision int64
+}
+```
+
+## API Server
+The API Server must support the creation and deletion of ControllerRevision
+objects.
+As we have no mechanism for declarative immutability, the API server
+must fail any update request that updates the `.Data` field of a
+ControllerRevision Object.
+
+## Controllers
+This section is presented as a generalization of how an arbitrary controller
+can use ControllerRevision to persist a history of revisions to its
+specification type Objects. The technique is applicable, without loss of
+generality, to the existing Kubernetes controllers that have Pod as a generated
+type.
+
+When a controller detects a revision to the target Object state of a
+specification type Object it will do the following.
+
+1. The controller will [create a snapshot](#version-snapshot-creation) of the
+current target Object state.
+1. The controller will [reconstruct the history](#history-reconstruction) of
+revisions to the Object's target Object state.
+1. The controller will test the current target Object state for
+[equivalence](#version-equivalence) with all other versions in the Object's
+revision history.
+    - If the current version is semantically equivalent to its immediate
+      predecessor, no update to the Object's target state has been performed.
+    - If the current version is equivalent to a version prior to its immediate
+      predecessor, this indicates a rollback.
+    - If the current version is not equivalent to any prior version, this
+      indicates an update or a roll forward.
+    - Controllers should use their status objects for bookkeeping with respect
+      to current and prior revisions.
+1. The controller will
+[reconcile its generated Objects](#target-object-state-reconciliation)
+with the new target Object state.
+1. The controller will [maintain the length of its history](#history-maintenance)
+to be no greater than the configured limit.
+
+### Version Snapshot Creation
+To take a snapshot of the target Object state contained in a specification type
+Object, a controller will do the following.
+
+1.
+   The controller will serialize all of the Object's target Object state and store
+the serialized representation in the ControllerRevision's `.Data`.
+1. The controller will store a unique, monotonically increasing
+[revision number](#revision-number-selection) in the Revision field.
+1. The controller will compute the [hash](#hashing) of the
+ControllerRevision's `.Data`.
+1. The controller will attach a label to the ControllerRevision so that it is
+selectable with a low probability of overlap.
+    - ControllerRefs will be used as the authoritative test for ownership.
+    - The specification type Object's `.Selector` should be used where
+      applicable.
+    - Alternatively, a Kind-unique label may be set to the `.Name` of the
+      specification type Object.
+1. The controller will add a ControllerRef indicating the specification type
+Object as the owner of the ControllerRevision in the ControllerRevision's
+`.OwnerReferences`.
+1. The controller will use the hash from above, along with a user identifiable
+prefix, to [generate a unique `.Name`](#unique-name-generation) for the
+ControllerRevision.
+    - The controller should, where possible, use the `.Name` of the
+      specification type Object.
+1. The controller will persist the ControllerRevision via the API Server.
+    - Note that, in practice, creation occurs concurrently with
+      [collision resolution](#collision-resolution).
+
+### Revision Number Selection
+We propose two methods for selecting the `.Revision` used to order a
+specification type Object's revision history.
+
+1. Set the `.Revision` field to the `.Generation` field.
+    - This approach has the benefit of leveraging the existing monotonically
+      increasing sequence generated by the `.Generation` field.
+    - The downside of this approach is that history will not survive the
+      destruction of an Object.
+1. Use an approach analogous to Deployment.
+    1. Reconstruct the Object's revision history.
+    1. If the history is empty, use a `.Revision` of `0`.
+    1.
+       If the history is not empty, set the `.Revision` to a value greater than
+       the maximum value of all previous `.Revisions`.
+
+### History Reconstruction
+To reconstruct the history of a specification type Object, a controller will do
+the following.
+
+1. Select all ControllerRevision Objects labeled as described
+[above](#version-snapshot-creation).
+1. Filter out any ControllerRevisions that do not have a ControllerRef in their
+`.OwnerReferences` indicating ownership by the Object.
+1. Sort the ControllerRevisions by the `.Revision` field.
+1. This produces a strictly ordered set of ControllerRevisions that comprises
+the ordered revision history of the specification type Object.
+
+### History Maintenance
+Controllers should be configured, either globally or on a per specification type
+Object basis, to have a `RevisionHistoryLimit`. This field will indicate the
+number of non-live revisions the controller should maintain in its history
+for each specification type Object. Every time a controller observes a
+specification type Object it will do the following.
+
+1. The controller will
+[reconstruct the Object's revision history](#history-reconstruction).
+    - Note that the process of reconstructing the Object's history filters out
+      any ControllerRevisions not owned by the Object.
+1. The controller will filter out any ControllerRevisions that represent a live
+version.
+1. If the number of remaining ControllerRevisions is greater than the configured
+`RevisionHistoryLimit`, the controller will delete them, in order of their
+`.Revision` values, until the number
+of remaining ControllerRevisions is equal to the `RevisionHistoryLimit`.
+
+This ensures that the number of recorded, non-live revisions is less than or
+equal to the configured `RevisionHistoryLimit`.
+
+### Version Tracking
+Controllers must track the version of the target Object state that corresponds
+to their generated Objects.
+This information is necessary to determine which
+versions are live, and to track which Objects need to be updated during a
+target state update or rollback. We propose two methods that controllers may
+use to track live versions and their association with generated Objects.
+
+1. The most straightforward method is labeling. In this method the generated
+Objects are labeled with the `.Name` of the ControllerRevision object that
+corresponds to the version of the target Object state that was used to generate
+them. As we have taken care to ensure the uniqueness of the `.Names` of the
+ControllerRevisions, this approach is reasonable.
+    - A revision is considered to be live while any generated Object labeled
+      with its `.Name` is live.
+    - This method has the benefit of providing visibility, via the label, to
+      users with respect to the historical provenance of a generated Object.
+    - The primary drawback is the lack of support for using garbage collection
+      to ensure that only non-live version snapshots are collected.
+1. Controllers may also use the `OwnerReferences` field of the
+ControllerRevision to record all Objects that are generated from the target
+Object state version represented by the ControllerRevision as its owners.
+    - A revision is considered to be live while any generated Object that owns
+      it is live.
+    - This method allows for the implementation of generic garbage collection.
+    - The primary drawback with this method is that the bookkeeping is complex,
+      and deciding if a generated Object corresponds to a particular revision
+      will require testing each Object for membership in the `OwnerReferences`
+      of all ControllerRevisions.
+
+Note that, since we are labeling the generated Objects to indicate their
+provenance with respect to the version of the controller's target Object state,
+we are susceptible to downstream mutations by other controllers changing the
+controller's product.
+The best we can do is guarantee that our product meets
+the specification at the time of creation. If a third party mutates the product
+downstream (as long as it does so in a consistent and intentional way), we
+don't want to recall it and make it conform to the original specification. This
+would cause the controllers to "fight" indefinitely.
+
+At the cost of the complexity of implementing both labeling and ownership,
+controllers may use a combination of both approaches to mitigate the
+deficiencies of each.
+
+### Version Equivalence
+When the target Object state of a specification type Object is revised, we wish
+to minimize the number of mutations to generated Objects as the controller seeks
+to conform the system to its target state. That is, if a generated Object
+already conforms to the revised target Object state, it is imperative that we
+do not mutate it.
+
+Failure to implement this correctly could result in the simultaneous rolling
+restart of every Pod in every StatefulSet and DaemonSet in the system when
+additions are made to PodTemplateSpec during a master upgrade. It is therefore
+necessary to determine if the current target Object state is equivalent to a
+prior version.
+
+Since we [track the version](#version-tracking) of generated Objects, this
+reduces to deciding if the version of the target Object state associated with
+the generated Object is equivalent to the current target Object state.
+Even though [hashing](#hashing) is used to generate the `.Name` of the
+ControllerRevisions used to encapsulate versions of the target Object state, as
+we do not require cryptographically strong collision resistance, and given we
+use a [collision resolution](#collision-resolution) technique, we can't use the
+[generated names](#unique-name-generation) of ControllerRevisions to decide
+equality.
+
+We propose that two ControllerRevisions can be considered equal if their
+`.Data` is equivalent, but that it is not sufficient to compare the serialized
+representation of their `.Data`. Consider that the addition of new fields
+to the Objects that represent the target Object state may cause the serialized
+representation of those Objects to be unequal even when they are semantically
+equivalent.
+
+The controller should deserialize the values of the ControllerRevisions
+representing their target Object state and perform a deep, semantic equality
+test. Here all differences that do not constitute a mutation to the target
+Object state are disregarded during the equivalence test.
+
+### Target Object State Reconciliation
+There are three ways for a controller to reconcile a generated Object with the
+declared target Object state.
+
+1. If the target Object state is [equivalent](#version-equivalence) to the
+target Object state associated with the generated Object, the controller will
+update the associated [version tracking information](#version-tracking).
+1. If the Object can be updated in place to reconcile its state with the
+current target Object state, a controller may update the Object in place
+provided that the associated version tracking information is updated as well.
+1. Otherwise, the controller must destroy the Object and recreate it from the
+current target Object state.
+
+### Kubernetes Upgrades
+During the upgrade process from a version of Kubernetes that does not support
+controller history to a version that does, controllers that implement history
+based update mechanisms may find that they have specification type Objects with
+no history and with generated Objects. For instance, a StatefulSet may exist
+with several Pods and no history. We defer requirements for handling history
+initialization to the individual proposals pertaining to those controllers'
+update mechanisms. However, implementors should take note of the following.
+
+1.
+   If the history of an Object is not initialized, controllers should
+continue to (re)create generated Objects based on the current target Object
+state.
+1. The history should be initialized on the first mutation to the specification
+type Object for which the history will be generated.
+1. After the history has been initialized, any generated Objects that have no
+indication of the revision from which they were generated may be treated as if
+they have a nil revision. That is, without respect to the method of
+[version tracking](#version-tracking) used, the generated Objects may be
+treated as if they have a version that corresponds to no revision, and the
+controller may proceed to
+[reconcile their state](#target-object-state-reconciliation) as appropriate to
+the internal implementation.
+
+## Kubectl
+
+Modifications to kubectl to leverage controller history are an optional
+extension. Users can trigger rolling updates and rollbacks by modifying their
+manifests and using `kubectl apply`. Controllers will be able to detect
+revisions to their target Object state and perform
+[reconciliation](#target-object-state-reconciliation) as necessary.
+
+### Viewing History
+
+Users can view a controller's revision history with the following command.
+
+```bash
+> kubectl rollout history
+```
+
+To view the details of a particular revision, users can use
+the following command.
+
+```bash
+> kubectl rollout history --revision
+```
+
+### Rollback
+
+For future work, `kubectl rollout undo` can be implemented in the general case
+as an extension of the [above](#viewing-history).
+
+```bash
+> kubectl rollout undo
+```
+
+Here `kubectl rollout undo` simply uses strategic merge patch to apply the state
+contained at a particular revision.
+
+## Tests
+
+1. Controllers can create a ControllerRevision containing a revision of their
+target Object state.
+1. Controllers can reconstruct their revision history.
+1. Controllers can't update a ControllerRevision's `.Data`.
+1.
+   Controllers can delete a ControllerRevision to maintain their history with
+respect to the configured `RevisionHistoryLimit`.
+
+## Appendix
+
+### Hashing
+We will require a CRHF (collision resistant hash function), but, as we expect
+no adversaries, such a function need not be resistant to pre-image and
+second pre-image attacks.
+As the property of interest is primarily collision resistance, and as we
+provide a method of [collision resolution](#collision-resolution), both
+cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2),
+and non-cryptographic functions, such as Fowler-Noll-Vo (FNV), are applicable.
+
+### Collision Resolution
+As the function selected for hashing may not be cryptographically strong and may
+produce collisions, we need a method for collision resolution. To demonstrate
+its feasibility, we construct such a scheme here. However, this proposal does
+not mandate its use.
+
+Given a hash function with output size `HashSize` defined
+as `func H(s string) [HashSize]byte`, in order to resolve collisions we
+define a new function `func H'(s string, n int) [HashSize]byte` where `H'`
+returns the result of invoking `H` on the concatenation of `s` with the string
+value of `n`. We define a third function
+`func H''(s string, exists func (string) bool)(int,[HashSize]byte)`. `H''`
+will start with `n := 0` and compute `s' := H'(s,n)`, incrementing `n` when
+`exists(s')` returns true, until `exists(s')` returns false. After this it will
+return `n,s'`.
+
+For our purposes, the implementation of the `exists` function will attempt to
+create a ControllerRevision via the API Server, with its `.Name` produced by
+[unique name generation](#unique-name-generation). If creation fails due to a
+conflict, the name is already taken and the function returns true; otherwise
+it returns false.
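
The scheme above can be sketched in Go as follows. This is a minimal illustration, not part of the proposal: the names `HPrime` and `HDoublePrime` stand in for `H'` and `H''`, 32-bit FNV-1a is an arbitrary choice of `H`, the candidate hash is hex encoded before being passed to `exists`, and the in-memory `exists` function in `main` is a hypothetical substitute for the create call against the API Server.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"strconv"
)

// HashSize is the output size in bytes of the illustrative hash function.
const HashSize = 4

// H hashes s with 32-bit FNV-1a (an assumption; the proposal allows any
// sufficiently collision resistant function).
func H(s string) [HashSize]byte {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error.
	var out [HashSize]byte
	copy(out[:], h.Sum(nil))
	return out
}

// HPrime is H': H applied to s concatenated with the string value of n.
func HPrime(s string, n int) [HashSize]byte {
	return H(s + strconv.Itoa(n))
}

// HDoublePrime is H'': starting at n = 0, it increments n while exists
// reports that the hex encoded candidate is taken, then returns n and the
// first free candidate.
func HDoublePrime(s string, exists func(string) bool) (int, [HashSize]byte) {
	for n := 0; ; n++ {
		candidate := HPrime(s, n)
		if !exists(hex.EncodeToString(candidate[:])) {
			return n, candidate
		}
	}
}

func main() {
	// Hypothetical stand-in for the API Server: claiming a free name
	// succeeds (returns false), claiming a taken name conflicts (returns true).
	taken := map[string]bool{}
	exists := func(name string) bool {
		if taken[name] {
			return true
		}
		taken[name] = true
		return false
	}

	n1, h1 := HDoublePrime("serialized-state", exists)
	n2, h2 := HDoublePrime("serialized-state", exists)
	fmt.Printf("first:  n=%d hash=%x\n", n1, h1)
	fmt.Printf("second: n=%d hash=%x\n", n2, h2)
}
```

Folding creation into `exists`, as the text describes, means the check and the claim happen in one step, so two callers cannot both observe the same name as free.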
+
+### Unique Name Generation
+We can use our [hash function](#hashing) and
+[collision resolution](#collision-resolution) scheme to generate a system
+wide unique identifier for an Object based on a deterministic non-unique prefix
+and a serialized representation of the Object. Kubernetes Objects' `.Name`
+fields must conform to a DNS subdomain. Therefore, the total length of the
+unique identifier must not exceed 255, and in practice 253, characters. We can
+generate a unique identifier that meets this constraint by selecting a hash
+function such that the output length is equal to `253-len(prefix)` and applying
+our [hash](#hashing) function and [collision-resolution](#collision-resolution)
+scheme to the serialized representation of the Object's data. The unique hash
+and integer can be combined to produce a unique suffix for the Object's `.Name`.
+
+1. We must also ensure that the unique name does not contain any bad words.
+1. We may also wish to spend additional characters to prettify the generated
+name for readability.
+
+
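
As a rough illustration of the length budget described above, the following Go sketch builds a name of the form `<prefix>-<hash>-<n>`. Everything here is an assumption made for the example rather than something the proposal specifies: the function name `uniqueName`, the use of SHA-256, the hex encoding, the `-` separators, truncating the digest to fit (instead of selecting a hash with exactly the right output length), and the minimum of 8 hash characters.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// maxNameLen is the practical DNS subdomain limit cited in the text.
const maxNameLen = 253

// uniqueName combines a prefix, a hash of the data, and the collision
// counter n into a name no longer than maxNameLen characters.
func uniqueName(prefix string, data []byte, n int) (string, error) {
	counter := strconv.Itoa(n)

	// Hash the data concatenated with the string value of n, as H' does.
	input := append(append([]byte(nil), data...), counter...)
	sum := sha256.Sum256(input)
	digest := hex.EncodeToString(sum[:]) // 64 lowercase hex characters

	// Characters left for the digest after the prefix, two separators,
	// and the counter.
	budget := maxNameLen - len(prefix) - len(counter) - 2
	if budget < 8 { // arbitrary floor chosen for this sketch
		return "", fmt.Errorf("prefix of length %d leaves %d characters for the hash", len(prefix), budget)
	}
	if budget < len(digest) {
		digest = digest[:budget]
	}
	return prefix + "-" + digest + "-" + counter, nil
}

func main() {
	name, err := uniqueName("web", []byte(`{"image":"nginx"}`), 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(name), name)
}
```

Truncating the digest weakens collision resistance, which is acceptable here only because the collision resolution counter is still available to disambiguate names.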