Merge pull request #477 from kargakis/alternative-deployment-proposal
Refine the Deployment proposal and switch hashing algorithm
commit 7bcff32eca

@@ -1,16 +1,24 @@

# Deployment

Authors:

- Brian Grant (@bgrant0607)
- Clayton Coleman (@smarterclayton)
- Dan Mace (@ironcladlou)
- David Oppenheimer (@davidopp)
- Janet Kuo (@janetkuo)
- Michail Kargakis (@kargakis)
- Nikhil Jindal (@nikhiljindal)

## Abstract

A proposal for implementing a new resource - Deployment - which will enable
declarative config updates for ReplicaSets. Users will be able to create a
Deployment, which will spin up a ReplicaSet to bring up the desired Pods.
Users can also target the Deployment to an existing ReplicaSet, either by
rolling back an existing Deployment or by creating a new Deployment that can
adopt an existing ReplicaSet. The exact mechanics of replacement depend on
the DeploymentStrategy chosen by the user. DeploymentStrategies are explained
in detail in a later section.

## Implementation

@@ -33,10 +41,10 @@ type Deployment struct {

```go
type DeploymentSpec struct {
  // Number of desired pods. This is a pointer to distinguish between explicit
  // zero and not specified. Defaults to 1.
  Replicas *int32

  // Label selector for pods. Existing ReplicaSets whose pods are
  // selected by this will be scaled down. New ReplicaSets will be
  // created with this selector, with a unique label `pod-template-hash`.
  // If Selector is empty, it is defaulted to the labels present on the Pod template.
  Selector map[string]string
```

@@ -46,14 +54,17 @@ type DeploymentSpec struct {

```go
  // The deployment strategy to use to replace existing pods with new ones.
  Strategy DeploymentStrategy

  // Minimum number of seconds for which a newly created pod should be ready
  // without any of its container crashing, for it to be considered available.
  // Defaults to 0 (pod will be considered available as soon as it is ready)
  MinReadySeconds int32
}

type DeploymentStrategy struct {
  // Type of deployment. Can be "Recreate" or "RollingUpdate".
  Type DeploymentStrategyType

  // Rolling update config params. Present only if DeploymentStrategyType =
  // RollingUpdate.
  RollingUpdate *RollingUpdateDeploymentStrategy
```

@@ -65,7 +76,8 @@ const (

```go
  // Kill all existing pods before creating new ones.
  RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate"

  // Replace the old ReplicaSets by a new one using a rolling update, i.e. gradually
  // scale down the old ReplicaSets and scale up the new one.
  RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate"
)
```

@@ -94,20 +106,20 @@ type RollingUpdateDeploymentStrategy struct {

```go
  // new RC can be scaled up further, ensuring that total number of pods running
  // at any time during the update is at most 130% of original pods.
  MaxSurge IntOrString
}

type DeploymentStatus struct {
  // Total number of ready pods targeted by this deployment (this
  // includes both the old and new pods).
  Replicas int32

  // Total number of new ready pods with the desired template spec.
  UpdatedReplicas int32

  // Monotonically increasing counter that tracks hash collisions for
  // the Deployment. Used as a collision avoidance mechanism by the
  // Deployment controller.
  Uniquifier *int64
}
```

@@ -116,38 +128,42 @@ type DeploymentStatus struct {

#### Deployment Controller

The DeploymentController will process Deployments and create, update, and delete
ReplicaSets as needed. For each creation or update of a Deployment, it will:

1. Find all RSs (ReplicaSets) whose label selector is a superset of DeploymentSpec.Selector.
   - For now, we will do this in the client - list all RSs and then filter the
     ones we want. Eventually, we want to expose this in the API.
2. The new RS can have the same selector as the old RS and hence we add a unique
   selector to all these RSs (and the corresponding label to their pods) to ensure
   that they do not select the newly created pods (or old pods get selected by the
   new RS).
   - The label key will be "pod-template-hash".
   - The label value will be the hash of {podTemplateSpec+uniquifier}, where podTemplateSpec
     is the one that the new RS uses and uniquifier is a counter in the DeploymentStatus
     that increments every time a [hash collision](#hashing-collisions) happens (hash
     collisions should be rare with fnv).
   - If the RSs and pods don't already have this label and selector:
     - We will first add this to RS.PodTemplateSpec.Metadata.Labels for all RSs to
       ensure that all new pods that they create will have this label.
     - Then we will add this label to their existing pods.
     - Eventually we flip the RS selector to use the new label.
   This process can potentially be abstracted to a new endpoint for controllers [1].
3. Find if there exists an RS for which the value of the "pod-template-hash" label
   is the same as the hash of DeploymentSpec.PodTemplateSpec. If it exists already, then
   this is the RS that will be ramped up. If there is no such RS, then we create
   a new one using DeploymentSpec and then add a "pod-template-hash" label
   to it. The size of the new RS depends on the DeploymentStrategyType used.
4. Scale up the new RS and scale down the old ones as per the DeploymentStrategy.
   Raise events appropriately (both in case of failure and success).
5. Go back to step 1 unless the new RS has been ramped up to the desired replicas
   and the old RSs have been ramped down to 0.
6. Clean up old RSs as per revisionHistoryLimit.

DeploymentController is stateless so that it can recover in case it crashes during a deployment.

[1] See https://github.com/kubernetes/kubernetes/issues/36897

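To make the steps above concrete, here is a minimal, self-contained sketch of one
reconciliation pass. The types and the `hashOf` helper are hypothetical stand-ins for
the real API objects and for the fnv-based hashing described in the Hashing collisions
section, and the scaling here jumps straight to the target sizes instead of honoring
the DeploymentStrategy.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the API types used in this proposal.
type PodTemplateSpec struct {
	Labels map[string]string
	Image  string
}

type ReplicaSet struct {
	Name     string
	Selector map[string]string
	Template PodTemplateSpec
	Replicas int32
}

type Deployment struct {
	Name       string
	Replicas   int32
	Template   PodTemplateSpec
	Uniquifier int64
}

// hashOf stands in for hashing {podTemplateSpec+uniquifier}; see the
// "Hashing collisions" section for an fnv-based version.
func hashOf(t PodTemplateSpec, uniquifier int64) string {
	return fmt.Sprintf("%s-%d", t.Image, uniquifier)
}

// reconcile performs one pass of steps 1-5: find or create the new RS by its
// pod-template-hash, then scale it up and scale the old RSs down. A real
// controller would do this gradually, as dictated by the DeploymentStrategy.
func reconcile(d *Deployment, rss []*ReplicaSet) []*ReplicaSet {
	hash := hashOf(d.Template, d.Uniquifier)

	var newRS *ReplicaSet
	for _, rs := range rss {
		if rs.Selector["pod-template-hash"] == hash { // step 3: existing RS matches
			newRS = rs
			break
		}
	}
	if newRS == nil { // step 3: no match, create a new RS carrying the hash label
		newRS = &ReplicaSet{
			Name:     fmt.Sprintf("%s-%s", d.Name, hash),
			Selector: map[string]string{"pod-template-hash": hash},
			Template: d.Template,
		}
		rss = append(rss, newRS)
	}

	newRS.Replicas = d.Replicas // step 4: scale up the new RS
	for _, rs := range rss {
		if rs != newRS {
			rs.Replicas = 0 // step 4: scale down the old RSs
		}
	}
	return rss
}

func main() {
	d := &Deployment{Name: "web", Replicas: 3, Template: PodTemplateSpec{Image: "nginx:1.9"}}
	for _, rs := range reconcile(d, nil) {
		fmt.Println(rs.Name, rs.Replicas)
	}
}
```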

### MinReadySeconds

We will implement MinReadySeconds using the Ready condition in Pod. We will add
LastTransitionTime to PodCondition.

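As a rough sketch of the intended semantics (not the controller's actual code, and
ignoring the container-crash condition), a pod only counts as available once its Ready
condition has held for at least MinReadySeconds:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for the relevant part of PodStatus.
type PodCondition struct {
	Type               string // e.g. "Ready"
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// isAvailable reports whether a pod whose Ready condition is c should be
// considered available, given the Deployment's MinReadySeconds.
func isAvailable(c PodCondition, minReadySeconds int32, now time.Time) bool {
	if c.Type != "Ready" || c.Status != "True" {
		return false
	}
	readyFor := now.Sub(c.LastTransitionTime)
	return readyFor >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	cond := PodCondition{Type: "Ready", Status: "True", LastTransitionTime: time.Now().Add(-5 * time.Second)}
	fmt.Println(isAvailable(cond, 10, time.Now())) // false: ready for only ~5s
	fmt.Println(isAvailable(cond, 3, time.Now()))  // true: ready for longer than 3s
}
```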

@@ -163,52 +179,71 @@ LastTransitionTime to PodCondition.

### Updating

Users can update an ongoing Deployment before it is completed.
In this case, the existing rollout will be stalled and the new one will
begin.
For example, consider the following case:

- User updates a Deployment to rolling-update 10 pods with image:v1 to
  pods with image:v2.
- User then updates this Deployment to create pods with image:v3,
  when the image:v2 RS had been ramped up to 5 pods and the image:v1 RS
  had been ramped down to 5 pods.
- When the Deployment Controller observes the new update, it will create
  a new RS for creating pods with image:v3. It will then start ramping up this
  new RS to 10 pods and will ramp down both the existing RSs to 0.

### Deleting

Users can pause/cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:

- User creates a Deployment to perform a rolling-update for 10 pods from image:v1 to
  image:v2.
- User then deletes the Deployment while the old and new RSs are at 5 replicas each.
  User will end up with 2 RSs with 5 replicas each.
  User can then re-create the same Deployment, in which case DeploymentController will
  notice that the second RS exists already, which it can ramp up while ramping down
  the first one.

### Rollback

We want to allow the user to rollback a Deployment. To rollback a completed (or
ongoing) Deployment, users can simply use `kubectl rollout undo` or update the
Deployment directly by using its spec.rollbackTo.revision field, specifying the
revision they want to rollback to. If no revision is specified, the Deployment
will be rolled back to its previous revision.

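A minimal sketch of what the rollback trigger could look like on the spec. The
RollbackConfig naming follows the extensions/v1beta1 Deployment API, and
DeploymentSpecFragment is a hypothetical stand-in used only to keep the example
self-contained:

```go
package main

import "fmt"

// Sketch of the rollback trigger described above (spec.rollbackTo.revision);
// shown here only for illustration.
type RollbackConfig struct {
	// The revision to roll back to. If set to 0, roll back to the last revision.
	Revision int64
}

type DeploymentSpecFragment struct {
	// RollbackTo, if set, asks the Deployment controller to roll the
	// Deployment back to the given revision.
	RollbackTo *RollbackConfig
}

func main() {
	// Ask for a rollback to revision 2; a zero revision means "previous revision".
	spec := DeploymentSpecFragment{RollbackTo: &RollbackConfig{Revision: 2}}
	fmt.Println(spec.RollbackTo.Revision)
}
```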

## Deployment Strategies

DeploymentStrategy specifies how the new RS should replace existing RSs.
To begin with, we will support 2 types of Deployment:

* Recreate: We kill all existing RSs and then bring up the new one. This results
  in a quick Deployment, but there is downtime while the old pods are down and
  the new ones have not come up yet.
* Rolling update: We gradually scale down old RSs while scaling up the new one.
  This results in a slower Deployment, but there can be no downtime. Depending on
  the strategy parameters, it is possible to have available pods (old or new) at all
  times during the rollout. The number of available pods and when a pod is
  considered "available" can be configured using RollingUpdateDeploymentStrategy.

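As an illustration of how those parameters bound a rolling update, here is a small
sketch. It assumes MaxSurge and a companion maxUnavailable parameter are given as
percentages of the desired replica count; the rounding directions shown (surge
rounded up, unavailable rounded down) are illustrative, not normative.

```go
package main

import (
	"fmt"
	"math"
)

// rolloutBounds computes the total-pod ceiling and the available-pod floor that
// a rolling update must respect, given surge and unavailability budgets
// expressed as percentages of the desired replica count.
func rolloutBounds(desired int, maxSurgePct, maxUnavailablePct float64) (maxTotal, minAvailable int) {
	surge := int(math.Ceil(float64(desired) * maxSurgePct / 100))
	unavailable := int(math.Floor(float64(desired) * maxUnavailablePct / 100))
	return desired + surge, desired - unavailable
}

func main() {
	// 10 desired pods, 30% surge, 20% unavailable:
	// never more than 13 pods in total, never fewer than 8 available.
	maxTotal, minAvailable := rolloutBounds(10, 30, 20)
	fmt.Println(maxTotal, minAvailable) // 13 8
}
```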

## Hashing collisions

Hashing collisions are a real thing with the existing hashing algorithm[1]. We
need to switch to a more stable algorithm like fnv. Preliminary benchmarks[2]
show that while fnv is a bit slower than adler, it is much more stable. Also,
hashing an API object is subject to API changes, which means that the name
for a ReplicaSet may differ between minor Kubernetes versions.

For both of the aforementioned cases, we will use a field in the DeploymentStatus,
called Uniquifier, to create a unique hash value when a hash collision happens.
The Deployment controller will compute the hash value of {template+uniquifier},
and will use the resulting hash in the ReplicaSet names and selectors. One side
effect of this hash collision avoidance mechanism is that we don't need to
migrate ReplicaSets that were created with adler.
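
A minimal sketch of the collision-avoidance hashing described above, using Go's
hash/fnv. The template argument stands in for the serialized PodTemplateSpec; when
the controller detects a collision, it bumps the Uniquifier and hashes again:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// podTemplateHash sketches hashing {template+uniquifier} with FNV-1a.
// template stands in for the serialized PodTemplateSpec.
func podTemplateHash(template string, uniquifier *int64) string {
	h := fnv.New32a()
	h.Write([]byte(template))
	if uniquifier != nil {
		fmt.Fprintf(h, "%d", *uniquifier)
	}
	return fmt.Sprintf("%d", h.Sum32())
}

func main() {
	u := int64(1)
	fmt.Println(podTemplateHash("image: nginx:1.9", nil)) // first attempt
	fmt.Println(podTemplateHash("image: nginx:1.9", &u))  // retry after one collision
}
```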

[1] https://github.com/kubernetes/kubernetes/issues/29735

[2] https://github.com/kubernetes/kubernetes/pull/39527

## Future