Merge pull request #477 from kargakis/alternative-deployment-proposal
Refine the Deployment proposal and switch hashing algorithm
commit 7bcff32eca

@@ -1,16 +1,24 @@

# Deployment

Authors:

- Brian Grant (@bgrant0607)
- Clayton Coleman (@smarterclayton)
- Dan Mace (@ironcladlou)
- David Oppenheimer (@davidopp)
- Janet Kuo (@janetkuo)
- Michail Kargakis (@kargakis)
- Nikhil Jindal (@nikhiljindal)

## Abstract

A proposal for implementing a new resource - Deployment - which will enable
declarative config updates for ReplicaSets. Users will be able to create a
Deployment, which will spin up a ReplicaSet to bring up the desired Pods.
Users can also target the Deployment to an existing ReplicaSet, either by
rolling back an existing Deployment or by creating a new Deployment that can
adopt an existing ReplicaSet. The exact mechanics of replacement depend on
the DeploymentStrategy chosen by the user. DeploymentStrategies are explained
in detail in a later section.

## Implementation

@@ -33,10 +41,10 @@ type Deployment struct {

```go
type DeploymentSpec struct {
  // Number of desired pods. This is a pointer to distinguish between explicit
  // zero and not specified. Defaults to 1.
  Replicas *int32

  // Label selector for pods. Existing ReplicaSets whose pods are
  // selected by this will be scaled down. New ReplicaSets will be
  // created with this selector, with a unique label `pod-template-hash`.
  // If Selector is empty, it is defaulted to the labels present on the Pod template.
  Selector map[string]string
```

@@ -46,14 +54,17 @@ type DeploymentSpec struct {

```go
  // The deployment strategy to use to replace existing pods with new ones.
  Strategy DeploymentStrategy

  // Minimum number of seconds for which a newly created pod should be ready
  // without any of its container crashing, for it to be considered available.
  // Defaults to 0 (pod will be considered available as soon as it is ready)
  MinReadySeconds int32
}

type DeploymentStrategy struct {
  // Type of deployment. Can be "Recreate" or "RollingUpdate".
  Type DeploymentStrategyType

  // Rolling update config params. Present only if DeploymentStrategyType =
  // RollingUpdate.
  RollingUpdate *RollingUpdateDeploymentStrategy
```

@@ -65,7 +76,8 @@ const (

```go
  // Kill all existing pods before creating new ones.
  RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate"

  // Replace the old ReplicaSets by a new one using a rolling update, i.e. gradually
  // scale down the old ReplicaSets and scale up the new one.
  RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate"
)
```

@@ -94,20 +106,20 @@ type RollingUpdateDeploymentStrategy struct {

```go
  // new RC can be scaled up further, ensuring that total number of pods running
  // at any time during the update is at most 130% of original pods.
  MaxSurge IntOrString
}

type DeploymentStatus struct {
  // Total number of ready pods targeted by this deployment (this
  // includes both the old and new pods).
  Replicas int32

  // Total number of new ready pods with the desired template spec.
  UpdatedReplicas int32

  // Monotonically increasing counter that tracks hash collisions for
  // the Deployment. Used as a collision avoidance mechanism by the
  // Deployment controller.
  Uniquifier *int64
}
```

@@ -116,38 +128,42 @@ type DeploymentStatus struct {

#### Deployment Controller

The DeploymentController will process Deployments and create, update, and delete
ReplicaSets as needed. For each creation or update of a Deployment, it will:

1. Find all RSs (ReplicaSets) whose label selector is a superset of DeploymentSpec.Selector.
   - For now, we will do this in the client - list all RSs and then filter the
     ones we want. Eventually, we want to expose this in the API.
2. The new RS can have the same selector as the old RS and hence we add a unique
   selector to all these RSs (and the corresponding label to their pods) to ensure
   that they do not select the newly created pods (or old pods get selected by the
   new RS).
   - The label key will be "pod-template-hash".
   - The label value will be the hash of {podTemplateSpec+uniquifier}, where podTemplateSpec
     is the one that the new RS uses and uniquifier is a counter in the DeploymentStatus
     that increments every time a [hash collision](#hashing-collisions) happens (hash
     collisions should be rare with fnv).
   - If the RSs and pods don't already have this label and selector:
     - We will first add this to RS.PodTemplateSpec.Metadata.Labels for all RSs to
       ensure that all new pods that they create will have this label.
     - Then we will add this label to their existing pods.
     - Eventually we flip the RS selector to use the new label.
   This process can potentially be abstracted to a new endpoint for controllers [1].
3. Find if there exists an RS for which the value of the "pod-template-hash" label
   is the same as the hash of DeploymentSpec.PodTemplateSpec. If it exists already, then
   this is the RS that will be ramped up. If there is no such RS, then we create
   a new one using DeploymentSpec and then add a "pod-template-hash" label
   to it. The size of the new RS depends on the DeploymentStrategyType used.
4. Scale up the new RS and scale down the old ones as per the DeploymentStrategy.
   Raise events appropriately (both in case of failure and success).
5. Go back to step 1 unless the new RS has been ramped up to the desired replicas
   and the old RSs have been ramped down to 0.
6. Clean up old RSs as per revisionHistoryLimit.

DeploymentController is stateless so that it can recover in case it crashes during a deployment.

[1] See https://github.com/kubernetes/kubernetes/issues/36897

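To make the steps above concrete, here is a minimal, self-contained sketch of one
reconciliation pass. The types and the `hashOf` helper are hypothetical stand-ins for
the real API objects and for the fnv-based hashing described in the Hashing collisions
section, and the scaling here jumps straight to the target sizes instead of honoring
the DeploymentStrategy.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the API types used in this proposal.
type PodTemplateSpec struct {
	Labels map[string]string
	Image  string
}

type ReplicaSet struct {
	Name     string
	Selector map[string]string
	Template PodTemplateSpec
	Replicas int32
}

type Deployment struct {
	Name       string
	Replicas   int32
	Template   PodTemplateSpec
	Uniquifier int64
}

// hashOf stands in for hashing {podTemplateSpec+uniquifier}; see the
// "Hashing collisions" section for an fnv-based version.
func hashOf(t PodTemplateSpec, uniquifier int64) string {
	return fmt.Sprintf("%s-%d", t.Image, uniquifier)
}

// reconcile performs one pass of steps 1-5: find or create the new RS by its
// pod-template-hash, then scale it up and scale the old RSs down. A real
// controller would do this gradually, as dictated by the DeploymentStrategy.
func reconcile(d *Deployment, rss []*ReplicaSet) []*ReplicaSet {
	hash := hashOf(d.Template, d.Uniquifier)

	var newRS *ReplicaSet
	for _, rs := range rss {
		if rs.Selector["pod-template-hash"] == hash { // step 3: existing RS matches
			newRS = rs
			break
		}
	}
	if newRS == nil { // step 3: no match, create a new RS carrying the hash label
		newRS = &ReplicaSet{
			Name:     fmt.Sprintf("%s-%s", d.Name, hash),
			Selector: map[string]string{"pod-template-hash": hash},
			Template: d.Template,
		}
		rss = append(rss, newRS)
	}

	newRS.Replicas = d.Replicas // step 4: scale up the new RS
	for _, rs := range rss {
		if rs != newRS {
			rs.Replicas = 0 // step 4: scale down the old RSs
		}
	}
	return rss
}

func main() {
	d := &Deployment{Name: "web", Replicas: 3, Template: PodTemplateSpec{Image: "nginx:1.9"}}
	for _, rs := range reconcile(d, nil) {
		fmt.Println(rs.Name, rs.Replicas)
	}
}
```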

### MinReadySeconds

We will implement MinReadySeconds using the Ready condition in Pod. We will add
LastTransitionTime to PodCondition.

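As a rough sketch of the intended semantics (not the controller's actual code, and
ignoring the container-crash condition), a pod only counts as available once its Ready
condition has held for at least MinReadySeconds:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for the relevant part of PodStatus.
type PodCondition struct {
	Type               string // e.g. "Ready"
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// isAvailable reports whether a pod whose Ready condition is c should be
// considered available, given the Deployment's MinReadySeconds.
func isAvailable(c PodCondition, minReadySeconds int32, now time.Time) bool {
	if c.Type != "Ready" || c.Status != "True" {
		return false
	}
	readyFor := now.Sub(c.LastTransitionTime)
	return readyFor >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	cond := PodCondition{Type: "Ready", Status: "True", LastTransitionTime: time.Now().Add(-5 * time.Second)}
	fmt.Println(isAvailable(cond, 10, time.Now())) // false: ready for only ~5s
	fmt.Println(isAvailable(cond, 3, time.Now()))  // true: ready for longer than 3s
}
```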

@@ -163,52 +179,71 @@ LastTransitionTime to PodCondition.

### Updating

Users can update an ongoing Deployment before it is completed.
In this case, the existing rollout will be stalled and the new one will
begin.
For example, consider the following case:

- User updates a Deployment to rolling-update 10 pods with image:v1 to
  pods with image:v2.
- User then updates this Deployment to create pods with image:v3,
  when the image:v2 RS had been ramped up to 5 pods and the image:v1 RS
  had been ramped down to 5 pods.
- When the Deployment Controller observes the new update, it will create
  a new RS for creating pods with image:v3. It will then start ramping up this
  new RS to 10 pods and will ramp down both the existing RSs to 0.

### Deleting

Users can pause/cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:

- User creates a Deployment to perform a rolling-update for 10 pods from image:v1 to
  image:v2.
- User then deletes the Deployment while the old and new RSs are at 5 replicas each.
  User will end up with 2 RSs with 5 replicas each.
  User can then re-create the same Deployment, in which case DeploymentController will
  notice that the second RS exists already, which it can ramp up while ramping down
  the first one.

### Rollback

We want to allow the user to rollback a Deployment. To rollback a completed (or
ongoing) Deployment, users can simply use `kubectl rollout undo` or update the
Deployment directly by using its spec.rollbackTo.revision field, specifying the
revision they want to rollback to. If no revision is specified, the Deployment
will be rolled back to its previous revision.

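A minimal sketch of what the rollback trigger could look like on the spec. The
RollbackConfig naming follows the extensions/v1beta1 Deployment API, and
DeploymentSpecFragment is a hypothetical stand-in used only to keep the example
self-contained:

```go
package main

import "fmt"

// Sketch of the rollback trigger described above (spec.rollbackTo.revision);
// shown here only for illustration.
type RollbackConfig struct {
	// The revision to roll back to. If set to 0, roll back to the last revision.
	Revision int64
}

type DeploymentSpecFragment struct {
	// RollbackTo, if set, asks the Deployment controller to roll the
	// Deployment back to the given revision.
	RollbackTo *RollbackConfig
}

func main() {
	// Ask for a rollback to revision 2; a zero revision means "previous revision".
	spec := DeploymentSpecFragment{RollbackTo: &RollbackConfig{Revision: 2}}
	fmt.Println(spec.RollbackTo.Revision)
}
```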

## Deployment Strategies

DeploymentStrategy specifies how the new RS should replace existing RSs.
To begin with, we will support 2 types of Deployment:

* Recreate: We kill all existing RSs and then bring up the new one. This results
  in a quick Deployment, but there is downtime while the old pods are down and
  the new ones have not come up yet.
* Rolling update: We gradually scale down old RSs while scaling up the new one.
  This results in a slower Deployment, but there can be no downtime. Depending on
  the strategy parameters, it is possible to have available pods (old or new) at all
  times during the rollout. The number of available pods and when a pod is
  considered "available" can be configured using RollingUpdateDeploymentStrategy.

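As an illustration of how those parameters bound a rolling update, here is a small
sketch. It assumes MaxSurge and a companion maxUnavailable parameter are given as
percentages of the desired replica count; the rounding directions shown (surge
rounded up, unavailable rounded down) are illustrative, not normative.

```go
package main

import (
	"fmt"
	"math"
)

// rolloutBounds computes the total-pod ceiling and the available-pod floor that
// a rolling update must respect, given surge and unavailability budgets
// expressed as percentages of the desired replica count.
func rolloutBounds(desired int, maxSurgePct, maxUnavailablePct float64) (maxTotal, minAvailable int) {
	surge := int(math.Ceil(float64(desired) * maxSurgePct / 100))
	unavailable := int(math.Floor(float64(desired) * maxUnavailablePct / 100))
	return desired + surge, desired - unavailable
}

func main() {
	// 10 desired pods, 30% surge, 20% unavailable:
	// never more than 13 pods in total, never fewer than 8 available.
	maxTotal, minAvailable := rolloutBounds(10, 30, 20)
	fmt.Println(maxTotal, minAvailable) // 13 8
}
```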

## Hashing collisions

Hashing collisions are a real thing with the existing hashing algorithm[1]. We
need to switch to a more stable algorithm like fnv. Preliminary benchmarks[2]
show that while fnv is a bit slower than adler, it is much more stable. Also,
hashing an API object is subject to API changes, which means that the name
for a ReplicaSet may differ between minor Kubernetes versions.

For both of the aforementioned cases, we will use a field in the DeploymentStatus,
called Uniquifier, to create a unique hash value when a hash collision happens.
The Deployment controller will compute the hash value of {template+uniquifier},
and will use the resulting hash in the ReplicaSet names and selectors. One side
effect of this hash collision avoidance mechanism is that we don't need to
migrate ReplicaSets that were created with adler.
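
A minimal sketch of the collision-avoidance hashing described above, using Go's
hash/fnv. The template argument stands in for the serialized PodTemplateSpec; when
the controller detects a collision, it bumps the Uniquifier and hashes again:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// podTemplateHash sketches hashing {template+uniquifier} with FNV-1a.
// template stands in for the serialized PodTemplateSpec.
func podTemplateHash(template string, uniquifier *int64) string {
	h := fnv.New32a()
	h.Write([]byte(template))
	if uniquifier != nil {
		fmt.Fprintf(h, "%d", *uniquifier)
	}
	return fmt.Sprintf("%d", h.Sum32())
}

func main() {
	u := int64(1)
	fmt.Println(podTemplateHash("image: nginx:1.9", nil)) // first attempt
	fmt.Println(podTemplateHash("image: nginx:1.9", &u))  // retry after one collision
}
```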

[1] https://github.com/kubernetes/kubernetes/issues/29735

[2] https://github.com/kubernetes/kubernetes/pull/39527

## Future