# Rolling Updates

Upgrading and modifying a k8s cluster usually requires the replacement of cloud instances.
In order to avoid loss of service and other disruption, Kops replaces cloud instances
incrementally with a rolling update.

Rolling updates are performed using
[the `kops rolling-update cluster` command](../cli/kops_rolling-update_cluster.md).

## Instance selection

Cloud instances are chosen to be updated (replaced) if at least one of the following is true:

* The instance was created with a specification that is older than that generated by the last
  `kops update cluster`.
* The instance was detached for surging by a previous (failed or interrupted) rolling update.
* The node has a `kops.k8s.io/needs-update` annotation.
* The `--force` flag was given to the `kops rolling-update cluster` command.
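
The annotation-based selection above can be triggered by hand; a sketch, assuming a node named `node-1` and a working `kubectl` context (the annotation only needs to exist, so the value shown here is arbitrary):

```shell
# Mark a single node for replacement on the next rolling update.
kubectl annotate node node-1 kops.k8s.io/needs-update="manual"

# Preview which instances would be replaced (add --yes to actually apply).
kops rolling-update cluster
```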

## Order of instance groups

A rolling update will update instances from one instance group at a time. First, it will update
bastion instance groups. Next, it will update master instance groups. Finally, it will update
node instance groups.

A rolling update may be restricted to instance groups of particular roles
("Bastion", "Master", and/or "Node") with the `--instance-group-roles` flag.
A rolling update may be restricted to particular instance groups with the `--instance-group` flag.
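
A sketch of both restrictions, assuming a cluster with an instance group named `nodes` (a hypothetical name):

```shell
# Only roll instance groups with the "node" role, leaving masters and bastions alone.
kops rolling-update cluster --instance-group-roles node --yes

# Only roll a single named instance group.
kops rolling-update cluster --instance-group nodes --yes
```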

## Updating an instance group

The first thing rolling update will do when updating an instance group is validate the cluster,
as for [the `kops validate cluster` command](../cli/kops_validate_cluster.md).
If the cluster fails validation at this time then the entire rolling update will stop with an error.
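
The same check can be run by hand before starting a rolling update; a sketch, assuming kops can reach the cluster:

```shell
# Validate the cluster, retrying for up to 10 minutes until it passes.
kops validate cluster --wait 10m
```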

Next, rolling update will apply a PreferNoSchedule (soft) taint to the
instance group's nodes that have been chosen to be updated. This will prevent new
pods, including replacements for evicted pods, from being scheduled on the old nodes
unless there is no other place to schedule them.

This validation and tainting will not be performed if either of the following is true:

* The instance group is of role "Bastion".
* The `--cloudonly` flag was given to the `kops rolling-update cluster` command.

Finally, rolling update will replace the instance group's chosen nodes, respecting the limits
configured in that group's rolling update strategy.

### Updating an instance

When being updated, a node is first cordoned to prevent any new pods from being scheduled on it.
The cordoning also causes some cloud provider load balancers to remove the node from the set of
available destinations. Next, the node is drained, voluntarily evicting all pods not managed by
a DaemonSet. This eviction respects any pod disruption budgets.

After all such pods have been evicted, rolling update will wait 5 seconds to allow TCP connections
to those pods to close. The amount of time to wait may be changed with the `--post-drain-delay` flag.

Instances will not be cordoned or drained if at least one of the following is true:

* They are bastions.
* They were not registered as nodes.
* The `--cloudonly` flag was given to the `kops rolling-update cluster` command.

Rolling update will then terminate the instance. Unless the instance had been detached for surging,
this will cause the cloud provider to create a new instance with the current specification.

Rolling update then waits 15 seconds to allow the Kubernetes API server to notice the termination.
The amount of time to wait may be changed with the `--bastion-interval`, `--master-interval`, and/or
`--node-interval` flags.
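
The delays described above can be tuned per run; a sketch combining the drain delay and interval flags (the durations shown are arbitrary illustrations, not recommendations):

```shell
# Wait 90s after draining each node, and 5m after terminating each node or
# master before proceeding, instead of the defaults.
kops rolling-update cluster \
  --post-drain-delay 90s \
  --node-interval 5m \
  --master-interval 5m \
  --yes
```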

Unless the `--cloudonly` flag was given, rolling update then waits until the cluster validates
successfully. This is done in order to ensure the
replacement instance is working before rolling update proceeds to update another instance.

### Configurable rolling update strategies

The behavior of rolling update within an instance group may be configured through the
`rollingUpdate` field of the group's
[InstanceGroupSpec](https://pkg.go.dev/k8s.io/kops/pkg/apis/kops#InstanceGroupSpec).

Cluster-wide defaults may be configured through the `rollingUpdate` field of the
[ClusterSpec](https://pkg.go.dev/k8s.io/kops/pkg/apis/kops#ClusterSpec).
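
A cluster-wide default uses the same shape in the Cluster spec as in an InstanceGroup spec; a sketch, using an arbitrary value for the `maxUnavailable` field described below:

```yaml
# In the Cluster spec (e.g. via `kops edit cluster`); applies to all
# instance groups unless overridden in a group's own spec.
spec:
  rollingUpdate:
    maxUnavailable: 2
```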

#### maxUnavailable

The `maxUnavailable` field specifies the maximum number of nodes that can be unavailable
during the rolling update. Increasing this setting allows more instances to be updated
in parallel.

The value can be an absolute number (for example 5) or a percentage of the nodes
in the group (for example "10%"). The absolute number is calculated from a percentage by
rounding down.
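
Because of the floor rounding, a percentage can resolve to fewer unavailable nodes than the raw fraction suggests; a quick arithmetic sketch, assuming a hypothetical 10-node group:

```shell
nodes=10
pct=25   # maxUnavailable: "25%"

# Shell integer division truncates, matching the round-down behavior:
# 25% of 10 nodes is 2.5, which floors to 2.
echo $(( nodes * pct / 100 ))
```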

For example, to permit two instances to be updated in parallel:

```yaml
spec:
  rollingUpdate:
    maxUnavailable: 2
```

This field defaults to `1` if the `maxSurge` field is `0`, otherwise it defaults to `0`.

If there are no instances that have been created with the current specification, then a rolling
update will start with updating a single instance. It does this to limit the damage in case the
new specification results in non-working nodes.

#### maxSurge

Surging is temporarily increasing the number of instances in an instance group during a rolling
update. Instead of first draining and terminating an instance and then creating a new one,
it effectively first creates a new instance and then drains and terminates the old one.

Surging is implemented by "detaching" instances, making them not count toward the desired
number of instances in the instance group. The detached instances are updated last;
when they are terminated the cloud provider does not replace them.

The `maxSurge` field is the maximum number of extra instances that can be created during the update.
Increasing this setting allows more instances to be updated in parallel. Rolling update will
not create more new instances than the number of instances selected for update.

The value can be an absolute number (for example 5) or a percentage of the nodes
in the group (for example "10%"). The absolute number is calculated from a percentage by
rounding up.
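
Unlike `maxUnavailable`, the percentage here rounds up; a quick arithmetic sketch, again assuming a hypothetical 10-node group:

```shell
nodes=10
pct=25   # maxSurge: "25%"

# Ceiling division via (a + b - 1) / b:
# 25% of 10 nodes is 2.5, which rounds up to 3 surge instances.
echo $(( (nodes * pct + 99) / 100 ))
```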

Masters are unable to surge. Any cluster-wide default setting will be ignored for instance
groups of role "Master". Setting this value on the InstanceGroupSpec for an instance group of
role "Master" will result in an API validation error.

For example, to add a maximum of two additional instances to the group during a rolling update,
allowing two to be updated in parallel:

```yaml
spec:
  rollingUpdate:
    maxSurge: 2
```

If there are no instances that have been created with the current specification, then rolling
update will start with creating a single new instance. It does this to limit the damage in case the
new specification results in non-working nodes. Once the new instance validates successfully, it
then creates any remaining surge instances.

#### Disabling rolling updates

Rolling updates may be disabled for an instance group by setting both `maxSurge` and `maxUnavailable`
to `0`:

```yaml
spec:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 0
```