The rolling-update requires the apiserver (when called without --cloudonly),
so reconcile should wait for apiserver to start responding.
Implement this by reusing "validate cluster", but filtering to only the instance groups
and pods that we expect to be online.
This lets us use labels (or annotations), meaning we can experiment
with different clouds without changing the API.
We also add initial (experimental/undocumented) support for exposing a "Metal" provider.
This lets us safely make changes to otherwise immutable fields, in
particular for adding security groups to NLBs created without them.
We detect the older versions, and create deletion tasks to remove
them. These tasks can be deferred, and we expect them to be
deferred to a "prune" phase that runs after cluster apply.
Co-authored-by: Ciprian Hacman <ciprian@hakman.dev>
Sometimes, we observe the following error during a rolling update:
error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX
The sequence of events that leads to this problem is the following:
- A new ASG object is being built from the launch template
- Existing instances are being added to it
- An existing instance is being ignored because it's already terminating
W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
- Due to maxSurge, the rolling update tries to detach the terminating
instance from the autoscaling group, and the detach fails.
As such, in case of EC2 ASG detach failures we can simply try to detach
the next node instead of aborting the whole update operation.
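The idea can be sketched roughly as follows. This is a minimal
illustration, not the actual kops code: detachFunc, detachUpTo and
fakeDetach are hypothetical names, and the cloud call is replaced by
a stub that fails for the terminating instance.

```go
package main

import (
	"errors"
	"fmt"
)

// detachFunc is a hypothetical stand-in for the cloud call that detaches
// one instance from its autoscaling group.
type detachFunc func(instanceID string) error

// detachUpTo tries to detach up to want instances from candidates.
// A failed detach is recorded and skipped, so the next candidate is tried
// instead of aborting the whole operation.
func detachUpTo(candidates []string, want int, detach detachFunc) (detached []string, skipped []error) {
	for _, id := range candidates {
		if len(detached) >= want {
			break
		}
		if err := detach(id); err != nil {
			skipped = append(skipped, fmt.Errorf("error detaching instance %q: %w", id, err))
			continue
		}
		detached = append(detached, id)
	}
	return detached, skipped
}

// errNotInGroup simulates the ValidationError quoted above.
var errNotInGroup = errors.New("instance is not part of Auto Scaling group")

// fakeDetach fails only for the instance that is already terminating.
func fakeDetach(id string) error {
	if id == "i-terminating" {
		return errNotInGroup
	}
	return nil
}
```

With candidates "i-terminating", "i-a", "i-b" and want=1, the first
detach fails and is skipped, and "i-a" is detached instead.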
When unrelated instance groups produce validation errors, validation also
fails for the instance group being updated, and the rolling update is forced
to wait before it can continue.
This can be avoided, as failures in other node instance groups usually don't affect
the instance group being updated in any way.
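A rough sketch of the filtering, under assumptions: validationFailure
is a simplified, hypothetical model of a validation failure, and
blockingFailures is an illustrative helper, not the real kops function.
It assumes failures that are not tied to a node instance group (for
example control-plane or cluster-wide failures) must still block.

```go
package main

// validationFailure is a simplified, hypothetical model of one cluster
// validation failure.
type validationFailure struct {
	Message       string
	InstanceGroup string // empty when the failure is not tied to a group
	NodeRole      bool   // true when InstanceGroup is a node (worker) group
}

// blockingFailures returns the failures that should hold up the rolling
// update of the group named updating. Failures that belong to a different
// node instance group are ignored; everything else still blocks.
func blockingFailures(failures []validationFailure, updating string) []validationFailure {
	var blocking []validationFailure
	for _, f := range failures {
		if f.NodeRole && f.InstanceGroup != "" && f.InstanceGroup != updating {
			continue // unrelated node group: does not affect this update
		}
		blocking = append(blocking, f)
	}
	return blocking
}
```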
This is a follow-on to #8868; I believe the intent of that was to
expose the option to do more (or fewer) retries.
We previously had a single retry to prevent flapping; this basically
unifies the previous behaviour with the idea of making it
configurable.
* validate-count=0 effectively turns off validation.
* validate-count=1 will do a single validation, without flapping
detection.
* validate-count>=2 will require N successful validations in a row,
waiting ValidateSuccessDuration in between.
A nice side-effect of this is that the tests now explicitly specify
ValidateCount=1 instead of setting ValidateSuccessDuration=0, which
had the side effect of doing the equivalent to ValidateCount=1.
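The validate-count semantics above can be sketched like this. It is an
illustration under assumptions, not the actual implementation:
validateRepeatedly and maxAttempts are hypothetical (a bounded attempt
count stands in for whatever timeout the real code uses), and there is
no sleep after a failed validation, to keep the sketch short.

```go
package main

import (
	"errors"
	"time"
)

// errNotReady and errValidationTimedOut are illustrative sentinel errors.
var (
	errNotReady           = errors.New("cluster not ready")
	errValidationTimedOut = errors.New("cluster did not validate in time")
)

// validateRepeatedly requires validateCount successful validations in a
// row, waiting successDuration between successes. A failure resets the
// streak, which is what detects flapping.
func validateRepeatedly(validate func() error, validateCount int, successDuration time.Duration, maxAttempts int) error {
	if validateCount <= 0 {
		return nil // validate-count=0: validation is effectively off
	}
	successes := 0
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := validate(); err != nil {
			successes = 0 // the cluster flapped; start counting again
			continue
		}
		successes++
		if successes >= validateCount {
			return nil
		}
		time.Sleep(successDuration)
	}
	return errValidationTimedOut
}
```

With validateCount=1 the loop returns after the first success, so
successDuration is never used; this is why tests can set the count
explicitly rather than zeroing the duration.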
The client-go signature for most methods adds a context.Context
object, and also makes Options mandatory. Feed a context.Context
through many of our methods (but use context.TODO to
stop it getting totally out of hand!)
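The pattern looks roughly like the sketch below. nodeLister,
listOptions and fakeLister are hypothetical stand-ins shaped like the
newer client-go method signatures; they are not the real client-go
types. Converted functions take a ctx and pass it down, while callers
that have not been converted yet use context.TODO().

```go
package main

import (
	"context"
	"errors"
)

// listOptions is a hypothetical options struct; it is passed explicitly
// on every call, mirroring the mandatory Options argument.
type listOptions struct {
	LabelSelector string
}

// nodeLister is a hypothetical interface with a context-first signature.
type nodeLister interface {
	List(ctx context.Context, opts listOptions) ([]string, error)
}

// fakeLister returns a fixed node list unless the context is done.
type fakeLister struct {
	nodes []string
}

func (f fakeLister) List(ctx context.Context, opts listOptions) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return f.nodes, nil
}

// countNodes is a converted method: it accepts a context and feeds it
// through to the client call.
func countNodes(ctx context.Context, l nodeLister) (int, error) {
	nodes, err := l.List(ctx, listOptions{})
	if err != nil {
		return 0, err
	}
	return len(nodes), nil
}

// legacyCountNodes is a caller that has no context yet, so it uses
// context.TODO() as a placeholder at the boundary.
func legacyCountNodes(l nodeLister) (int, error) {
	return countNodes(context.TODO(), l)
}

// countNodesCancelled shows that a cancelled context reaches the client.
func countNodesCancelled(l nodeLister) error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := countNodes(ctx, l)
	return err
}

// isCancelled reports whether err was caused by context cancellation.
func isCancelled(err error) bool {
	return errors.Is(err, context.Canceled)
}
```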