Commit Graph

116 Commits

Author SHA1 Message Date
justinsb f2d4eeb104 reconcile: wait for apiserver to respond before trying rolling-update
The rolling-update requires the apiserver (when called without --cloudonly),
so reconcile should wait for apiserver to start responding.

Implement this by reusing "validate cluster", but filtering to only the instance groups
and pods that we expect to be online.
2025-01-13 17:47:48 -05:00
justinsb ebcfebe50e chore: add context to rolling update functions
Move it out of the struct, and into the function parameters.

This is more idiomatic Go.
2024-12-27 14:22:51 -05:00
justinsb 3646a610b1 refactor: Move GetCloudProvider to cluster
This lets us use labels (or annotations), meaning we can experiment
with different clouds without changing the API.

We also add initial (experimental/undocumented) support for exposing a "Metal" provider.
2024-08-26 08:20:37 -04:00
justinsb 65fe6dc3c4 refactor: ApplyClusterCmd clearly returns results
By having an explicit return value, we set ourselves up for better reuse.
2024-07-04 14:54:00 -04:00
Ciprian Hacman 28939f865b azure: Implement DeleteInstance for rolling update 2024-04-07 16:02:22 +03:00
justinsb 2a9343a168 Generate revisions of NLB objects, and introduce cleanup phase
This lets us safely make changes to otherwise immutable fields, in
particular for adding security groups to NLBs created without them.

We detect the older versions, and create deletion tasks to remove
them.  These tasks can be deferred, and we expect them to be
deferred to a "prune" phase that runs after cluster apply.

Co-authored-by: Ciprian Hacman <ciprian@hakman.dev>
2024-02-17 11:41:15 -05:00
Leïla MARABESE c02fb479dc reconcile instancegroup 2023-08-29 17:42:19 +02:00
Jack Andersen 89dfafefe7 Make struct members private, alter formatting, add unwrap method
Signed-off-by: Jack Andersen <jandersen@plaid.com>
2022-12-21 09:30:19 -08:00
Jack Andersen f5f71f17f9 Satisfy the Is interface with ValidationTimeoutError and change callers of err check
Signed-off-by: Jack Andersen <jandersen@plaid.com>
2022-12-21 09:30:17 -08:00
Jack Andersen 2bd5403f37 Create a specific error type for validation timeouts and classify as exitable
Signed-off-by: Jack Andersen <jandersen@plaid.com>
2022-12-21 09:30:16 -08:00
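The three commits above (2bd5403f37, f5f71f17f9, 89dfafefe7) describe a dedicated error type for validation timeouts that supports `Is` and `Unwrap`. A minimal sketch of that pattern follows. The field names, `isExitable`, `failingRollingUpdate`, and `errNotReady` are assumptions made for illustration and are not taken from the kOps source.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical underlying cause used only in this sketch.
var errNotReady = errors.New("node not ready")

// ValidationTimeoutError is an illustrative error type with private fields.
// The field names are assumptions, not the real kOps definition.
type ValidationTimeoutError struct {
	operation string
	timeout   time.Duration
	err       error
}

func (e *ValidationTimeoutError) Error() string {
	return fmt.Sprintf("%s did not validate within %s: %v", e.operation, e.timeout, e.err)
}

// Unwrap exposes the underlying cause to errors.Is and errors.As.
func (e *ValidationTimeoutError) Unwrap() error { return e.err }

// Is matches any *ValidationTimeoutError target, whatever its field values,
// so callers can test for the error class with errors.Is.
func (e *ValidationTimeoutError) Is(target error) bool {
	_, ok := target.(*ValidationTimeoutError)
	return ok
}

// isExitable is a hypothetical classifier: a validation timeout is treated
// as a failure that should end the command instead of being retried.
func isExitable(err error) bool {
	return errors.Is(err, &ValidationTimeoutError{})
}

// failingRollingUpdate simulates a caller that wraps the timeout error.
func failingRollingUpdate() error {
	return fmt.Errorf("rolling update: %w", &ValidationTimeoutError{
		operation: "cluster",
		timeout:   15 * time.Minute,
		err:       errNotReady,
	})
}

func main() {
	err := failingRollingUpdate()
	// The class check and the cause check both work through the wrap chain.
	fmt.Println(isExitable(err), errors.Is(err, errNotReady))
}
```

Implementing `Is` on the type lets callers classify the error without comparing pointers or field values, and `Unwrap` keeps the original cause reachable.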
John Gardiner Myers de9055b588 Update control-plane terminology in CLI output strings 2022-11-23 21:32:10 -08:00
John Gardiner Myers d39ba74bd7 Change the control-plane IG role to "ControlPlane" in v1alpha3 API 2022-11-22 17:05:29 -08:00
Ole Markus With a5b1722110 Ensure kOps doesn't surge on karpenter IGs 2022-10-17 15:22:39 +02:00
justinsb 4b2f773748 rolling-update: don't deregister our only apiserver
If we do, we can't drain the node afterwards.  We also are going to
have dropped connections in this case anyway.
2022-09-15 09:16:57 -04:00
Ole Markus With 1ea5243406 Warm pool-enabled ASGs scaled to zero will no longer panic 2022-09-09 11:08:00 +02:00
Ole Markus With c260cf69b3 Log errors from detachInstance 2022-06-27 19:58:16 +02:00
Ciprian Hacman b5f14b589b Add initial support for Hetzner Cloud 2022-05-09 06:12:15 +03:00
Ole Markus With 2ba9c1670f Only delete node object on GCE 2022-03-06 07:34:52 +01:00
John Gardiner Myers 70f7d9bdb2 Use function to get cloud provider from cluster spec 2022-03-02 21:59:47 -08:00
Bronson Mirafuentes 86b0ef0d0c add drain-timeout flag to rolling-update cluster 2022-01-20 14:05:55 -08:00
Jesse Haka b88d110f58 Drain OpenStack loadbalancers 2021-12-31 13:16:02 +02:00
Ole Markus With 5e944f1a15 Do not try to detach karpenter nodes from ASGs 2021-12-15 09:56:33 +01:00
Ciprian Hacman ea7df00719 Run hack/update-gofmt.sh 2021-12-01 22:39:50 +02:00
John Gardiner Myers 4396270d74 Fix out of bounds error when instance detach fails 2021-11-08 23:00:28 -08:00
John Gardiner Myers d46ee9c883 Exclude nodes from load balancers upon cordoning 2021-04-20 17:58:26 -07:00
Ole Markus With 09615935fd Make kOps CLI handle ASG warm pools 2021-04-15 11:10:23 +02:00
Ole Markus With ab1b85818d Pass ctx to drain helper
In some rare cases, we hit a nil pointer dereference because the k8s code tries to use the
ctx we are not passing.
2021-03-26 10:29:11 +01:00
Markos Chandras 0a49650c70 aws: Graceful handling of EC2 detach errors
Sometimes, we observe the following error during a rolling update:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

The sequence of events that leads to this problem is the following:

- A new ASG object is being built from the launch template
- Existing instances are being added to it
- An existing instance is being ignored because it's already terminating
W0205 08:01:32.593377     191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
- Due to maxSurge, kOps tries to detach the terminating instance
  from the autoscaling group, and the detach fails.

As such, in case of EC2 ASG detach failures we can simply try to detach
the next node instead of aborting the whole update operation.
2021-03-05 15:01:30 +02:00
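The behaviour described in the commit above, skipping an instance whose detach fails and moving on to the next candidate, can be sketched as follows. This is a simplified illustration, not the kOps implementation: `detachUpTo`, `failFor`, and the instance IDs are hypothetical, and the real code calls the cloud provider instead of a stub.

```go
package main

import (
	"fmt"
	"log"
)

// detachUpTo tries to detach at most maxSurge instances from the candidates.
// A failed detach is logged and skipped instead of aborting the whole update.
func detachUpTo(candidates []string, maxSurge int, detach func(id string) error) (detached, failed []string) {
	for _, id := range candidates {
		if len(detached) >= maxSurge {
			break
		}
		if err := detach(id); err != nil {
			log.Printf("skipping instance %s: %v", id, err)
			failed = append(failed, id)
			continue
		}
		detached = append(detached, id)
	}
	return detached, failed
}

// failFor returns a stub detach function that fails for the given IDs,
// standing in for an instance that is already terminating.
func failFor(ids ...string) func(string) error {
	bad := map[string]bool{}
	for _, id := range ids {
		bad[id] = true
	}
	return func(id string) error {
		if bad[id] {
			return fmt.Errorf("instance %s is not part of the Auto Scaling group", id)
		}
		return nil
	}
}

func main() {
	detached, failed := detachUpTo([]string{"i-a", "i-b", "i-c"}, 2, failFor("i-a"))
	fmt.Println(detached, failed)
}
```

Collecting results with `append` instead of indexing into a pre-sized slice also avoids index errors when some detaches fail.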
Ole Markus With 5a2f1274fb Don't try to detach masters 2020-11-28 09:44:42 +01:00
Kubernetes Prow Robot 0b5646e94a Merge pull request #10266 from rifelpet/k8s120
Update k8s dependencies to 1.20.0-beta.2
2020-11-18 10:48:07 -08:00
Peter Rifel 47354ce010 Update kubectl drain fields for 1.20 2020-11-18 11:55:03 -06:00
Bharath Vedartham 208199ba85 instancegroups: Clear out the TODO comment
Now that we are able to associate pod validation failures with the
instance groups, we can remove the TODO comment.
2020-11-15 11:07:45 +05:30
Kubernetes Prow Robot 7b26ec4b6d Merge pull request #10065 from bharath-123/feature/instancegroup-specific-validation
Avoid waiting on validation during rolling update for inapplicable instance groups
2020-11-05 22:38:50 -08:00
zouyu 2e6b50f9e4 Some typos
Signed-off-by: zouyu <zouy.fnst@cn.fujitsu.com>
2020-11-03 16:28:30 +08:00
Bharath Vedartham 7067f5f47a instancegroups: Ignore validation errors in unrelated instance groups
When unrelated instance groups produce validation errors, validation of the instance group
being updated fails too, and the rolling update is forced to wait before continuing.

This can be avoided, as failures in other node instance groups usually don't affect
the instance group being updated in any way.
2020-10-31 19:17:24 +05:30
Srikanth Rao 4d251fe900 [Digital Ocean] Implement Delete Instance logic for rolling update (#10000)
* Add delete Instance implementation for DO

* Add warning for DeleteInstance usage

* Use reconcile option for rolling update

* Update pkg/instancegroups/instancegroups.go

Co-authored-by: Ciprian Hacman <ciprianhacman@gmail.com>
2020-10-13 10:06:27 -07:00
Ole Markus With aa66c4f6d8 Add rolling upgrade to openstack 2020-10-01 20:07:44 +02:00
Ole Markus With 63f13322d5 Don't pass ctx and cluster everywhere 2020-09-23 08:30:24 +02:00
Ole Markus With 0ec71686b9 Refactor cloudinstancegroupmember in a more independent cloud instance representation
Apply suggestions from code review

Co-authored-by: John Gardiner Myers <jgmyers@proofpoint.com>
2020-08-30 21:37:03 +02:00
Ole Markus With ff6c04938d Add kops delete instance command
Add support for deleting instance by k8s node name

Add yes flag
2020-08-28 08:43:30 +02:00
Peter Rifel 4d9f0128a3 Upgrade to klog2
This splits up the kubernetes 1.19 PR to make it easier to keep up to date until we get it sorted out.
2020-08-16 20:56:48 -05:00
John Gardiner Myers cc2b647d06 Create separate field for disabling rolling updates 2020-06-19 22:19:26 -07:00
ZouYu 2fc52ec6be fix some go-lint warning
Signed-off-by: ZouYu <zouy.fnst@cn.fujitsu.com>
2020-06-09 08:52:50 +08:00
John Gardiner Myers 091893fd20 Simplify rolling update internal methods 2020-05-29 10:52:03 -07:00
John Gardiner Myers dd884a6a64 fix missing space
Co-authored-by: Peter Rifel <rifelpet@users.noreply.github.com>
2020-05-29 10:35:15 -07:00
John Gardiner Myers 7756be7fbc Try validating multiple times before updating instancegroup 2020-05-22 20:26:02 -07:00
John Gardiner Myers df7e0b18b6 Ignore already-deleted nodes during rolling update 2020-04-26 21:41:54 -07:00
Justin Santa Barbara ffb6cd61aa Rolling-update validation harmonization
This is a follow-on to #8868; I believe the intent of that was to
expose the option to do more (or fewer) retries.

We previously had a single retry to prevent flapping; this basically
unifies the previous behaviour with the idea of making it
configurable.

* validate-count=0 effectively turns off validation.

* validate-count=1 will do a single validation, without flapping
  detection.

* validate-count>=2 will require N successful validations in a row,
waiting ValidateSuccessDuration in between.

A nice side-effect of this is that the tests now explicitly specify
ValidateCount=1 instead of setting ValidateSuccessDuration=0, which
had the side effect of doing the equivalent to ValidateCount=1.
2020-04-17 01:40:02 -04:00
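The validate-count semantics listed in the commit above can be sketched as a small loop. This is an illustration under stated assumptions, not the kOps code: `waitForValidation`, `validateFunc`, `scripted`, and the `maxAttempts` bound are hypothetical (the real code is assumed to use a timeout instead of an attempt limit), and `successInterval` stands in for ValidateSuccessDuration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateFunc is a hypothetical stand-in for one cluster validation call.
type validateFunc func() error

// waitForValidation sketches the three cases from the commit message:
//   - validateCount == 0: validation is skipped entirely
//   - validateCount == 1: a single success is enough (no flapping detection)
//   - validateCount >= 2: N successes in a row are required, waiting
//     successInterval between them; any failure resets the streak
//
// For brevity, failed attempts are retried immediately.
func waitForValidation(validate validateFunc, validateCount, maxAttempts int, successInterval time.Duration) error {
	if validateCount == 0 {
		return nil
	}
	streak := 0
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := validate(); err != nil {
			streak = 0 // a flap resets the run of successes
			continue
		}
		streak++
		if streak >= validateCount {
			return nil
		}
		time.Sleep(successInterval)
	}
	return errors.New("cluster did not validate")
}

// scripted returns a validateFunc that replays the given results in order
// (false once the script is exhausted), plus a pointer to its call counter.
func scripted(results ...bool) (validateFunc, *int) {
	calls := 0
	return func() error {
		ok := calls < len(results) && results[calls]
		calls++
		if !ok {
			return errors.New("validation failed")
		}
		return nil
	}, &calls
}

func main() {
	// A flapping cluster: fail, ok, fail, ok, ok.
	v, calls := scripted(false, true, false, true, true)
	err := waitForValidation(v, 2, 10, time.Millisecond)
	fmt.Println(err, *calls)
}
```

With the flapping sequence in `main`, a count of 2 only succeeds once two consecutive validations pass, whereas a count of 1 would accept the first success.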
Justin Santa Barbara 31bb16d4d1 Add context.Context to most signatures
The client-go signature for most methods adds a context.Context
object, and also makes Options mandatory.  Feed through a
context.Context through many of our methods (but use context.TODO to
stop it getting totally out of hand!)
2020-04-11 14:44:17 -04:00
Jesse Haka 11eaacd53e validationtimes -> validationcount 2020-04-08 13:55:29 +03:00