Commit Graph

219 Commits

Author SHA1 Message Date
John Gardiner Myers 70f7d9bdb2 Use function to get cloud provider from cluster spec 2022-03-02 21:59:47 -08:00
Bronson Mirafuentes 86b0ef0d0c add drain-timeout flag to rolling-update cluster 2022-01-20 14:05:55 -08:00
Jesse Haka b88d110f58 Drain OpenStack loadbalancers 2021-12-31 13:16:02 +02:00
Ole Markus With 5e944f1a15 Do not try to detach karpenter nodes from ASGs 2021-12-15 09:56:33 +01:00
Ole Markus With b785965c50 Rename InstanceManager to Manager 2021-12-13 09:14:24 +01:00
Ole Markus With 1ccb7840ac make rolling update work 2021-12-12 19:33:41 +01:00
Ciprian Hacman ea7df00719 Run hack/update-gofmt.sh 2021-12-01 22:39:50 +02:00
Kubernetes Prow Robot ec7fe88868
Merge pull request #12730 from johngmyers/fix-deprecated
Fix use of deprecated method
2021-11-13 23:22:46 -08:00
John Gardiner Myers c5914d6ddb Fix use of deprecated method 2021-11-13 20:29:52 -08:00
John Gardiner Myers 4396270d74 Fix out of bounds error when instance detach fails 2021-11-08 23:00:28 -08:00
John Gardiner Myers d935a419f8 Simplify AddSSHPublicKey() interface 2021-07-24 08:59:57 -07:00
John Gardiner Myers e0915887ed Move asset copying out of apply_cluster 2021-06-05 21:17:50 -07:00
John Gardiner Myers d46ee9c883 Exclude nodes from load balancers upon cordoning 2021-04-20 17:58:26 -07:00
Ole Markus With 09615935fd Make kOps CLI handle ASG warm pools 2021-04-15 11:10:23 +02:00
Ole Markus With ab1b85818d Pass ctx to drain helper
In some rare cases, we hit an NPE because the k8s code tries to use the
ctx that we are not passing.
2021-03-26 10:29:11 +01:00
Ole Markus With 20bd724f5e Add support for scaling out the control plane with dedicated apiserver nodes
Ensure apiserver role can only be used on AWS (because of firewalling)

Apply api-server label to CP as well

Consolidate node not ready validation message

Guard apiserver nodes with a feature flag

Rename Apiserver role to APIServer

Add an integration test for apiserver nodes

Enumerate all roles in rolling update docs

Apply suggestions from code review

Co-authored-by: Steven E. Harris <seh@panix.com>
2021-03-20 20:57:00 +01:00
Markos Chandras 0a49650c70
aws: Graceful handling of EC2 detach errors
Sometimes, we observe the following error during a rolling update:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

The sequence of events that leads to this problem is the following:

- A new ASG object is being built from the launch template
- Existing instances are being added to it
- An existing instance is being ignored because it's already terminating
W0205 08:01:32.593377     191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
- Due to maxSurge, an attempt is made to detach the terminating instance
  from the autoscaling group, and it fails.

As such, in case of EC2 ASG detach failures we can simply try to detach
the next node instead of aborting the whole update operation.
2021-03-05 15:01:30 +02:00
Jesse Haka 46de9f145e update gophercloud dependency 2021-01-11 14:48:22 +02:00
Ole Markus With 5a2f1274fb Don't try to detach masters 2020-11-28 09:44:42 +01:00
Kubernetes Prow Robot 0b5646e94a
Merge pull request #10266 from rifelpet/k8s120
Update k8s dependencies to 1.20.0-beta.2
2020-11-18 10:48:07 -08:00
Peter Rifel 47354ce010
Update kubectl drain fields for 1.20 2020-11-18 11:55:03 -06:00
Kubernetes Prow Robot 92911d7dcf
Merge pull request #10167 from olemarkus/cilium-ondelete
Make it possible to use OnDelete update strategy on addon daemonset
2020-11-16 12:38:03 -08:00
Ole Markus With 2659a30280 Make get instances respect needs-update annotation
Make it possible for addons to set needs-update annotation

Use onDelete update strategy for cilium and set needs-update annotation

Rename node roles
2020-11-16 08:26:17 +01:00
Bharath Vedartham 208199ba85 instancegroups: Clear out the TODO comment
Now that we are able to associate pod validation failures with
instance groups, we can remove the TODO comment.
2020-11-15 11:07:45 +05:30
Jesse Haka bd2dcc93ca fix test 2020-11-06 11:17:23 +02:00
Kubernetes Prow Robot 7b26ec4b6d
Merge pull request #10065 from bharath-123/feature/instancegroup-specific-validation
Avoid waiting on validation during rolling update for inapplicable instance groups
2020-11-05 22:38:50 -08:00
zouyu 2e6b50f9e4 Some typos
Signed-off-by: zouyu <zouy.fnst@cn.fujitsu.com>
2020-11-03 16:28:30 +08:00
Bharath Vedartham 1e18a5d344 rollingupdate_test: add tests for rolling update
The tests create a cluster with 2 node instance groups and 1 master and bastion instance groups.
Only one node instance group requires rolling update.

instanceGroupNodeSpecificErrorClusterValidator mocks a validation failure for a given node group.
A rolling update should not fail if the cluster validator reports an error in an unrelated instance group.
2020-10-31 19:17:45 +05:30
Bharath Vedartham 7067f5f47a instancegroups: Ignore validation errors in unrelated instance groups
When unrelated instance groups produce validation errors, validation fails for the
instance group being updated, and the rolling update is forced to wait before continuing.

This can be avoided, as failures in other node instance groups usually don't affect
the instance group being updated in any way.
2020-10-31 19:17:24 +05:30
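One way to picture the filtering described in the commit above is the sketch below. It is illustrative only: `validationFailure` and `blockingFailures` are hypothetical names, and treating failures with no instance group as cluster-wide (and therefore still blocking) is an assumption made here, not something the commit message states.

```go
package main

import "fmt"

// validationFailure is a hypothetical, simplified record of one failed check.
type validationFailure struct {
	InstanceGroup string // empty means the failure is not tied to a group
	Message       string
}

// blockingFailures returns only the failures that should hold up the
// rolling update of the instance group named updating.
func blockingFailures(failures []validationFailure, updating string) []validationFailure {
	var blocking []validationFailure
	for _, f := range failures {
		// Assumption: failures without a group are cluster-wide and still block.
		if f.InstanceGroup == "" || f.InstanceGroup == updating {
			blocking = append(blocking, f)
		}
	}
	return blocking
}

func main() {
	failures := []validationFailure{
		{InstanceGroup: "nodes-b", Message: "node not ready"},
		{InstanceGroup: "nodes-a", Message: "pod not ready"},
	}
	// Only the failure in the group being updated is reported.
	for _, f := range blockingFailures(failures, "nodes-a") {
		fmt.Printf("%s: %s\n", f.InstanceGroup, f.Message)
	}
}
```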
Ciprian Hacman e0332177b3 Skip failing test 2020-10-15 07:46:47 +03:00
Srikanth Rao 4d251fe900
[Digital Ocean] Implement Delete Instance logic for rolling update (#10000)
* Add delete Instance implementation for DO

* Add warning for DeleteInstance usage

* Use reconcile option for rolling update

* Update pkg/instancegroups/instancegroups.go

Co-authored-by: Ciprian Hacman <ciprianhacman@gmail.com>
2020-10-13 10:06:27 -07:00
Ole Markus With aa66c4f6d8 Add rolling upgrade to openstack 2020-10-01 20:07:44 +02:00
Ole Markus With a39beb20c8 Rolling update test for OS 2020-10-01 20:07:44 +02:00
Ole Markus With 63f13322d5 Don't pass ctx and cluster everywhere 2020-09-23 08:30:24 +02:00
Ole Markus With 0ec71686b9 Refactor cloudinstancegroupmember in a more independent cloud instance representation
Apply suggestions from code review

Co-authored-by: John Gardiner Myers <jgmyers@proofpoint.com>
2020-08-30 21:37:03 +02:00
Ole Markus With ff6c04938d Add kops delete instance command
Add support for deleting instance by k8s node name

Add yes flag
2020-08-28 08:43:30 +02:00
Peter Rifel 4d9f0128a3
Upgrade to klog2
This splits up the kubernetes 1.19 PR to make it easier to keep up to date until we get it sorted out.
2020-08-16 20:56:48 -05:00
Ciprian Hacman 5a9cc3d216 Fix int to string conversions 2020-07-26 09:09:52 +03:00
Kubernetes Prow Robot 219147e2f4
Merge pull request #9348 from johngmyers/rollingupdate-disable
Create separate field for disabling rolling updates
2020-07-02 09:08:47 -07:00
John Gardiner Myers 494209884a Rolling update instance groups in consistent order 2020-06-20 10:58:06 -07:00
John Gardiner Myers cc2b647d06 Create separate field for disabling rolling updates 2020-06-19 22:19:26 -07:00
ZouYu 2fc52ec6be fix some go-lint warnings
Signed-off-by: ZouYu <zouy.fnst@cn.fujitsu.com>
2020-06-09 08:52:50 +08:00
Kubernetes Prow Robot 2a613f1331
Merge pull request #9165 from johngmyers/retry-initial
Try validating multiple times before updating instancegroup
2020-05-29 12:07:33 -07:00
John Gardiner Myers 091893fd20 Simplify rolling update internal methods 2020-05-29 10:52:03 -07:00
John Gardiner Myers dd884a6a64
fix missing space
Co-authored-by: Peter Rifel <rifelpet@users.noreply.github.com>
2020-05-29 10:35:15 -07:00
John Gardiner Myers 7756be7fbc Try validating multiple times before updating instancegroup 2020-05-22 20:26:02 -07:00
John Gardiner Myers af90ecdddf Reduce test flakiness 2020-05-22 19:33:01 -07:00
John Gardiner Myers df7e0b18b6 Ignore already-deleted nodes during rolling update 2020-04-26 21:41:54 -07:00
Justin Santa Barbara ffb6cd61aa Rolling-update validation harmonization
This is a follow-on to #8868; I believe the intent of that was to
expose the option to do more (or fewer) retries.

We previously had a single retry to prevent flapping; this basically
unifies the previous behaviour with the idea of making it
configurable.

* validate-count=0 effectively turns off validation.

* validate-count=1 will do a single validation, without flapping
  detection.

* validate-count>=2 will require N successful validations in a row,
  waiting ValidateSuccessDuration in between.

A nice side-effect of this is that the tests now explicitly specify
ValidateCount=1 instead of setting ValidateSuccessDuration=0, which
had the side effect of doing the equivalent to ValidateCount=1.
2020-04-17 01:40:02 -04:00
Justin Santa Barbara 31bb16d4d1 Add context.Context to most signatures
The client-go signature for most methods adds a context.Context
object, and also makes Options mandatory.  Feed a
context.Context through many of our methods (but use context.TODO to
stop it getting totally out of hand!)
2020-04-11 14:44:17 -04:00
Jesse Haka 11eaacd53e validationtimes -> validationcount 2020-04-08 13:55:29 +03:00
Jesse Haka e1e79790ef validate cluster n times in rolling update 2020-04-08 13:55:24 +03:00
John Gardiner Myers ea3b8d7710 make gomod 2020-04-05 10:22:51 -07:00
John Gardiner Myers 6844eef4ca Switch to the k/k implementation of drain.Helper 2020-04-05 10:22:49 -07:00
Peter Rifel a999b3ea61 fix OWNERS labels format
These need to be lists
2020-03-10 22:47:50 -05:00
Kubernetes Prow Robot 8ecc5edb73
Merge pull request #8272 from johngmyers/need-update-annotation
Support the kops.k8s.io/needs-update annotation on nodes
2020-03-10 09:01:35 -07:00
Kubernetes Prow Robot db435ee7cd
Merge pull request #8717 from rifelpet/owners-labels
Add labels to OWNERS files
2020-03-10 08:23:51 -07:00
Peter Rifel 237a125f2c Add labels to OWNERS files
This will automatically label PRs that touch these directories.

This makes it easier to query GitHub for PRs that affect certain areas of the code.

I mostly used existing labels but created some new ones as well.
2020-03-10 08:35:58 -05:00
John Gardiner Myers 33e23166e4 Support the kops.k8s.io/needs-update annotation on nodes 2020-03-09 22:43:09 -07:00
John Gardiner Myers 03eb8246c7 Refactor/simplify rolling update 2020-03-09 11:05:58 -07:00
John Gardiner Myers e104cdb982 Default maxSurge to 1 on AWS 2020-03-04 19:41:51 -08:00
John Gardiner Myers 99100dc4a0 Fix flaky test 2020-03-03 20:54:22 -08:00
John Gardiner Myers ed73726195 Address review comments 2020-02-28 21:05:43 -08:00
John Gardiner Myers 38b7219b14 Remove code made unnecessary by apimachinery validation 2020-01-27 20:45:17 -08:00
John Gardiner Myers ebfcf5d909 Implement recovery from previous failed surge rolling updates 2020-01-27 20:45:16 -08:00
John Gardiner Myers cee662d521 Implement MaxSurge happy path 2020-01-27 20:45:16 -08:00
John Gardiner Myers 4ddc58ca5e Add MaxSurge to resolveSettings 2020-01-27 20:45:16 -08:00
John Gardiner Myers 640f5f5b74 Terminate AWS instances through EC2 instead of Autoscaling 2020-01-27 20:15:10 -08:00
John Gardiner Myers d56ad41334 Address review comments 2020-01-26 14:18:51 -08:00
John Gardiner Myers 5f72d12132 Reduce test flakiness 2020-01-12 21:27:55 -08:00
John Gardiner Myers 10d6416b8e Allow MaxConcurrency for masters and bastions 2020-01-11 18:50:35 -08:00
John Gardiner Myers 0c3651c9c8 Implement MaxUnavailable 2020-01-05 12:09:55 -08:00
John Gardiner Myers 0952374027 Extract maybeValidate 2020-01-05 12:09:55 -08:00
John Gardiner Myers 91f4920537 Extract drainTerminateAndWait() 2020-01-05 12:09:55 -08:00
John Gardiner Myers adaf903b90 Create resolveSettings 2020-01-05 12:09:54 -08:00
Kubernetes Prow Robot 121d9f461f
Merge pull request #7909 from johngmyers/remove-drain-feature-flag
Remove DrainAndValidateRollingUpdate feature flag
2020-01-05 11:15:40 -08:00
Kubernetes Prow Robot a22af4fa80
Merge pull request #8239 from johngmyers/simplify-rolling
Simplify code for rolling updates of nodes
2020-01-04 13:13:40 -08:00
John Gardiner Myers 01dd793604 Specify number of NotReady instances in makeGroup() parameter 2020-01-04 10:47:08 -08:00
John Gardiner Myers 39f849271b Fold setUpCloud() into getGroups() 2020-01-04 09:08:00 -08:00
John Gardiner Myers 612e4ae484 Extract creation of the CloudInstanceGroup 2020-01-04 09:08:00 -08:00
John Gardiner Myers cba59afac4 Change taint key per review comment 2020-01-03 10:07:21 -08:00
John Gardiner Myers cd499f6f09 Remove unused code 2019-12-31 14:33:05 -08:00
John Gardiner Myers 0cbd76ecfb Simplify code for rolling updates of nodes 2019-12-31 10:25:55 -08:00
John Gardiner Myers 97ad2c3b54 Taint nodes needing update 2019-12-30 16:06:00 -08:00
John Gardiner Myers 5189cc1ef6 Add a third instance to each nodes group in rolling update tests 2019-12-30 13:48:37 -08:00
John Gardiner Myers 92581ab4a1 Create nodes for instances in rolling update tests 2019-12-30 13:48:37 -08:00
John Gardiner Myers 77769855af Return groups from getTestSetup() 2019-12-30 13:48:34 -08:00
John Gardiner Myers 1d3d5c1d2f pkg/instancegroups - fix static check 2019-12-22 20:56:27 -08:00
Justin Santa Barbara 84835ce0ba
Update pkg/instancegroups/rollingupdate_test.go
Co-Authored-By: John Gardiner Myers <jgmyers@proofpoint.com>
2019-12-17 21:25:18 -05:00
Peter Rifel a24d9b6455
remove more trailing whitespace 2019-12-17 13:03:16 -06:00
Peter Rifel 85a1d23c18
remove trailing whitespace that was breaking gofmt 2019-12-17 12:49:20 -06:00
Justin Santa Barbara 8373c9fc4d tests: increase timeout in rolling update tests
We never know when e.g. a GC is going to delay us, so we need a lot
more padding on these timeouts.
2019-12-17 09:59:21 -05:00
John Gardiner Myers 2850826a52 Improve logging of cluster revalidation 2019-12-13 13:48:47 -08:00
John Gardiner Myers 19e165759b Add unit test for flapping validation 2019-12-13 13:45:21 -08:00
Jesse Haka 44183aef7f validate cluster twice 2019-12-12 08:48:15 +02:00
John Gardiner Myers 1239c05e71 Validate after updating bastion 2019-12-09 18:45:51 -08:00
John Gardiner Myers 2e36124f77 Expose ValidateTickDuration for use by unit tests 2019-12-09 18:43:20 -08:00
Kashif Saadat fcf6f0098c Canal Typha spec and apimachinery 2019-12-06 15:36:48 +00:00
John Gardiner Myers 4eccd3d53f Remove DrainAndValidateRollingUpdate feature flag 2019-12-05 22:50:04 -08:00
John Gardiner Myers 38b19e53b4 Add a second master to rolling update tests 2019-11-19 16:55:39 -08:00