There is a race condition between the node group size and the deletion operation. Updating the actuation status before the cache refresh should change the incorrect behavior from the node group size going below its minimum value to, in the worst case, requiring an additional loop before scale-down.
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 144a64a402)
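A minimal sketch of the reordering described in the commit above; the function names are assumptions made purely for illustration, not the actual autoscaler code.

```go
package main

import "fmt"

// The actuation status, which records in-flight deletions, is updated
// before the node group cache is refreshed, so a size read taken afterwards
// can no longer drop the node group below its minimum; at worst, scale-down
// needs one more loop iteration before it can proceed.
func runIteration() {
	updateActuationStatus() // record in-flight deletions first ...
	refreshCache()          // ... then refresh cached node group sizes
}

func updateActuationStatus() { fmt.Println("actuation status updated") }
func refreshCache()          { fmt.Println("node group cache refreshed") }

func main() { runIteration() }
```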
fix: set `replicated` to true if the controller ref is set
- forgot to add this in the last commit
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit f8f458295d)
fix: remove `checkReferences`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 5df6e31f8b)
test(drain): add test for custom controller pod
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add flag to allow scale down on custom controller pods
- set to `false` by default
- the default will be changed to `true` in the future
- for now, we want to ensure backwards compatibility and only enable the feature if the flag is explicitly set to `true`
- TODO: look into adding unit tests for this code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
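A minimal sketch of the default described in this commit, using the flag name that appears in the commits below; the registration uses Go's standard flag package purely for illustration and is not the actual Cluster Autoscaler wiring.

```go
package main

import (
	"flag"
	"fmt"
)

// The flag defaults to false so existing behavior is preserved; users opt
// in by setting it explicitly. (Illustrative registration only; the real
// autoscaler wires its flags differently.)
var allowScaleDownOnCustomControllerOwnedPods = flag.Bool(
	"allow-scale-down-on-custom-controller-owned-pods",
	false,
	"If true, nodes running pods owned by custom controllers can be scaled down",
)

func main() {
	flag.Parse()
	fmt.Println("allow-scale-down-on-custom-controller-owned-pods:", *allowScaleDownOnCustomControllerOwnedPods)
}
```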
fix: remove the `at` symbol prefix from `vadasambar`
- to keep it consistent with previous such mentions in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils): run all drain tests twice
- once for `allowScaleDownOnCustomControllerOwnedPods=false`
- and once for `allowScaleDownOnCustomControllerOwnedPods=true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs(utils): add description for `testOpts` struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update FAQ with info about `allow-scale-down-on-custom-controller-owned-pods` flag
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allow-scale-down-on-custom-controller-owned-pods` -> `skip-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allowScaleDownOnCustomControllerOwnedPods` -> `skipNodesWithCustomControllerPods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils/drain): fix failing tests
- refactor code to add custom controller pod test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix long code comments
- clean up print statements
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: move `expectFatal` right above where it is used
- makes the code easier to read
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix code comment wording
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: address PR comments
- abstract legacy code to check for replicated pods into a separate function so that it's easier to remove in the future
- fix param info in the FAQ.md
- simplify tests and remove the global variable used in the tests
- rename `--skip-nodes-with-custom-controller-pods` -> `--scale-down-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename flag `--scale-down-nodes-with-custom-controller-pods` -> `--skip-nodes-with-custom-controller-pods`
- refactor tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update flag info
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: update a flag name that was missed on one line of the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use `ControllerRef()` directly instead of `controllerRef`
- we don't need an extra variable
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: create consolidated test cases
- by looping over and tweaking shared test cases
- so that we don't have to duplicate the shared test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: append test flag to shared test description
- so that the failed test is easy to identify
- shallow copy tests and add comments so that others do the same
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
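A compact sketch of the test pattern described in the commits above; the struct fields and case contents are illustrative, not the real drain test fixtures.

```go
package drain_test

import (
	"fmt"
	"testing"
)

// testOpts stands in for the real shared test case struct used by the
// drain tests; only the fields needed for the sketch are included.
type testOpts struct {
	description string
	pods        []string
}

func TestDrainSharedCases(t *testing.T) {
	sharedCases := []testOpts{
		{description: "RC-owned pod", pods: []string{"rc-pod"}},
		{description: "custom-controller-owned pod", pods: []string{"custom-pod"}},
	}
	for _, skip := range []bool{false, true} {
		for _, shared := range sharedCases {
			// Shallow copy the shared case so per-flag tweaks (like the
			// description suffix below) don't leak into other iterations.
			tc := shared
			tc.description = fmt.Sprintf("%s, skipNodesWithCustomControllerPods=%v", tc.description, skip)
			t.Run(tc.description, func(t *testing.T) {
				_ = skip // the real test passes this into the drain options
				_ = tc.pods
			})
		}
	}
}
```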
* Rename scaleup.Manager to scaleup.Orchestrator
* Remove factory and add Initialize function
* Rename the wrapper package to orchestrator
* Rename NewOrchestrator func to just New
* Simplify the ScaleUp* functions parameter list
* Introduce the ScaleUpManagerFactory to allow greater expandability
* Simplify helper functions in scale up wrapper
* Make the SkippedReasons public and move those to a dedicated file
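Roughly, the post-refactor shape looks something like the sketch below; the method signatures and the Options type are assumptions made for illustration, not the actual cluster-autoscaler API.

```go
package scaleup

// Orchestrator is the renamed scaleup.Manager.
type Orchestrator interface {
	// Initialize replaces the removed factory: dependencies are injected
	// after construction rather than at creation time.
	Initialize(opts Options)
	// ScaleUp stands in for the simplified ScaleUp* parameter lists; the
	// real signatures are not spelled out in this log.
	ScaleUp(unschedulablePods []string) error
}

// Options stands in for the real dependency set (autoscaling context,
// processors, estimators, ...).
type Options struct{}

// New replaces the old NewOrchestrator constructor.
func New() Orchestrator { return &orchestrator{} }

type orchestrator struct{ opts Options }

func (o *orchestrator) Initialize(opts Options) { o.opts = opts }

func (o *orchestrator) ScaleUp(unschedulablePods []string) error { return nil }
```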
Without this change, aggressive settings could cause scale-down to remove
registered upcoming nodes before they have a chance to become ready
(and the time that takes should be unrelated to the scale-down settings).
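A minimal sketch of the guard described above; the node type and fields are illustrative stand-ins, not the autoscaler's own structures.

```go
package main

import "fmt"

// Registered-but-not-yet-ready "upcoming" nodes are excluded from
// scale-down candidates, because the time they need to become ready has
// nothing to do with how aggressive the scale-down settings are.
type node struct {
	name  string
	ready bool
}

func scaleDownCandidates(nodes []node) []node {
	var candidates []node
	for _, n := range nodes {
		if !n.ready {
			// Upcoming node: skip it instead of letting aggressive
			// scale-down settings delete it before it becomes ready.
			continue
		}
		candidates = append(candidates, n)
	}
	return candidates
}

func main() {
	nodes := []node{
		{name: "ready-node", ready: true},
		{name: "upcoming-node", ready: false},
	}
	fmt.Println(scaleDownCandidates(nodes)) // only ready-node remains
}
```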
The scaledown:nodedeletion metric duration was incorrectly computed relative to the start of the RunOnce routine instead of from the actual start of the deletion. Work done early in the routine (like a long cloud provider refresh) would incorrectly skew the node deletion duration.
Signed-off-by: Domenic Bozzuto <dom.bozzuto@datadoghq.com>
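A minimal sketch of the fix described above; the function names are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// The node deletion duration is measured from the moment deletion actually
// starts, not from the start of the whole RunOnce iteration, so earlier
// work such as a slow cloud provider refresh no longer skews the metric.
func runOnce() {
	refreshCloudProvider() // may be slow; must not count towards deletion time

	deletionStart := time.Now()
	deleteNodes()
	fmt.Println("node deletion took", time.Since(deletionStart))
}

func refreshCloudProvider() { time.Sleep(10 * time.Millisecond) }
func deleteNodes()          { time.Sleep(5 * time.Millisecond) }

func main() { runOnce() }
```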
Node state is refreshed and checked again before deleting the node.
This gives kube-scheduler time to acknowledge that the nodes' state has
changed and to stop scheduling pods onto them.
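A rough sketch of the re-check described above; the helpers and the nodeState type are stand-ins, not real autoscaler functions.

```go
package main

import (
	"errors"
	"fmt"
)

// Node state is fetched again right before deletion so that, after
// tainting, kube-scheduler has had time to observe the change and stop
// placing pods on the node.
type nodeState struct {
	name        string
	tainted     bool
	newPodCount int
}

func refreshNodeState(name string) nodeState {
	// The real code would re-read the node and its pods from the cluster;
	// a fixed value is returned here purely for illustration.
	return nodeState{name: name, tainted: true, newPodCount: 0}
}

func deleteNodeSafely(name string) error {
	n := refreshNodeState(name) // refresh instead of trusting the earlier snapshot
	if !n.tainted || n.newPodCount > 0 {
		return errors.New("node state changed, aborting deletion")
	}
	fmt.Println("deleting", n.name)
	return nil
}

func main() {
	if err := deleteNodeSafely("node-1"); err != nil {
		fmt.Println(err)
	}
}
```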
Various cloud providers' `NodeGroupForNode()` implementations (including
aws, azure, and gce) can return a `nil` error _and_ a `nil` nodegroup.
E.g. we're seeing AWS return that on failed upscales on live clusters.
Checking that `deleteCreatedNodesWithErrors` doesn't return an error is
not enough to safely dereference the nodegroup (as returned by
`NodeGroupForNode()`) by calling `nodegroup.Id()`.
In that situation, logging and returning early seems the safest option,
giving the various caches (e.g. the clusterstateregistry's and the cloud
provider's) the opportunity to eventually converge.
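A minimal sketch of the defensive check described above; the interfaces and the fake provider are stand-ins for the real cloudprovider types.

```go
package main

import (
	"fmt"
	"log"
)

type NodeGroup interface{ Id() string }

type CloudProvider interface {
	NodeGroupForNode(nodeName string) (NodeGroup, error)
}

func deleteCreatedNodeWithErrors(p CloudProvider, nodeName string) error {
	nodeGroup, err := p.NodeGroupForNode(nodeName)
	if err != nil {
		return err
	}
	// Some implementations (aws, azure, gce, ...) can return a nil node
	// group together with a nil error; dereferencing it would panic, so
	// log and return early to give the caches a chance to converge.
	if nodeGroup == nil {
		log.Printf("no node group found for node %q, skipping", nodeName)
		return nil
	}
	fmt.Println("deleting node from node group", nodeGroup.Id())
	return nil
}

// fakeProvider mimics a provider that cannot map a node after a failed upscale.
type fakeProvider struct{}

func (fakeProvider) NodeGroupForNode(string) (NodeGroup, error) { return nil, nil }

func main() {
	_ = deleteCreatedNodeWithErrors(fakeProvider{}, "ip-10-0-0-1")
}
```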
NodeDeletionTracker is now incremented asynchronously
for drained nodes, instead of synchronously. This shouldn't
change actual behavior, but some tests depended on the previous
ordering, so they had to be adapted.
The switch aims to mostly be a semantic no-op, with
the following exceptions:
* Nodes that fail to be tainted won't be included in
NodeDeleteResults, since they are now tainted
synchronously.
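A minimal sketch of the change described above; the tracker here is a simplified stand-in for NodeDeletionTracker.

```go
package main

import (
	"fmt"
	"sync"
)

// For drained nodes the tracker is now updated from the asynchronous
// deletion goroutine rather than synchronously by the caller, which is why
// tests that observed the counter immediately after scheduling a deletion
// had to be adapted.
type deletionTracker struct {
	mu      sync.Mutex
	drained int
}

func (t *deletionTracker) StartDeletionWithDrain() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.drained++
}

func deleteDrainedNode(t *deletionTracker, wg *sync.WaitGroup) {
	defer wg.Done()
	t.StartDeletionWithDrain() // the increment now happens on the async path
}

func main() {
	t := &deletionTracker{}
	var wg sync.WaitGroup
	wg.Add(1)
	go deleteDrainedNode(t, &wg)
	wg.Wait()
	fmt.Println("drained deletions tracked:", t.drained)
}
```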
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting
nodes into the cache. One minute should be more than enough, except for
some pathological edge cases.
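A minimal sketch of the grace period described above; the constant and field names are illustrative, not the actual cache implementation.

```go
package main

import (
	"fmt"
	"time"
)

// A freshly registered node is only accepted into the per-node-group cache
// once it has existed long enough for DaemonSet pods to have been
// scheduled onto it.
const nodeCacheGracePeriod = time.Minute

type cachedNode struct {
	name         string
	registeredAt time.Time
}

func acceptIntoCache(n cachedNode, now time.Time) bool {
	return now.Sub(n.registeredAt) >= nodeCacheGracePeriod
}

func main() {
	now := time.Now()
	fresh := cachedNode{name: "new-node", registeredAt: now.Add(-10 * time.Second)}
	settled := cachedNode{name: "old-node", registeredAt: now.Add(-5 * time.Minute)}
	fmt.Println(fresh.name, acceptIntoCache(fresh, now))     // false: still within the grace period
	fmt.Println(settled.name, acceptIntoCache(settled, now)) // true: DaemonSet pods had time to land
}
```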