Commit Graph

138 Commits

Author SHA1 Message Date
Adam Oldak 1e3cede9aa Do not remove healthy nodes from partially failing zero-or-max-scaling node pool scale-ups 2025-08-04 11:44:27 +00:00
Bartłomiej Wróblewski 2c7d8dc378 Rewrite TestCloudProvider to use builder pattern 2025-05-23 12:42:15 +00:00
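The builder rewrite above is about test ergonomics: fixture setup becomes one fluent expression instead of a series of mutations. A minimal Go sketch of the pattern, with hypothetical names rather than the actual TestCloudProvider API:

```go
package main

import "fmt"

// testNodeGroup and testCloudProvider are illustrative stand-ins for the
// fixtures a builder-style TestCloudProvider assembles.
type testNodeGroup struct {
	name     string
	min, max int
}

type testCloudProvider struct {
	nodeGroups []testNodeGroup
}

// testCloudProviderBuilder accumulates configuration and produces the
// provider in one Build() call, so tests read as a single expression.
type testCloudProviderBuilder struct {
	nodeGroups []testNodeGroup
}

func newTestCloudProviderBuilder() *testCloudProviderBuilder {
	return &testCloudProviderBuilder{}
}

func (b *testCloudProviderBuilder) WithNodeGroup(name string, min, max int) *testCloudProviderBuilder {
	b.nodeGroups = append(b.nodeGroups, testNodeGroup{name: name, min: min, max: max})
	return b
}

func (b *testCloudProviderBuilder) Build() *testCloudProvider {
	return &testCloudProvider{nodeGroups: b.nodeGroups}
}

func main() {
	provider := newTestCloudProviderBuilder().
		WithNodeGroup("ng1", 1, 10).
		WithNodeGroup("ng2", 0, 5).
		Build()
	fmt.Println(len(provider.nodeGroups)) // 2
}
```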
Vlad Vasilyeu f32d6cd542 Move DRA provider to autoscaling context. 2025-05-08 09:30:55 +00:00
Maksym Fuhol 6cbf801235 Patch TestCleaningSoftTaintsInScaleDown to be compatible with new isScaleDownInCooldown signature. 2025-04-15 10:02:44 +00:00
Omran dd125d4ef1 Add unit test for cleaning deletion soft taints in scale down cool down 2025-04-09 08:21:49 +00:00
Yuriy Stryuchkov 105429c31e Fix log for node filtering in static autoscaler
Add missing tests
2025-03-19 15:49:34 +01:00
Yahia Naguib 241ad7af1e Update address description 2025-03-10 14:25:44 +00:00
Norbert Cyran aa479c92d7 Fix static autoscaler tests 2025-02-17 11:36:17 +01:00
Kuba Tużnik a45e6b7003 CA: implement DRA integration tests for StaticAutoscaler 2024-12-20 13:30:36 +01:00
Kuba Tużnik 55388f1136 CA: plumb the DRA provider to SetClusterState callsites, grab and pass DRA snapshot
The new logic is flag-guarded, it should be a no-op if DRA is disabled.
2024-12-20 13:30:36 +01:00
Kuba Tużnik a35f830f1d CA: extract a Handle to schedulerframework.Framework out of PredicateChecker
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.

This commit also fixes how CA integrates with Framework metrics. Instead
of Registering them, they're only Initialized, so that CA doesn't expose
scheduler metrics. The initialization is also moved from multiple
different places to the Handle constructor.
2024-12-03 16:47:54 +01:00
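The decoupling described above comes down to initializing the scheduler framework once and handing a shared reference to each checker. A rough sketch of that shape (simplified; the real Handle wraps the actual scheduler framework):

```go
package main

import "fmt"

// frameworkHandle stands in for the shared, initialized scheduler framework.
// Per the commit above, metrics are Initialized (not Registered) exactly
// once, in the constructor.
type frameworkHandle struct{ initialized bool }

func newFrameworkHandle() *frameworkHandle {
	// One-time, expensive framework initialization lives here.
	return &frameworkHandle{initialized: true}
}

// predicateChecker no longer owns framework setup; it borrows the handle.
type predicateChecker struct{ fw *frameworkHandle }

func newPredicateChecker(fw *frameworkHandle) *predicateChecker {
	return &predicateChecker{fw: fw}
}

func main() {
	handle := newFrameworkHandle()
	// Multiple checkers can now share one initialized framework.
	a, b := newPredicateChecker(handle), newPredicateChecker(handle)
	fmt.Println(a.fw == b.fw) // true: same underlying framework
}
```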
Kuba Tużnik eb26816ce9 CA: refactor utils related to NodeInfos
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.

DeepCopyNodeInfo is changed to be a framework.NodeInfo method.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
2024-11-27 12:51:30 +01:00
Bartłomiej Wróblewski 15803158ed Split removeOldUnregisteredNodes method 2024-11-18 16:37:03 +00:00
Bartłomiej Wróblewski a0bf1082b5 Add flag to force remove long unregistered nodes 2024-11-18 13:55:15 +00:00
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
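The key mechanical change here is that the internal wrapper exposes pods through a method rather than a struct field, which lets it carry CA-specific data alongside the scheduler's own types. A simplified illustration (the real wrappers live in CA's framework package; these names are stand-ins):

```go
package main

import "fmt"

// schedPodInfo / schedNodeInfo stand in for the schedulerframework types.
type schedPodInfo struct{ name string }
type schedNodeInfo struct{ pods []*schedPodInfo }

// podInfo and nodeInfo sketch CA's internal wrappers: they embed the
// scheduler data but can attach extra, CA-only fields next to it.
type podInfo struct {
	*schedPodInfo
	// extra CA-side data (e.g. DRA objects) would hang off here
}

type nodeInfo struct {
	sched *schedNodeInfo
	pods  []*podInfo
}

// Pods is now a method call instead of direct field access, so callers are
// insulated from how the wrapper stores its data.
func (n *nodeInfo) Pods() []*podInfo { return n.pods }

func main() {
	n := &nodeInfo{
		sched: &schedNodeInfo{},
		pods:  []*podInfo{{schedPodInfo: &schedPodInfo{name: "p1"}}},
	}
	for _, p := range n.Pods() { // method call, not an n.Pods field
		fmt.Println(p.name)
	}
}
```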
Omran f945fc4add Modify scale down set processor to add reasons to unremovable nodes 2024-10-29 10:28:37 +00:00
Bartłomiej Wróblewski 068ce78272 Register scheduler metrics 2024-10-23 16:47:34 +00:00
Daniel Kłobuszewski 7f30a7b8d1 Remove legacy scale down code 2024-08-28 11:07:40 +02:00
Omran e30bf14730 Add upcoming node groups state checker 2024-08-22 07:42:38 +00:00
Beata Lach (Skiba) 939123cb69 Do not break the loop after removing failed scale up nodes
Clean up cluster state after removing failed scale up nodes,
so that the loop can continue. Most importantly, update the
target for the affected node group, so that the deleted nodes
are not considered upcoming.
2024-08-12 09:45:25 +00:00
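The fix hinges on adjusting the node group's recorded target right after the failed nodes are removed, so the next iteration no longer counts them as upcoming. A minimal sketch of the idea, with hypothetical names:

```go
package main

import "fmt"

// Illustrative state: after deleting nodes that failed to come up, shrink
// the recorded target so the deleted nodes stop counting as "upcoming".
type nodeGroupState struct {
	target  int // desired size the autoscaler believes is requested
	running int // nodes actually registered
}

func removeFailedScaleUpNodes(s *nodeGroupState, failed int) {
	// Node deletion itself is omitted. Without the target update below,
	// upcoming = target - running would still include the deleted nodes,
	// and the loop would have to break instead of continuing.
	s.target -= failed
}

func main() {
	s := &nodeGroupState{target: 5, running: 3}
	removeFailedScaleUpNodes(s, 2)
	fmt.Println("upcoming:", s.target-s.running) // 0, the loop can continue
}
```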
Beata Lach (Skiba) 14b33573f7 Extract getting nodes to delete for atomic node groups
Extract logic for overriding nodes to delete when deleting
nodes from a ZeroOrMaxNodeScaling node group.
Simplifies the code and removes code duplication.
2024-07-30 12:24:41 +00:00
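The override exists because a ZeroOrMaxNodeScaling group is atomic: it can only run at size zero or at its maximum, so deleting any of its nodes implies deleting all of them. A hedged sketch of the extracted helper (hypothetical names, inferred from the commit description):

```go
package main

import "fmt"

type nodeGroup struct {
	name             string
	zeroOrMaxScaling bool // atomic: the group only runs at 0 or max size
	nodes            []string
}

// nodesToDelete overrides the requested set for atomic node groups: since
// the group cannot run at a partial size, deleting any node means scaling
// the whole group down to zero.
func nodesToDelete(ng nodeGroup, requested []string) []string {
	if ng.zeroOrMaxScaling {
		return ng.nodes
	}
	return requested
}

func main() {
	ng := nodeGroup{name: "ng1", zeroOrMaxScaling: true, nodes: []string{"a", "b", "c"}}
	fmt.Println(nodesToDelete(ng, []string{"a"})) // [a b c]
}
```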
Beata Lach (Skiba) 9ed9b46137 Return nodes with create errors by node group id
In order to simplify the deleteNodesWithErrors code, return nodeGroupID
as well as nodes with create errors. That way we avoid the additional
node group matching code.
2024-07-17 11:21:43 +00:00
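Returning the results already keyed by node group ID removes the follow-up matching step the commit mentions. A minimal sketch with illustrative types:

```go
package main

import "fmt"

type node struct {
	name      string
	groupID   string
	createErr bool
}

// nodesWithCreateErrorsByGroup groups failed nodes by node group ID up
// front, so callers no longer re-match nodes to groups themselves.
func nodesWithCreateErrorsByGroup(nodes []node) map[string][]node {
	out := map[string][]node{}
	for _, n := range nodes {
		if n.createErr {
			out[n.groupID] = append(out[n.groupID], n)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{name: "a", groupID: "ng1", createErr: true},
		{name: "b", groupID: "ng1"},
		{name: "c", groupID: "ng2", createErr: true},
	}
	fmt.Println(nodesWithCreateErrorsByGroup(nodes))
	// map[ng1:[{a ng1 true}] ng2:[{c ng2 true}]]
}
```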
Maksym Fuhol bed505891c Refactor StartDeletion usage patterns and enforce periodic scaledown status processor calls. 2024-03-15 20:31:21 +00:00
Yaroslava Serdiuk 5286b3f770 Add ProvisioningRequestProcessor (#6488) 2024-02-14 05:14:46 -08:00
Artur Żyliński 399b16e53c Move estimatorBuilder from AutoscalingContext to Orchestrator Init 2024-02-08 15:18:39 +01:00
Kubernetes Prow Robot 380259421e Merge pull request #6273 from fische/fix-taint-unselected-node
Stop (un)tainting nodes from unselected node groups.
2024-02-06 06:09:48 -08:00
maxime e8e3ad0b1f Move to table-based tests. 2024-01-22 11:24:40 +00:00
vadasambar 5de49a11fb feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move processor initialization logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2024-01-11 21:46:42 +05:30
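The long series above converges on an observer pattern: components that care about scaling events implement `NodeGroupChangeObserver`, register via `RegisterForNotifications()`, and a thread-safe notifier fans events out. A condensed sketch of that shape (signatures simplified; the real interface also has failed-scale-up/down registration methods):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeGroupChangeObserver is a simplified version of the observer interface
// the commits above describe.
type nodeGroupChangeObserver interface {
	RegisterScaleUp(nodeGroup string, delta int, currentTime time.Time)
	RegisterScaleDown(nodeGroup, nodeName string, currentTime time.Time)
}

// scaleStateNotifier fans notifications out to every registered observer.
// A single mutex keeps registration and notification thread-safe (the
// commit log notes a TODO to consider finer-grained locking).
type scaleStateNotifier struct {
	mu        sync.Mutex
	observers []nodeGroupChangeObserver
}

func (n *scaleStateNotifier) RegisterForNotifications(o nodeGroupChangeObserver) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.observers = append(n.observers, o)
}

func (n *scaleStateNotifier) RegisterScaleUp(ng string, delta int, now time.Time) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, o := range n.observers {
		o.RegisterScaleUp(ng, delta, now)
	}
}

// delayProcessor is a stand-in for ScaleDownCandidatesDelayProcessor: it
// tracks the last scale-up per group to keep such groups out of scale-down.
type delayProcessor struct{ lastScaleUp map[string]time.Time }

func (d *delayProcessor) RegisterScaleUp(ng string, _ int, now time.Time) {
	d.lastScaleUp[ng] = now
}
func (d *delayProcessor) RegisterScaleDown(string, string, time.Time) {}

func main() {
	notifier := &scaleStateNotifier{}
	dp := &delayProcessor{lastScaleUp: map[string]time.Time{}}
	notifier.RegisterForNotifications(dp)
	notifier.RegisterScaleUp("ng1", 2, time.Now())
	fmt.Println(len(dp.lastScaleUp)) // 1
}
```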
maxime 90e24ac528 Merge remote-tracking branch 'origin' into fix-taint-unselected-node 2023-12-12 17:58:58 +00:00
maxime 5f6eedc8f7 Split new test into a separate function. 2023-12-12 17:52:29 +00:00
maxime 368441dcd0 Fix compilation errors. 2023-12-12 17:16:01 +00:00
maxime cc6ecec2d3 Remove debug line. 2023-12-12 11:26:23 +00:00
Mahmoud Atwa 5115f1263e Update static_autoscaler tests & handle pod list processors errors as warnings 2023-11-22 11:19:19 +00:00
Mahmoud Atwa a1ae4d3b57 Update flags, improve test readability & use Bypass instead of ignore in naming 2023-11-22 11:18:55 +00:00
Mahmoud Atwa 4635a6dc04 Allow users to specify which schedulers to ignore 2023-11-22 11:18:44 +00:00
Mahmoud Atwa cfbfaa271a Add new test for new behaviour and revert changes made to other tests 2023-11-22 11:18:44 +00:00
Mahmoud Atwa a1ab7b9e20 Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat pods that have not yet been processed as unschedulable
2023-11-22 11:16:33 +00:00
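Pod list processors like these form a chain that filters or rewrites the unschedulable pod list before scale-up logic runs. A rough sketch of one such processor, simplified from CA's actual PodListProcessor interface:

```go
package main

import "fmt"

type pod struct {
	name       string
	expendable bool // e.g. priority below the expendable cutoff
}

// podListProcessor mirrors the shape of CA's pod list processors: it takes
// the current unschedulable pods and returns a (possibly filtered) list.
type podListProcessor interface {
	Process(unschedulable []pod) ([]pod, error)
}

// filterOutExpendable drops expendable pods so they never trigger scale-up,
// matching the behaviour the commit above moves into a processor.
type filterOutExpendable struct{}

func (filterOutExpendable) Process(unschedulable []pod) ([]pod, error) {
	var kept []pod
	for _, p := range unschedulable {
		if !p.expendable {
			kept = append(kept, p)
		}
	}
	return kept, nil
}

func main() {
	pods := []pod{{name: "web"}, {name: "batch", expendable: true}}
	kept, _ := filterOutExpendable{}.Process(pods)
	fmt.Println(kept) // [{web false}]
}
```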
Maxime Fischer 91477aca4a Add test for node from unselected node group. 2023-11-13 22:37:25 +00:00
Artem Minyaylov 324a33ede8 Pass DeleteOptions once during default rule creation 2023-10-10 20:35:49 +00:00
Artem Minyaylov a68b748fd7 Refactor NodeDeleteOptions for use in drainability rules 2023-09-29 17:55:19 +00:00
Piotr ffe6537163 Unifies pod listing. 2023-09-11 17:11:11 +00:00
Bartłomiej Wróblewski 14655d219f Remove the MaxNodeProvisioningTimeProvider interface 2023-08-05 11:26:40 +00:00
Karol Wychowaniec 80053f6eca Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes 2023-08-03 08:44:46 +00:00
vadasambar eff7888f10 refactor: use `actuatorNodeGroupConfigGetter` param in `NewActuator`
- instead of passing all the processors (we only need `NodeGroupConfigProcessor`)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-07-06 10:48:58 +05:30
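Accepting a narrow, purpose-built interface instead of the full processor set is idiomatic Go: the actuator declares exactly what it needs. Sketched below with simplified types; only the interface name and method come from the commits here:

```go
package main

import "fmt"

// actuatorNodeGroupConfigGetter exposes only the single method the actuator
// needs from the much larger NodeGroupConfigProcessor.
type actuatorNodeGroupConfigGetter interface {
	GetIgnoreDaemonSetsUtilization(nodeGroup string) (bool, error)
}

// staticGetter is a trivial implementation; tests can supply their own
// without stubbing the whole processor.
type staticGetter struct{ ignore bool }

func (s staticGetter) GetIgnoreDaemonSetsUtilization(string) (bool, error) {
	return s.ignore, nil
}

type actuator struct{ configGetter actuatorNodeGroupConfigGetter }

// newActuator accepts the limiting interface instead of all processors.
func newActuator(g actuatorNodeGroupConfigGetter) *actuator {
	return &actuator{configGetter: g}
}

func main() {
	a := newActuator(staticGetter{ignore: true})
	v, _ := a.configGetter.GetIgnoreDaemonSetsUtilization("ng1")
	fmt.Println(v) // true
}
```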
vadasambar 7941bab214 feat: set `IgnoreDaemonSetsUtilization` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: test cases failing for actuator and scaledown/eligibility
- abstract default values into `config`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename global `IgnoreDaemonSetsUtilization` -> `GlobalIgnoreDaemonSetsUtilization` in code
- there is no change in the flag name
- rename `thresholdGetter` -> `configGetter` and tweak it to accommodate `GetIgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: reset help text for `ignore-daemonsets-utilization` flag
- because per nodegroup override is supported only for AWS ASG tags as of now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add info about overriding `--ignore-daemonsets-utilization` per ASG
- in AWS cloud provider README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use a limiting interface in actuator in place of `NodeGroupConfigProcessor` interface
- to limit the functions that can be used
- since we need it only for `GetIgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: tests failing for actuator
- rename `staticNodeGroupConfigProcessor` -> `MockNodeGroupConfigGetter`
- move `MockNodeGroupConfigGetter` to test/common so that it can be used in different tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: go lint errors for `MockNodeGroupConfigGetter`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `IgnoreDaemonSetsUtilization` in cloud provider dir
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: update node group config processor tests for `IgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: update eligibility test cases for `IgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: run actuation tests for 2 NGS
- one with `IgnoreDaemonSetsUtilization`: `false`
- one with `IgnoreDaemonSetsUtilization`: `true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `IgnoreDaemonSetsUtilization` in actuator
- add helper to generate multiple ds pods dynamically
- get rid of mock config processor because it is not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing tests for actuator
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `GlobalIgnoreDaemonSetUtilization` autoscaling option
- not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: warn message `DefaultScaleDownUnreadyTimeKey` -> `DefaultIgnoreDaemonSetsUtilizationKey`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `generateDsPods` instead of `generateDsPod`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `globaIgnoreDaemonSetsUtilization` -> `ignoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-07-06 10:31:45 +05:30
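The per-nodegroup override follows the usual NodeGroupAutoscalingOptions pattern: resolve the node group's own setting if present, otherwise fall back to the global flag. A hedged sketch of that lookup:

```go
package main

import "fmt"

// autoscalingOptions is a cut-down stand-in for per-node-group options;
// a nil field means "not overridden, use the global default".
type autoscalingOptions struct {
	IgnoreDaemonSetsUtilization *bool
}

type nodeGroup struct {
	name string
	opts *autoscalingOptions
}

// ignoreDaemonSetsUtilization resolves the effective value: the node group's
// override (e.g. from an AWS ASG tag, per the commit above) wins; otherwise
// the global flag applies.
func ignoreDaemonSetsUtilization(ng nodeGroup, globalDefault bool) bool {
	if ng.opts != nil && ng.opts.IgnoreDaemonSetsUtilization != nil {
		return *ng.opts.IgnoreDaemonSetsUtilization
	}
	return globalDefault
}

func main() {
	t := true
	overridden := nodeGroup{name: "ng1", opts: &autoscalingOptions{IgnoreDaemonSetsUtilization: &t}}
	plain := nodeGroup{name: "ng2"}
	fmt.Println(ignoreDaemonSetsUtilization(overridden, false)) // true
	fmt.Println(ignoreDaemonSetsUtilization(plain, false))      // false
}
```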
Daniel Gutowski 5fed449792 Add ClusterStateRegistry to the AutoscalingContext.
Due to the MaxNodeProvisionTimeProvider's dependency on the context, the
provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
2023-07-04 05:00:09 -07:00
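Because the provider needs the context, and the registry is built before the context exists, the registry receives the provider in a second step. A minimal sketch of that two-phase wiring (hypothetical names, including the setter):

```go
package main

import (
	"fmt"
	"time"
)

type autoscalingContext struct{ maxNodeProvisionTime time.Duration }

// maxNodeProvisionTimeProvider depends on the context, so it can only be
// constructed after the context is.
type maxNodeProvisionTimeProvider struct{ ctx *autoscalingContext }

func (p *maxNodeProvisionTimeProvider) GetMaxNodeProvisionTime() time.Duration {
	return p.ctx.maxNodeProvisionTime
}

// clusterStateRegistry is constructed first, then gets the provider via a
// setter once the context (and thus the provider) exists.
type clusterStateRegistry struct {
	provider *maxNodeProvisionTimeProvider
}

func (r *clusterStateRegistry) RegisterProvider(p *maxNodeProvisionTimeProvider) {
	r.provider = p
}

func main() {
	registry := &clusterStateRegistry{} // phase 1: no context yet
	ctx := &autoscalingContext{maxNodeProvisionTime: 15 * time.Minute}
	registry.RegisterProvider(&maxNodeProvisionTimeProvider{ctx: ctx}) // phase 2
	fmt.Println(registry.provider.GetMaxNodeProvisionTime())
}
```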
Kubernetes Prow Robot 114a35961a Merge pull request #5705 from damikag/fix-race-condition-between-ca-fetching
bugfix: fix race condition between CA fetching list of scheduled pods…
2023-05-12 05:23:01 -07:00
Damika Gamlath 3b4d6d62b9 bugfix: fix race condition between CA fetching list of scheduled pods and pods being scheduled 2023-05-12 11:53:50 +00:00
Bartłomiej Wróblewski b8d40fdd3c Add status taints option to template creation 2023-04-19 13:55:38 +00:00
Maria Oparka ca088d26c2 Move MaxNodeProvisionTime to NodeGroupAutoscalingOptions 2023-04-19 11:08:20 +02:00