Commit Graph

138 Commits

Author SHA1 Message Date
Adam Oldak 1e3cede9aa Do not remove healthy nodes from partially failing zero-or-max-scaling node pool scale-ups 2025-08-04 11:44:27 +00:00
Bartłomiej Wróblewski 2c7d8dc378 Rewrite TestCloudProvider to use builder pattern 2025-05-23 12:42:15 +00:00
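The builder rewrite above is about test ergonomics: fixture setup becomes one fluent expression instead of a series of mutations. A minimal Go sketch of the pattern, with hypothetical names rather than the actual TestCloudProvider API:

```go
package main

import "fmt"

// testNodeGroup and testCloudProvider are illustrative stand-ins for the
// fixtures a builder-style TestCloudProvider assembles.
type testNodeGroup struct {
	name     string
	min, max int
}

type testCloudProvider struct {
	nodeGroups []testNodeGroup
}

// testCloudProviderBuilder accumulates configuration and produces the
// provider in one Build() call, so tests read as a single expression.
type testCloudProviderBuilder struct {
	nodeGroups []testNodeGroup
}

func newTestCloudProviderBuilder() *testCloudProviderBuilder {
	return &testCloudProviderBuilder{}
}

func (b *testCloudProviderBuilder) WithNodeGroup(name string, min, max int) *testCloudProviderBuilder {
	b.nodeGroups = append(b.nodeGroups, testNodeGroup{name: name, min: min, max: max})
	return b
}

func (b *testCloudProviderBuilder) Build() *testCloudProvider {
	return &testCloudProvider{nodeGroups: b.nodeGroups}
}

func main() {
	provider := newTestCloudProviderBuilder().
		WithNodeGroup("ng1", 1, 10).
		WithNodeGroup("ng2", 0, 5).
		Build()
	fmt.Println(len(provider.nodeGroups)) // 2
}
```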
Vlad Vasilyeu f32d6cd542 Move DRA provider to autoscaling context. 2025-05-08 09:30:55 +00:00
Maksym Fuhol 6cbf801235 Patch TestCleaningSoftTaintsInScaleDown to be compatible with new isScaleDownInCooldown signature. 2025-04-15 10:02:44 +00:00
Omran dd125d4ef1 Add unit test for cleaning deletion soft taints in scale down cool down 2025-04-09 08:21:49 +00:00
Yuriy Stryuchkov 105429c31e Fix log for node filtering in static autoscaler
Add missing tests
2025-03-19 15:49:34 +01:00
Yahia Naguib 241ad7af1e Update address description 2025-03-10 14:25:44 +00:00
Norbert Cyran aa479c92d7 Fix static autoscaler tests 2025-02-17 11:36:17 +01:00
Kuba Tużnik a45e6b7003 CA: implement DRA integration tests for StaticAutoscaler 2024-12-20 13:30:36 +01:00
Kuba Tużnik 55388f1136 CA: plumb the DRA provider to SetClusterState callsites, grab and pass DRA snapshot
The new logic is flag-guarded, it should be a no-op if DRA is disabled.
2024-12-20 13:30:36 +01:00
Kuba Tużnik a35f830f1d CA: extract a Handle to schedulerframework.Framework out of PredicateChecker
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.

This commit also fixes how CA integrates with Framework metrics. Instead
of Registering them, they're only Initialized, so that CA doesn't expose
scheduler metrics. The initialization is also moved from multiple
different places to the Handle constructor.
2024-12-03 16:47:54 +01:00
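The decoupling described above comes down to initializing the scheduler framework once and handing a shared reference to each checker. A rough sketch of that shape (simplified; the real Handle wraps the actual scheduler framework):

```go
package main

import "fmt"

// frameworkHandle stands in for the shared, initialized scheduler framework.
// Per the commit above, metrics are Initialized (not Registered) exactly
// once, in the constructor.
type frameworkHandle struct{ initialized bool }

func newFrameworkHandle() *frameworkHandle {
	// One-time, expensive framework initialization lives here.
	return &frameworkHandle{initialized: true}
}

// predicateChecker no longer owns framework setup; it borrows the handle.
type predicateChecker struct{ fw *frameworkHandle }

func newPredicateChecker(fw *frameworkHandle) *predicateChecker {
	return &predicateChecker{fw: fw}
}

func main() {
	handle := newFrameworkHandle()
	// Multiple checkers can now share one initialized framework.
	a, b := newPredicateChecker(handle), newPredicateChecker(handle)
	fmt.Println(a.fw == b.fw) // true: same underlying framework
}
```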
Kuba Tużnik eb26816ce9 CA: refactor utils related to NodeInfos
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.

DeepCopyNodeInfo is changed to be a framework.NodeInfo method.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
2024-11-27 12:51:30 +01:00
Bartłomiej Wróblewski 15803158ed Split removeOldUnregisteredNodes method 2024-11-18 16:37:03 +00:00
Bartłomiej Wróblewski a0bf1082b5 Add flag to force remove long unregistered nodes 2024-11-18 13:55:15 +00:00
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
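The key mechanical change here is that the internal wrapper exposes pods through a method rather than a struct field, which lets it carry CA-specific data alongside the scheduler's own types. A simplified illustration (the real wrappers live in CA's framework package; these names are stand-ins):

```go
package main

import "fmt"

// schedPodInfo / schedNodeInfo stand in for the schedulerframework types.
type schedPodInfo struct{ name string }
type schedNodeInfo struct{ pods []*schedPodInfo }

// podInfo and nodeInfo sketch CA's internal wrappers: they embed the
// scheduler data but can attach extra, CA-only fields next to it.
type podInfo struct {
	*schedPodInfo
	// extra CA-side data (e.g. DRA objects) would hang off here
}

type nodeInfo struct {
	sched *schedNodeInfo
	pods  []*podInfo
}

// Pods is now a method call instead of direct field access, so callers are
// insulated from how the wrapper stores its data.
func (n *nodeInfo) Pods() []*podInfo { return n.pods }

func main() {
	n := &nodeInfo{
		sched: &schedNodeInfo{},
		pods:  []*podInfo{{schedPodInfo: &schedPodInfo{name: "p1"}}},
	}
	for _, p := range n.Pods() { // method call, not an n.Pods field
		fmt.Println(p.name)
	}
}
```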
Omran f945fc4add Modify scale down set processor to add reasons to unremovable nodes 2024-10-29 10:28:37 +00:00
Bartłomiej Wróblewski 068ce78272 Register scheduler metrics 2024-10-23 16:47:34 +00:00
Daniel Kłobuszewski 7f30a7b8d1 Remove legacy scale down code 2024-08-28 11:07:40 +02:00
Omran e30bf14730 Add upcoming node groups state checker 2024-08-22 07:42:38 +00:00
Beata Lach (Skiba) 939123cb69 Do not break the loop after removing failed scale up nodes
Clean up cluster state after removing failed scale up nodes,
so that the loop can continue. Most importantly, update the
target for the affected node group, so that the deleted nodes
are not considered upcoming.
2024-08-12 09:45:25 +00:00
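The fix hinges on adjusting the node group's recorded target right after the failed nodes are removed, so the next iteration no longer counts them as upcoming. A minimal sketch of the idea, with hypothetical names:

```go
package main

import "fmt"

// Illustrative state: after deleting nodes that failed to come up, shrink
// the recorded target so the deleted nodes stop counting as "upcoming".
type nodeGroupState struct {
	target  int // desired size the autoscaler believes is requested
	running int // nodes actually registered
}

func removeFailedScaleUpNodes(s *nodeGroupState, failed int) {
	// Node deletion itself is omitted. Without the target update below,
	// upcoming = target - running would still include the deleted nodes,
	// and the loop would have to break instead of continuing.
	s.target -= failed
}

func main() {
	s := &nodeGroupState{target: 5, running: 3}
	removeFailedScaleUpNodes(s, 2)
	fmt.Println("upcoming:", s.target-s.running) // 0, the loop can continue
}
```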
Beata Lach (Skiba) 14b33573f7 Extract getting nodes to delete for atomic node groups
Extract logic for overriding nodes to delete when deleting
nodes from a ZeroOrMaxNodeScaling node group.
Simplifies the code and removes code duplication.
2024-07-30 12:24:41 +00:00
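The override exists because a ZeroOrMaxNodeScaling group is atomic: it can only run at size zero or at its maximum, so deleting any of its nodes implies deleting all of them. A hedged sketch of the extracted helper (hypothetical names, inferred from the commit description):

```go
package main

import "fmt"

type nodeGroup struct {
	name             string
	zeroOrMaxScaling bool // atomic: the group only runs at 0 or max size
	nodes            []string
}

// nodesToDelete overrides the requested set for atomic node groups: since
// the group cannot run at a partial size, deleting any node means scaling
// the whole group down to zero.
func nodesToDelete(ng nodeGroup, requested []string) []string {
	if ng.zeroOrMaxScaling {
		return ng.nodes
	}
	return requested
}

func main() {
	ng := nodeGroup{name: "ng1", zeroOrMaxScaling: true, nodes: []string{"a", "b", "c"}}
	fmt.Println(nodesToDelete(ng, []string{"a"})) // [a b c]
}
```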
Beata Lach (Skiba) 9ed9b46137 Return nodes with create errors by node group id
In order to simplify the deleteNodesWithErrors code, return nodeGroupID
as well as nodes with create errors. That way we avoid the additional
node group matching code.
2024-07-17 11:21:43 +00:00
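Returning the results already keyed by node group ID removes the follow-up matching step the commit mentions. A minimal sketch with illustrative types:

```go
package main

import "fmt"

type node struct {
	name      string
	groupID   string
	createErr bool
}

// nodesWithCreateErrorsByGroup groups failed nodes by node group ID up
// front, so callers no longer re-match nodes to groups themselves.
func nodesWithCreateErrorsByGroup(nodes []node) map[string][]node {
	out := map[string][]node{}
	for _, n := range nodes {
		if n.createErr {
			out[n.groupID] = append(out[n.groupID], n)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{name: "a", groupID: "ng1", createErr: true},
		{name: "b", groupID: "ng1"},
		{name: "c", groupID: "ng2", createErr: true},
	}
	fmt.Println(nodesWithCreateErrorsByGroup(nodes))
	// map[ng1:[{a ng1 true}] ng2:[{c ng2 true}]]
}
```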
Maksym Fuhol bed505891c Refactor StartDeletion usage patterns and enforce periodic scaledown status processor calls. 2024-03-15 20:31:21 +00:00
Yaroslava Serdiuk 5286b3f770 Add ProvisioningRequestProcessor (#6488) 2024-02-14 05:14:46 -08:00
Artur Żyliński 399b16e53c Move estimatorBuilder from AutoscalingContext to Orchestrator Init 2024-02-08 15:18:39 +01:00
Kubernetes Prow Robot 380259421e Merge pull request #6273 from fische/fix-taint-unselected-node
Stop (un)tainting nodes from unselected node groups.
2024-02-06 06:09:48 -08:00
maxime e8e3ad0b1f Move to table-based tests. 2024-01-22 11:24:40 +00:00
vadasambar 5de49a11fb feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move processor initialization logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2024-01-11 21:46:42 +05:30
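The long series above converges on an observer pattern: components that care about scaling events implement `NodeGroupChangeObserver`, register via `RegisterForNotifications()`, and a thread-safe notifier fans events out. A condensed sketch of that shape (signatures simplified; the real interface also has failed-scale-up/down registration methods):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeGroupChangeObserver is a simplified version of the observer interface
// the commits above describe.
type nodeGroupChangeObserver interface {
	RegisterScaleUp(nodeGroup string, delta int, currentTime time.Time)
	RegisterScaleDown(nodeGroup, nodeName string, currentTime time.Time)
}

// scaleStateNotifier fans notifications out to every registered observer.
// A single mutex keeps registration and notification thread-safe (the
// commit log notes a TODO to consider finer-grained locking).
type scaleStateNotifier struct {
	mu        sync.Mutex
	observers []nodeGroupChangeObserver
}

func (n *scaleStateNotifier) RegisterForNotifications(o nodeGroupChangeObserver) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.observers = append(n.observers, o)
}

func (n *scaleStateNotifier) RegisterScaleUp(ng string, delta int, now time.Time) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, o := range n.observers {
		o.RegisterScaleUp(ng, delta, now)
	}
}

// delayProcessor is a stand-in for ScaleDownCandidatesDelayProcessor: it
// tracks the last scale-up per group to keep such groups out of scale-down.
type delayProcessor struct{ lastScaleUp map[string]time.Time }

func (d *delayProcessor) RegisterScaleUp(ng string, _ int, now time.Time) {
	d.lastScaleUp[ng] = now
}
func (d *delayProcessor) RegisterScaleDown(string, string, time.Time) {}

func main() {
	notifier := &scaleStateNotifier{}
	dp := &delayProcessor{lastScaleUp: map[string]time.Time{}}
	notifier.RegisterForNotifications(dp)
	notifier.RegisterScaleUp("ng1", 2, time.Now())
	fmt.Println(len(dp.lastScaleUp)) // 1
}
```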
maxime 90e24ac528 Merge remote-tracking branch 'origin' into fix-taint-unselected-node 2023-12-12 17:58:58 +00:00
maxime 5f6eedc8f7 Split new test into a separate function. 2023-12-12 17:52:29 +00:00
maxime 368441dcd0 Fix compilation errors. 2023-12-12 17:16:01 +00:00
maxime cc6ecec2d3 Remove debug line. 2023-12-12 11:26:23 +00:00
Mahmoud Atwa 5115f1263e Update static_autoscaler tests & handle pod list processors errors as warnings 2023-11-22 11:19:19 +00:00
Mahmoud Atwa a1ae4d3b57 Update flags, improve test readability & use Bypass instead of ignore in naming 2023-11-22 11:18:55 +00:00
Mahmoud Atwa 4635a6dc04 Allow users to specify which schedulers to ignore 2023-11-22 11:18:44 +00:00
Mahmoud Atwa cfbfaa271a Add new test for new behaviour and revert changes made to other tests 2023-11-22 11:18:44 +00:00
Mahmoud Atwa a1ab7b9e20 Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat pods that have not yet been processed as unschedulable
2023-11-22 11:16:33 +00:00
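Pod list processors like these form a chain that filters or rewrites the unschedulable pod list before scale-up logic runs. A rough sketch of one such processor, simplified from CA's actual PodListProcessor interface:

```go
package main

import "fmt"

type pod struct {
	name       string
	expendable bool // e.g. priority below the expendable cutoff
}

// podListProcessor mirrors the shape of CA's pod list processors: it takes
// the current unschedulable pods and returns a (possibly filtered) list.
type podListProcessor interface {
	Process(unschedulable []pod) ([]pod, error)
}

// filterOutExpendable drops expendable pods so they never trigger scale-up,
// matching the behaviour the commit above moves into a processor.
type filterOutExpendable struct{}

func (filterOutExpendable) Process(unschedulable []pod) ([]pod, error) {
	var kept []pod
	for _, p := range unschedulable {
		if !p.expendable {
			kept = append(kept, p)
		}
	}
	return kept, nil
}

func main() {
	pods := []pod{{name: "web"}, {name: "batch", expendable: true}}
	kept, _ := filterOutExpendable{}.Process(pods)
	fmt.Println(kept) // [{web false}]
}
```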
Maxime Fischer 91477aca4a Add test for node from unselected node group. 2023-11-13 22:37:25 +00:00
Artem Minyaylov 324a33ede8 Pass DeleteOptions once during default rule creation 2023-10-10 20:35:49 +00:00
Artem Minyaylov a68b748fd7 Refactor NodeDeleteOptions for use in drainability rules 2023-09-29 17:55:19 +00:00
Piotr ffe6537163 Unifies pod listing. 2023-09-11 17:11:11 +00:00
Bartłomiej Wróblewski 14655d219f Remove the MaxNodeProvisioningTimeProvider interface 2023-08-05 11:26:40 +00:00
Karol Wychowaniec 80053f6eca Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes 2023-08-03 08:44:46 +00:00
vadasambar eff7888f10 refactor: use `actuatorNodeGroupConfigGetter` param in `NewActuator`
- instead of passing all the processors (we only need `NodeGroupConfigProcessor`)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-07-06 10:48:58 +05:30
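Accepting a narrow, purpose-built interface instead of the full processor set is idiomatic Go: the actuator declares exactly what it needs. Sketched below with simplified types; only the interface name and method come from the commits here:

```go
package main

import "fmt"

// actuatorNodeGroupConfigGetter exposes only the single method the actuator
// needs from the much larger NodeGroupConfigProcessor.
type actuatorNodeGroupConfigGetter interface {
	GetIgnoreDaemonSetsUtilization(nodeGroup string) (bool, error)
}

// staticGetter is a trivial implementation; tests can supply their own
// without stubbing the whole processor.
type staticGetter struct{ ignore bool }

func (s staticGetter) GetIgnoreDaemonSetsUtilization(string) (bool, error) {
	return s.ignore, nil
}

type actuator struct{ configGetter actuatorNodeGroupConfigGetter }

// newActuator accepts the limiting interface instead of all processors.
func newActuator(g actuatorNodeGroupConfigGetter) *actuator {
	return &actuator{configGetter: g}
}

func main() {
	a := newActuator(staticGetter{ignore: true})
	v, _ := a.configGetter.GetIgnoreDaemonSetsUtilization("ng1")
	fmt.Println(v) // true
}
```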
vadasambar 7941bab214 feat: set `IgnoreDaemonSetsUtilization` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: test cases failing for actuator and scaledown/eligibility
- abstract default values into `config`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename global `IgnoreDaemonSetsUtilization` -> `GlobalIgnoreDaemonSetsUtilization` in code
- there is no change in the flag name
- rename `thresholdGetter` -> `configGetter` and tweak it to accommodate `GetIgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: reset help text for `ignore-daemonsets-utilization` flag
- because per nodegroup override is supported only for AWS ASG tags as of now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add info about overriding `--ignore-daemonsets-utilization` per ASG
- in AWS cloud provider README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use a limiting interface in actuator in place of `NodeGroupConfigProcessor` interface
- to limit the functions that can be used
- since we need it only for `GetIgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: tests failing for actuator
- rename `staticNodeGroupConfigProcessor` -> `MockNodeGroupConfigGetter`
- move `MockNodeGroupConfigGetter` to test/common so that it can be used in different tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: go lint errors for `MockNodeGroupConfigGetter`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `IgnoreDaemonSetsUtilization` in cloud provider dir
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: update node group config processor tests for `IgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: update eligibility test cases for `IgnoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: run actuation tests for 2 NGS
- one with `IgnoreDaemonSetsUtilization`: `false`
- one with `IgnoreDaemonSetsUtilization`: `true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `IgnoreDaemonSetsUtilization` in actuator
- add helper to generate multiple ds pods dynamically
- get rid of mock config processor because it is not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing tests for actuator
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `GlobalIgnoreDaemonSetUtilization` autoscaling option
- not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: warn message `DefaultScaleDownUnreadyTimeKey` -> `DefaultIgnoreDaemonSetsUtilizationKey`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `generateDsPods` instead of `generateDsPod`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `globaIgnoreDaemonSetsUtilization` -> `ignoreDaemonSetsUtilization`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-07-06 10:31:45 +05:30
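The per-nodegroup override follows the usual NodeGroupAutoscalingOptions pattern: resolve the node group's own setting if present, otherwise fall back to the global flag. A hedged sketch of that lookup:

```go
package main

import "fmt"

// autoscalingOptions is a cut-down stand-in for per-node-group options;
// a nil field means "not overridden, use the global default".
type autoscalingOptions struct {
	IgnoreDaemonSetsUtilization *bool
}

type nodeGroup struct {
	name string
	opts *autoscalingOptions
}

// ignoreDaemonSetsUtilization resolves the effective value: the node group's
// override (e.g. from an AWS ASG tag, per the commit above) wins; otherwise
// the global flag applies.
func ignoreDaemonSetsUtilization(ng nodeGroup, globalDefault bool) bool {
	if ng.opts != nil && ng.opts.IgnoreDaemonSetsUtilization != nil {
		return *ng.opts.IgnoreDaemonSetsUtilization
	}
	return globalDefault
}

func main() {
	t := true
	overridden := nodeGroup{name: "ng1", opts: &autoscalingOptions{IgnoreDaemonSetsUtilization: &t}}
	plain := nodeGroup{name: "ng2"}
	fmt.Println(ignoreDaemonSetsUtilization(overridden, false)) // true
	fmt.Println(ignoreDaemonSetsUtilization(plain, false))      // false
}
```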
Daniel Gutowski 5fed449792 Add ClusterStateRegistry to the AutoscalingContext.
Due to the MaxNodeProvisionTimeProvider's dependency on the context, the
provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
2023-07-04 05:00:09 -07:00
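Because the provider needs the context, and the registry is built before the context exists, the registry receives the provider in a second step. A minimal sketch of that two-phase wiring (hypothetical names, including the setter):

```go
package main

import (
	"fmt"
	"time"
)

type autoscalingContext struct{ maxNodeProvisionTime time.Duration }

// maxNodeProvisionTimeProvider depends on the context, so it can only be
// constructed after the context is.
type maxNodeProvisionTimeProvider struct{ ctx *autoscalingContext }

func (p *maxNodeProvisionTimeProvider) GetMaxNodeProvisionTime() time.Duration {
	return p.ctx.maxNodeProvisionTime
}

// clusterStateRegistry is constructed first, then gets the provider via a
// setter once the context (and thus the provider) exists.
type clusterStateRegistry struct {
	provider *maxNodeProvisionTimeProvider
}

func (r *clusterStateRegistry) RegisterProvider(p *maxNodeProvisionTimeProvider) {
	r.provider = p
}

func main() {
	registry := &clusterStateRegistry{} // phase 1: no context yet
	ctx := &autoscalingContext{maxNodeProvisionTime: 15 * time.Minute}
	registry.RegisterProvider(&maxNodeProvisionTimeProvider{ctx: ctx}) // phase 2
	fmt.Println(registry.provider.GetMaxNodeProvisionTime())
}
```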
Kubernetes Prow Robot 114a35961a Merge pull request #5705 from damikag/fix-race-condition-between-ca-fetching
bugfix: fix race condition between CA fetching list of scheduled pods…
2023-05-12 05:23:01 -07:00
Damika Gamlath 3b4d6d62b9 bugfix: fix race condition between CA fetching list of scheduled pods and pods being scheduled 2023-05-12 11:53:50 +00:00
Bartłomiej Wróblewski b8d40fdd3c Add status taints option to template creation 2023-04-19 13:55:38 +00:00
Maria Oparka ca088d26c2 Move MaxNodeProvisionTime to NodeGroupAutoscalingOptions 2023-04-19 11:08:20 +02:00