Commit Graph

137 Commits

Author SHA1 Message Date
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
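A minimal sketch of the wrapper described in the commit above. The type shapes, constructor signature, and field names here are assumptions for illustration, not the actual CA code; the point is that the old direct field access (`nodeInfo.Pods`) becomes a method call (`nodeInfo.Pods()`):

```go
package framework

import apiv1 "k8s.io/api/core/v1"

// PodInfo is a minimal stand-in; the real wrapper carries more state.
type PodInfo struct {
	Pod *apiv1.Pod
}

// NodeInfo is the internal wrapper type; code outside simulator/ is meant
// to depend on it rather than on schedulerframework.NodeInfo directly.
type NodeInfo struct {
	node *apiv1.Node
	pods []*PodInfo
}

// NewNodeInfo mirrors the commit's note that constructors differ slightly
// from the schedulerframework ones (this signature is an assumption).
func NewNodeInfo(node *apiv1.Node, pods ...*PodInfo) *NodeInfo {
	return &NodeInfo{node: node, pods: pods}
}

// Pods is the method call that replaces direct access to the old Pods field.
func (n *NodeInfo) Pods() []*PodInfo {
	return n.pods
}
```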
Kubernetes Prow Robot 64a64322d4
Merge pull request #7376 from damikag/cleanup-remove-or-update-logs
Remove/update spamming logs
2024-10-16 13:37:03 +01:00
mendelski 72ec806382
Filter upcoming nodes in clusterstate and scale-up executor 2024-10-15 14:02:23 +00:00
Damika Gamlath e20e5e600b Remove spamming logs in compare_nodegroups.go and filter_out_daemon_sets.go
Change the log level and type of spamming logs in clusterstate.go and pre_filtering_processor.go
2024-10-10 08:48:24 +00:00
Omran e30bf14730
Add upcoming node groups state checker 2024-08-22 07:42:38 +00:00
mendelski c06ec4b324
Add async node group creation 2024-08-20 12:02:01 +00:00
Beata Lach (Skiba) 9ed9b46137 Return nodes with create errors by node group id
In order to simplify the deleteNodesWithErrors code, return nodeGroupID
as well as nodes with create errors. That way we avoid the additional
node group matching code.
2024-07-17 11:21:43 +00:00
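A sketch of the idea in the commit above, with stand-in types and a hypothetical function name: failed instances come back already keyed by node group ID, so the deletion path needs no extra node-group matching.

```go
package clusterstate

// Stand-ins for the real cloudprovider types; only the fields this sketch
// needs are included.
type InstanceErrorInfo struct {
	ErrorMessage string
}

type Instance struct {
	Id        string
	ErrorInfo *InstanceErrorInfo // non-nil when creation failed
}

// nodesWithCreateErrorsByNodeGroupID returns failed instance IDs keyed by
// node group ID; the name and shape are assumptions based on the commit.
func nodesWithCreateErrorsByNodeGroupID(instancesByGroup map[string][]Instance) map[string][]string {
	result := map[string][]string{}
	for nodeGroupID, instances := range instancesByGroup {
		for _, instance := range instances {
			if instance.ErrorInfo != nil {
				result[nodeGroupID] = append(result[nodeGroupID], instance.Id)
			}
		}
	}
	return result
}
```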
Kubernetes Prow Robot 6ca84143b5
Merge pull request #6528 from yarinm/yarinm/expectedToRegister-fix
Fix expectedToRegister to respect instances with nil status
2024-02-22 07:22:38 -08:00
Will Bowers 4477707256 remove RemoveBackoff from updateScaleRequests 2024-02-13 07:12:40 -08:00
Will Bowers 8e867f66c5 revert optionally keeping node group backoff 2024-02-13 07:09:44 -08:00
Will Bowers 00fd3a802c attach errors to scale-up request and add comments 2024-02-13 02:06:01 -08:00
Will Bowers aa1af03862 add option to keep node group backoff on OutOfResource error 2024-02-13 02:04:15 -08:00
Yarin Miran 7128cb795f Fix expectedToRegister to respect instances with nil status 2024-02-13 11:18:01 +02:00
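A sketch of the nil-safety fix above, using stand-in types. The assumption (matching the fix's intent) is that an instance with a nil Status is still expected to register, and must never be dereferenced:

```go
package clusterstate

// Stand-ins for cloudprovider.Instance and friends.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
	InstanceDeleting
)

type InstanceStatus struct {
	State InstanceState
}

type Instance struct {
	Status *InstanceStatus // may legitimately be nil
}

// expectedToRegister guards the nil case first, so instances without a
// status are counted as expected to register rather than causing a panic.
func expectedToRegister(instance Instance) bool {
	return instance.Status == nil || instance.Status.State != InstanceDeleting
}
```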
Guo Peng 68e661f1ed feat: add node group health and backoff metrics 2024-01-23 19:39:18 +08:00
Guo Peng ae0ab53060 feat: add node group health and backoff metrics 2024-01-13 18:58:28 +08:00
guopeng 23843ad4af
Merge branch 'kubernetes:master' into feature/node_group_healthy_metrics 2024-01-13 00:52:57 +08:00
vadasambar 5de49a11fb feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move initializing processors logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cooldown if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2024-01-11 21:46:42 +05:30
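The commit above replaces direct clusterstate coupling with an observer pattern. Below is a sketch of the `NodeGroupChangeObserver` interface and the notifier it describes; the exact method set and signatures are assumptions (the commit notes, for instance, that `expectedAddTime` was dropped from `RegisterScaleUp` and that `Register` became `RegisterForNotifications`):

```go
package nodegroupchange

import (
	"sync"
	"time"
)

// NodeGroupChangeObserver is implemented by anyone interested in scale
// up/down events, e.g. clusterstate and ScaleDownCandidatesDelayProcessor.
type NodeGroupChangeObserver interface {
	RegisterScaleUp(nodeGroupID string, delta int, currentTime time.Time)
	RegisterScaleDown(nodeGroupID string, nodeName string, currentTime time.Time)
	RegisterFailedScaleUp(nodeGroupID string, reason string, currentTime time.Time)
	RegisterFailedScaleDown(nodeGroupID string, reason string, currentTime time.Time)
}

// NodeGroupChangeObserversList fans notifications out to every registered
// observer. A single mutex keeps it thread-safe, matching the commit's TODO
// about possibly splitting it into multiple mutexes later.
type NodeGroupChangeObserversList struct {
	mutex     sync.Mutex
	observers []NodeGroupChangeObserver
}

// RegisterForNotifications adds an observer; the commit renamed this from
// Register to make its purpose clearer.
func (l *NodeGroupChangeObserversList) RegisterForNotifications(o NodeGroupChangeObserver) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	l.observers = append(l.observers, o)
}

// RegisterScaleUp broadcasts a scale-up event to all observers.
func (l *NodeGroupChangeObserversList) RegisterScaleUp(nodeGroupID string, delta int, currentTime time.Time) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	for _, o := range l.observers {
		o.RegisterScaleUp(nodeGroupID, delta, currentTime)
	}
}
```

This keeps static autoscaler unaware of which processors consume scale events: callers notify the list, and each observer tracks timestamps (not booleans) to decide its own per-nodegroup scale-down delay.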
guopeng 849e9e7332
Merge branch 'master' into feature/node_group_healthy_metrics 2024-01-02 12:02:37 +08:00
Guo Peng 89241e40c4 feat:add node group health and back off metrics 2024-01-02 11:50:56 +08:00
Walid Ghallab 4b639932ef Truncate error messages in CA config map to 500 characters per node group.
Max size of a ConfigMap is 1MB.

Change-Id: I615d25781e4f8dafb6a08f752c085544bcd49e5a
2023-12-28 19:04:32 +00:00
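A minimal sketch of the truncation this commit describes: the status ConfigMap is capped at 1MB, so each node group's error message is cut at 500 characters. The constant and function names, and the "..." suffix, are assumptions:

```go
package utils

// The whole status ConfigMap cannot exceed 1MB, so each node group's
// error message is capped; 500 comes from the commit message.
const maxErrorMessageLength = 500

// truncateErrorMessage cuts the message at the limit, marking the cut.
func truncateErrorMessage(message string) string {
	if len(message) <= maxErrorMessageLength {
		return message
	}
	return message[:maxErrorMessageLength-3] + "..."
}
```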
Walid Ghallab 11a084699c Convert status in cluster-autoscaler-status to yaml and add error info for backoff and more node counts.
Change-Id: Ic68e0d67b7ce9912b605b6c0a3356b4d0e177911
2023-12-28 18:52:55 +00:00
Guo Peng 044c03d09f feat:add node group health and back off metrics 2023-12-21 17:59:14 +08:00
Kubernetes Prow Robot f0001eb008
Merge pull request #6327 from walidghallab/autoscaler-status
Add error details to autoscaling backoff.
2023-12-14 15:53:18 +01:00
Walid Ghallab cf6176f80d Add error details to autoscaling backoff.
Change-Id: I3b5c62ba13c2e048ce2d7170016af07182c11eee
2023-12-14 13:45:55 +00:00
Guo Peng eb5ef4bc83 feat: add metrics to show target size of every node group 2023-12-08 23:40:24 +08:00
Aleksandra Gacek 4470430007 Fix klog formatting directives in cluster-autoscaler package. 2023-11-07 16:13:57 +01:00
Hakan Bostan 833e4cbf43 Add HasNodeGroupStartedScaleUp to cluster state registry.
- HasNodeGroupStartedScaleUp checks whether a scale up request exists
  without checking any upcoming nodes.
2023-10-13 08:24:43 +00:00
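A sketch of what that check might look like, trimmed to the one map it needs; the registry's real internal state and locking are assumptions here:

```go
package clusterstate

import "sync"

// ScaleUpRequest is a stand-in; the real struct tracks sizes and times.
type ScaleUpRequest struct {
	Increase int
}

// ClusterStateRegistry is trimmed to the one map this check consults.
type ClusterStateRegistry struct {
	sync.Mutex
	scaleUpRequests map[string]*ScaleUpRequest
}

// HasNodeGroupStartedScaleUp answers only "is a scale-up request tracked
// for this group?", deliberately ignoring any upcoming nodes.
func (csr *ClusterStateRegistry) HasNodeGroupStartedScaleUp(nodeGroupName string) bool {
	csr.Lock()
	defer csr.Unlock()
	_, found := csr.scaleUpRequests[nodeGroupName]
	return found
}
```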
Bartłomiej Wróblewski 14655d219f Remove the MaxNodeProvisioningTimeProvider interface 2023-08-05 11:26:40 +00:00
Karol Wychowaniec 8e621b23c4 Don't pass nil nodes to GetGpuInfoForMetrics 2023-08-04 09:28:34 +00:00
Karol Wychowaniec 80053f6eca Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes 2023-08-03 08:44:46 +00:00
Karol Wychowaniec 2eba540d27 Add metrics for improved observability:
* pending_node_deletions
* failed_gpu_scale_ups_total
2023-07-25 13:01:36 +00:00
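Illustrative registrations for the two metrics named above, written against the plain Prometheus client; the real code may use the k8s.io/component-base/metrics wrappers and carry extra labels:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// pendingNodeDeletions gauges nodes currently awaiting deletion.
	pendingNodeDeletions = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "pending_node_deletions",
		Help:      "Number of nodes currently awaiting deletion.",
	})
	// failedGPUScaleUpsTotal counts failed GPU scale-up attempts.
	failedGPUScaleUpsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "failed_gpu_scale_ups_total",
		Help:      "Total number of failed GPU scale-up attempts.",
	})
)

func init() {
	prometheus.MustRegister(pendingNodeDeletions, failedGPUScaleUpsTotal)
}
```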
Artur Żyliński e5bc070c8c Fix: Do not inject fakeNode for instance which has errors on create 2023-07-17 11:54:30 +02:00
Daniel Gutowski 5fed449792 Add ClusterStateRegistry to the AutoscalingContext.
Due to the dependency of the MaxNodeProvisionTimeProvider on the context,
the provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
2023-07-04 05:00:09 -07:00
Bartłomiej Wróblewski 67d3e7ebc4 Include short unregistered nodes in calculation of incorrect node group
sizes
2023-06-29 10:28:48 +00:00
Maria Oparka ca088d26c2 Move MaxNodeProvisionTime to NodeGroupAutoscalingOptions 2023-04-19 11:08:20 +02:00
Bartłomiej Wróblewski b5ead036a8 Merge taint utils into one package, make taint modifying methods public 2023-02-13 11:29:45 +00:00
Kuba Tużnik 7e6762535b CA: stop passing registered upcoming nodes as scale-down candidates
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
2023-02-10 14:46:19 +01:00
Kuba Tużnik 6978ff8829 CA: Make CSR's Readiness keep lists of node names instead of just their count
This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
2023-02-06 21:13:54 +01:00
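A sketch of the shape change above: Readiness tracks node-name slices rather than counts, so callers that only need a count call len(), while everyone else can act on the concrete node sets. The field set is trimmed and the names are assumptions:

```go
package clusterstate

// Readiness keeps lists of node names instead of just their counts.
type Readiness struct {
	Ready      []string
	Unready    []string
	NotStarted []string
	Registered []string
}

// readyNodeCount shows the len() call that replaces the old int field.
func readyNodeCount(r Readiness) int {
	return len(r.Ready)
}
```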
Clint Fooken 1198fbcd90 Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests. 2022-12-05 12:44:39 -08:00
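The deletedNodes change above is the standard Go set idiom: struct{} carries no data, so membership in the map alone marks a node as deleted. A self-contained example:

```go
package clusterstate

// deletedNodesExample shows map[string]struct{} used as a set: the empty
// struct stores nothing, membership is all that matters.
func deletedNodesExample() bool {
	deletedNodes := map[string]struct{}{}
	deletedNodes["node-1"] = struct{}{}
	_, isDeleted := deletedNodes["node-1"]
	return isDeleted
}
```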
Clint Fooken 08dfc7e20f Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance. 2022-11-04 17:54:05 -07:00
Clint Fooken 7fc1f6be01 Fixing errors due to merge on branches. 2022-10-17 15:45:55 -07:00
Clint cf67a3004e
Implementing new cloud provider method for node deletion detection (#1)
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or not autoscaled. Updated cloud providers to provide an initial implementation of the new method that returns an ErrNotImplemented to maintain the existing taint-based deletion clusterstate calculation.
2022-10-17 14:58:38 -07:00
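A sketch of the provider-side contract these commits converge on (the method was first named isNodeDeleted, later HasInstance per the entry above); the interface here is trimmed to that one method and ErrNotImplemented is defined locally for illustration:

```go
package cloudprovider

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
)

// ErrNotImplemented is the opt-out: a provider that cannot answer returns
// it, and the caller falls back to the older taint-based deletion logic.
var ErrNotImplemented = errors.New("not implemented")

// CloudProvider is trimmed to the one method these commits add; the real
// interface is much larger.
type CloudProvider interface {
	// HasInstance reports whether the node is still backed by a live
	// cloud instance.
	HasInstance(node *apiv1.Node) (bool, error)
}
```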
Clint Fooken 776d7311a1 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.
Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Restoring an erroneously deleted line.
2022-10-17 14:40:01 -07:00
Aleksandra Gacek ab2cc2fb8a Bump k/k dependencies to v1.25.0 together with go.mod go version. 2022-08-26 13:38:07 +02:00
Daniel Kłobuszewski 66bfe55077
Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes" 2022-07-13 10:08:03 +02:00
Clint Fooken a278255519 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts. 2022-05-17 12:37:42 -07:00
weidongcai 03a0475502 Expose backoff time parameters 2022-05-12 15:34:28 +08:00
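A sketch of how the exposed backoff parameters would drive exponential backoff; the config field names mirror the flags' intent but are assumptions here:

```go
package backoff

import "time"

// Config holds the newly exposed knobs.
type Config struct {
	InitialBackoffDuration time.Duration // first backoff after a failure
	MaxBackoffDuration     time.Duration // cap for exponential growth
	BackoffResetTimeout    time.Duration // quiet period after which backoff resets
}

// nextBackoff doubles the current backoff up to the configured cap.
func nextBackoff(cfg Config, current time.Duration) time.Duration {
	if current == 0 {
		return cfg.InitialBackoffDuration
	}
	next := 2 * current
	if next > cfg.MaxBackoffDuration {
		next = cfg.MaxBackoffDuration
	}
	return next
}
```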
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as the Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing GPU resources. Explicitly calling
them out in the status ConfigMap points in the right direction.
2022-03-03 14:59:31 +01:00
Marwan Ahmed 8039af647e move annotations to cloudprovider package 2021-06-08 10:56:35 -07:00
Marwan Ahmed 36460df246 annotate fakeNodes so that cloudprovider implementations can identify them if needed 2021-06-06 13:54:05 -07:00