Commit Graph

137 Commits

Author SHA1 Message Date
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
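A minimal sketch of the wrapper described in the commit above. The type shapes, constructor signature, and field names here are assumptions for illustration, not the actual CA code; the point is that the old direct field access (`nodeInfo.Pods`) becomes a method call (`nodeInfo.Pods()`):

```go
package framework

import apiv1 "k8s.io/api/core/v1"

// PodInfo is a minimal stand-in; the real wrapper carries more state.
type PodInfo struct {
	Pod *apiv1.Pod
}

// NodeInfo is the internal wrapper type; code outside simulator/ is meant
// to depend on it rather than on schedulerframework.NodeInfo directly.
type NodeInfo struct {
	node *apiv1.Node
	pods []*PodInfo
}

// NewNodeInfo mirrors the commit's note that constructors differ slightly
// from the schedulerframework ones (this signature is an assumption).
func NewNodeInfo(node *apiv1.Node, pods ...*PodInfo) *NodeInfo {
	return &NodeInfo{node: node, pods: pods}
}

// Pods is the method call that replaces direct access to the old Pods field.
func (n *NodeInfo) Pods() []*PodInfo {
	return n.pods
}
```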
Kubernetes Prow Robot 64a64322d4
Merge pull request #7376 from damikag/cleanup-remove-or-update-logs
Remove/update spamming logs
2024-10-16 13:37:03 +01:00
mendelski 72ec806382
Filter upcoming nodes in clusterstate and scale-up executor 2024-10-15 14:02:23 +00:00
Damika Gamlath e20e5e600b Remove spamming logs in compare_nodegroups.go and filter_out_daemon_sets.go
Change the log level and type of spamming logs in clusterstate.go and pre_filtering_processor.go
2024-10-10 08:48:24 +00:00
Omran e30bf14730
Add upcoming node groups state checker 2024-08-22 07:42:38 +00:00
mendelski c06ec4b324
Add async node group creation 2024-08-20 12:02:01 +00:00
Beata Lach (Skiba) 9ed9b46137 Return nodes with create errors by node group id
In order to simplify the deleteNodesWithErrors code, return nodeGroupID
as well as nodes with create errors. That way we avoid the additional
node group matching code.
2024-07-17 11:21:43 +00:00
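A sketch of the idea in the commit above, with stand-in types and a hypothetical function name: failed instances come back already keyed by node group ID, so the deletion path needs no extra node-group matching.

```go
package clusterstate

// Stand-ins for the real cloudprovider types; only the fields this sketch
// needs are included.
type InstanceErrorInfo struct {
	ErrorMessage string
}

type Instance struct {
	Id        string
	ErrorInfo *InstanceErrorInfo // non-nil when creation failed
}

// nodesWithCreateErrorsByNodeGroupID returns failed instance IDs keyed by
// node group ID; the name and shape are assumptions based on the commit.
func nodesWithCreateErrorsByNodeGroupID(instancesByGroup map[string][]Instance) map[string][]string {
	result := map[string][]string{}
	for nodeGroupID, instances := range instancesByGroup {
		for _, instance := range instances {
			if instance.ErrorInfo != nil {
				result[nodeGroupID] = append(result[nodeGroupID], instance.Id)
			}
		}
	}
	return result
}
```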
Kubernetes Prow Robot 6ca84143b5
Merge pull request #6528 from yarinm/yarinm/expectedToRegister-fix
Fix expectedToRegister to respect instances with nil status
2024-02-22 07:22:38 -08:00
Will Bowers 4477707256 remove RemoveBackoff from updateScaleRequests 2024-02-13 07:12:40 -08:00
Will Bowers 8e867f66c5 revert optionally keeping node group backoff 2024-02-13 07:09:44 -08:00
Will Bowers 00fd3a802c attach errors to scale-up request and add comments 2024-02-13 02:06:01 -08:00
Will Bowers aa1af03862 add option to keep node group backoff on OutOfResource error 2024-02-13 02:04:15 -08:00
Yarin Miran 7128cb795f Fix expectedToRegister to respect instances with nil status 2024-02-13 11:18:01 +02:00
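A sketch of the nil-safety fix above, using stand-in types. The assumption (matching the fix's intent) is that an instance with a nil Status is still expected to register, and must never be dereferenced:

```go
package clusterstate

// Stand-ins for cloudprovider.Instance and friends.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
	InstanceDeleting
)

type InstanceStatus struct {
	State InstanceState
}

type Instance struct {
	Status *InstanceStatus // may legitimately be nil
}

// expectedToRegister guards the nil case first, so instances without a
// status are counted as expected to register rather than causing a panic.
func expectedToRegister(instance Instance) bool {
	return instance.Status == nil || instance.Status.State != InstanceDeleting
}
```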
Guo Peng 68e661f1ed feat: add node group health and backoff metrics 2024-01-23 19:39:18 +08:00
Guo Peng ae0ab53060 feat: add node group health and backoff metrics 2024-01-13 18:58:28 +08:00
guopeng 23843ad4af
Merge branch 'kubernetes:master' into feature/node_group_healthy_metrics 2024-01-13 00:52:57 +08:00
vadasambar 5de49a11fb feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move initializing processors logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cooldown if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2024-01-11 21:46:42 +05:30
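The commit above replaces direct clusterstate coupling with an observer pattern. Below is a sketch of the `NodeGroupChangeObserver` interface and the notifier it describes; the exact method set and signatures are assumptions (the commit notes, for instance, that `expectedAddTime` was dropped from `RegisterScaleUp` and that `Register` became `RegisterForNotifications`):

```go
package nodegroupchange

import (
	"sync"
	"time"
)

// NodeGroupChangeObserver is implemented by anyone interested in scale
// up/down events, e.g. clusterstate and ScaleDownCandidatesDelayProcessor.
type NodeGroupChangeObserver interface {
	RegisterScaleUp(nodeGroupID string, delta int, currentTime time.Time)
	RegisterScaleDown(nodeGroupID string, nodeName string, currentTime time.Time)
	RegisterFailedScaleUp(nodeGroupID string, reason string, currentTime time.Time)
	RegisterFailedScaleDown(nodeGroupID string, reason string, currentTime time.Time)
}

// NodeGroupChangeObserversList fans notifications out to every registered
// observer. A single mutex keeps it thread-safe, matching the commit's TODO
// about possibly splitting it into multiple mutexes later.
type NodeGroupChangeObserversList struct {
	mutex     sync.Mutex
	observers []NodeGroupChangeObserver
}

// RegisterForNotifications adds an observer; the commit renamed this from
// Register to make its purpose clearer.
func (l *NodeGroupChangeObserversList) RegisterForNotifications(o NodeGroupChangeObserver) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	l.observers = append(l.observers, o)
}

// RegisterScaleUp broadcasts a scale-up event to all observers.
func (l *NodeGroupChangeObserversList) RegisterScaleUp(nodeGroupID string, delta int, currentTime time.Time) {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	for _, o := range l.observers {
		o.RegisterScaleUp(nodeGroupID, delta, currentTime)
	}
}
```

This keeps static autoscaler unaware of which processors consume scale events: callers notify the list, and each observer tracks timestamps (not booleans) to decide its own per-nodegroup scale-down delay.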
guopeng 849e9e7332
Merge branch 'master' into feature/node_group_healthy_metrics 2024-01-02 12:02:37 +08:00
Guo Peng 89241e40c4 feat:add node group health and back off metrics 2024-01-02 11:50:56 +08:00
Walid Ghallab 4b639932ef Truncate error messages in CA config map to 500 characters per node group.
Max size of a ConfigMap is 1MB.

Change-Id: I615d25781e4f8dafb6a08f752c085544bcd49e5a
2023-12-28 19:04:32 +00:00
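A minimal sketch of the truncation this commit describes: the status ConfigMap is capped at 1MB, so each node group's error message is cut at 500 characters. The constant and function names, and the "..." suffix, are assumptions:

```go
package utils

// The whole status ConfigMap cannot exceed 1MB, so each node group's
// error message is capped; 500 comes from the commit message.
const maxErrorMessageLength = 500

// truncateErrorMessage cuts the message at the limit, marking the cut.
func truncateErrorMessage(message string) string {
	if len(message) <= maxErrorMessageLength {
		return message
	}
	return message[:maxErrorMessageLength-3] + "..."
}
```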
Walid Ghallab 11a084699c Convert status in cluster-autoscaler-status to yaml and add error info for backoff and more node counts.
Change-Id: Ic68e0d67b7ce9912b605b6c0a3356b4d0e177911
2023-12-28 18:52:55 +00:00
Guo Peng 044c03d09f feat:add node group health and back off metrics 2023-12-21 17:59:14 +08:00
Kubernetes Prow Robot f0001eb008
Merge pull request #6327 from walidghallab/autoscaler-status
Add error details to autoscaling backoff.
2023-12-14 15:53:18 +01:00
Walid Ghallab cf6176f80d Add error details to autoscaling backoff.
Change-Id: I3b5c62ba13c2e048ce2d7170016af07182c11eee
2023-12-14 13:45:55 +00:00
Guo Peng eb5ef4bc83 feat: add metrics to show target size of every node group 2023-12-08 23:40:24 +08:00
Aleksandra Gacek 4470430007 Fix klog formatting directives in cluster-autoscaler package. 2023-11-07 16:13:57 +01:00
Hakan Bostan 833e4cbf43 Add HasNodeGroupStartedScaleUp to cluster state registry.
- HasNodeGroupStartedScaleUp checks whether a scale up request exists
  without checking any upcoming nodes.
2023-10-13 08:24:43 +00:00
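A sketch of what that check might look like, trimmed to the one map it needs; the registry's real internal state and locking are assumptions here:

```go
package clusterstate

import "sync"

// ScaleUpRequest is a stand-in; the real struct tracks sizes and times.
type ScaleUpRequest struct {
	Increase int
}

// ClusterStateRegistry is trimmed to the one map this check consults.
type ClusterStateRegistry struct {
	sync.Mutex
	scaleUpRequests map[string]*ScaleUpRequest
}

// HasNodeGroupStartedScaleUp answers only "is a scale-up request tracked
// for this group?", deliberately ignoring any upcoming nodes.
func (csr *ClusterStateRegistry) HasNodeGroupStartedScaleUp(nodeGroupName string) bool {
	csr.Lock()
	defer csr.Unlock()
	_, found := csr.scaleUpRequests[nodeGroupName]
	return found
}
```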
Bartłomiej Wróblewski 14655d219f Remove the MaxNodeProvisioningTimeProvider interface 2023-08-05 11:26:40 +00:00
Karol Wychowaniec 8e621b23c4 Don't pass nil nodes to GetGpuInfoForMetrics 2023-08-04 09:28:34 +00:00
Karol Wychowaniec 80053f6eca Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes 2023-08-03 08:44:46 +00:00
Karol Wychowaniec 2eba540d27 Add metrics for improved observability:
* pending_node_deletions
* failed_gpu_scale_ups_total
2023-07-25 13:01:36 +00:00
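Illustrative registrations for the two metrics named above, written against the plain Prometheus client; the real code may use the k8s.io/component-base/metrics wrappers and carry extra labels:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// pendingNodeDeletions gauges nodes currently awaiting deletion.
	pendingNodeDeletions = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "pending_node_deletions",
		Help:      "Number of nodes currently awaiting deletion.",
	})
	// failedGPUScaleUpsTotal counts failed GPU scale-up attempts.
	failedGPUScaleUpsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "failed_gpu_scale_ups_total",
		Help:      "Total number of failed GPU scale-up attempts.",
	})
)

func init() {
	prometheus.MustRegister(pendingNodeDeletions, failedGPUScaleUpsTotal)
}
```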
Artur Żyliński e5bc070c8c Fix: Do not inject fakeNode for instance which has errors on create 2023-07-17 11:54:30 +02:00
Daniel Gutowski 5fed449792 Add ClusterStateRegistry to the AutoscalingContext.
Due to the dependency of the MaxNodeProvisionTimeProvider on the context,
the provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
2023-07-04 05:00:09 -07:00
Bartłomiej Wróblewski 67d3e7ebc4 Include short unregistered nodes in calculation of incorrect node group
sizes
2023-06-29 10:28:48 +00:00
Maria Oparka ca088d26c2 Move MaxNodeProvisionTime to NodeGroupAutoscalingOptions 2023-04-19 11:08:20 +02:00
Bartłomiej Wróblewski b5ead036a8 Merge taint utils into one package, make taint modifying methods public 2023-02-13 11:29:45 +00:00
Kuba Tużnik 7e6762535b CA: stop passing registered upcoming nodes as scale-down candidates
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
2023-02-10 14:46:19 +01:00
Kuba Tużnik 6978ff8829 CA: Make CSR's Readiness keep lists of node names instead of just their count
This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
2023-02-06 21:13:54 +01:00
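A sketch of the shape change above: Readiness tracks node-name slices rather than counts, so callers that only need a count call len(), while everyone else can act on the concrete node sets. The field set is trimmed and the names are assumptions:

```go
package clusterstate

// Readiness keeps lists of node names instead of just their counts.
type Readiness struct {
	Ready      []string
	Unready    []string
	NotStarted []string
	Registered []string
}

// readyNodeCount shows the len() call that replaces the old int field.
func readyNodeCount(r Readiness) int {
	return len(r.Ready)
}
```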
Clint Fooken 1198fbcd90 Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests. 2022-12-05 12:44:39 -08:00
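The deletedNodes change above is the standard Go set idiom: struct{} carries no data, so membership in the map alone marks a node as deleted. A self-contained example:

```go
package clusterstate

// deletedNodesExample shows map[string]struct{} used as a set: the empty
// struct stores nothing, membership is all that matters.
func deletedNodesExample() bool {
	deletedNodes := map[string]struct{}{}
	deletedNodes["node-1"] = struct{}{}
	_, isDeleted := deletedNodes["node-1"]
	return isDeleted
}
```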
Clint Fooken 08dfc7e20f Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance. 2022-11-04 17:54:05 -07:00
Clint Fooken 7fc1f6be01 Fixing errors due to merge on branches. 2022-10-17 15:45:55 -07:00
Clint cf67a3004e
Implementing new cloud provider method for node deletion detection (#1)
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or not autoscaled. Updated cloud providers to provide an initial implementation of the new method that returns an ErrNotImplemented to maintain the existing taint-based deletion clusterstate calculation.
2022-10-17 14:58:38 -07:00
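A sketch of the provider-side contract these commits converge on (the method was first named isNodeDeleted, later HasInstance per the entry above); the interface here is trimmed to that one method and ErrNotImplemented is defined locally for illustration:

```go
package cloudprovider

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
)

// ErrNotImplemented is the opt-out: a provider that cannot answer returns
// it, and the caller falls back to the older taint-based deletion logic.
var ErrNotImplemented = errors.New("not implemented")

// CloudProvider is trimmed to the one method these commits add; the real
// interface is much larger.
type CloudProvider interface {
	// HasInstance reports whether the node is still backed by a live
	// cloud instance.
	HasInstance(node *apiv1.Node) (bool, error)
}
```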
Clint Fooken 776d7311a1 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.
Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Restoring an erroneously deleted line.
2022-10-17 14:40:01 -07:00
Aleksandra Gacek ab2cc2fb8a Bump k/k dependencies to v1.25.0 together with go.mod go version. 2022-08-26 13:38:07 +02:00
Daniel Kłobuszewski 66bfe55077
Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes" 2022-07-13 10:08:03 +02:00
Clint Fooken a278255519 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts. 2022-05-17 12:37:42 -07:00
weidongcai 03a0475502 Expose backoff time parameters 2022-05-12 15:34:28 +08:00
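A sketch of how the exposed backoff parameters would drive exponential backoff; the config field names mirror the flags' intent but are assumptions here:

```go
package backoff

import "time"

// Config holds the newly exposed knobs.
type Config struct {
	InitialBackoffDuration time.Duration // first backoff after a failure
	MaxBackoffDuration     time.Duration // cap for exponential growth
	BackoffResetTimeout    time.Duration // quiet period after which backoff resets
}

// nextBackoff doubles the current backoff up to the configured cap.
func nextBackoff(cfg Config, current time.Duration) time.Duration {
	if current == 0 {
		return cfg.InitialBackoffDuration
	}
	next := 2 * current
	if next > cfg.MaxBackoffDuration {
		next = cfg.MaxBackoffDuration
	}
	return next
}
```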
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as the Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing GPU resources. Explicitly calling
them out in the status ConfigMap points in the right direction.
2022-03-03 14:59:31 +01:00
Marwan Ahmed 8039af647e move annotations to cloudprovider package 2021-06-08 10:56:35 -07:00
Marwan Ahmed 36460df246 annotate fakeNodes so that cloudprovider implementations can identify them if needed 2021-06-06 13:54:05 -07:00