There is a race condition between the node group size and the deletion operation. Updating the actuation status before the cache refresh should change the incorrect behavior from the node group size going below its minimum value to, in the worst case, requiring an additional loop before scale-down.
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 144a64a402)
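A minimal sketch of the reordering described in the commit above; the function names are assumptions made purely for illustration, not the actual autoscaler code.

```go
package main

import "fmt"

// The actuation status, which records in-flight deletions, is updated
// before the node group cache is refreshed, so a size read taken afterwards
// can no longer drop the node group below its minimum; at worst, scale-down
// needs one more loop iteration before it can proceed.
func runIteration() {
	updateActuationStatus() // record in-flight deletions first ...
	refreshCache()          // ... then refresh cached node group sizes
}

func updateActuationStatus() { fmt.Println("actuation status updated") }
func refreshCache()          { fmt.Println("node group cache refreshed") }

func main() { runIteration() }
```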
fix: set `replicated` to true if the controller ref is set
- forgot to add this in the last commit
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit f8f458295d)
fix: remove `checkReferences`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 5df6e31f8b)
test(drain): add test for custom controller pod
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add flag to allow scale down on custom controller pods
- set to `false` by default
- the default will be changed to `true` in the future
- for now, we want to ensure backwards compatibility and only enable the feature if the flag is explicitly set to `true`
- TODO: look into adding unit tests for this code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
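A minimal sketch of the default described in this commit, using the flag name that appears in the commits below; the registration uses Go's standard flag package purely for illustration and is not the actual Cluster Autoscaler wiring.

```go
package main

import (
	"flag"
	"fmt"
)

// The flag defaults to false so existing behavior is preserved; users opt
// in by setting it explicitly. (Illustrative registration only; the real
// autoscaler wires its flags differently.)
var allowScaleDownOnCustomControllerOwnedPods = flag.Bool(
	"allow-scale-down-on-custom-controller-owned-pods",
	false,
	"If true, nodes running pods owned by custom controllers can be scaled down",
)

func main() {
	flag.Parse()
	fmt.Println("allow-scale-down-on-custom-controller-owned-pods:", *allowScaleDownOnCustomControllerOwnedPods)
}
```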
fix: remove the `at` symbol prefix from `vadasambar`
- to keep it consistent with previous such mentions in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils): run all drain tests twice
- once for `allowScaleDownOnCustomControllerOwnedPods=false`
- and once for `allowScaleDownOnCustomControllerOwnedPods=true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs(utils): add description for `testOpts` struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update FAQ with info about `allow-scale-down-on-custom-controller-owned-pods` flag
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allow-scale-down-on-custom-controller-owned-pods` -> `skip-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allowScaleDownOnCustomControllerOwnedPods` -> `skipNodesWithCustomControllerPods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils/drain): fix failing tests
- refactor code to add custom controller pod test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix long code comments
- clean up print statements
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: move `expectFatal` right above where it is used
- makes the code easier to read
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix code comment wording
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: address PR comments
- abstract legacy code to check for replicated pods into a separate function so that it's easier to remove in the future
- fix param info in the FAQ.md
- simplify tests and remove the global variable used in the tests
- rename `--skip-nodes-with-custom-controller-pods` -> `--scale-down-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename flag `--scale-down-nodes-with-custom-controller-pods` -> `--skip-nodes-with-custom-controller-pods`
- refactor tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update flag info
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: update a flag name that was missed on one line of the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use `ControllerRef()` directly instead of `controllerRef`
- we don't need an extra variable
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: create consolidated test cases
- by looping over and tweaking shared test cases
- so that we don't have to duplicate the shared test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: append test flag to shared test description
- so that the failed test is easy to identify
- shallow copy tests and add comments so that others do the same
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
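A compact sketch of the test pattern described in the commits above; the struct fields and case contents are illustrative, not the real drain test fixtures.

```go
package drain_test

import (
	"fmt"
	"testing"
)

// testOpts stands in for the real shared test case struct used by the
// drain tests; only the fields needed for the sketch are included.
type testOpts struct {
	description string
	pods        []string
}

func TestDrainSharedCases(t *testing.T) {
	sharedCases := []testOpts{
		{description: "RC-owned pod", pods: []string{"rc-pod"}},
		{description: "custom-controller-owned pod", pods: []string{"custom-pod"}},
	}
	for _, skip := range []bool{false, true} {
		for _, shared := range sharedCases {
			// Shallow copy the shared case so per-flag tweaks (like the
			// description suffix below) don't leak into other iterations.
			tc := shared
			tc.description = fmt.Sprintf("%s, skipNodesWithCustomControllerPods=%v", tc.description, skip)
			t.Run(tc.description, func(t *testing.T) {
				_ = skip // the real test passes this into the drain options
				_ = tc.pods
			})
		}
	}
}
```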
* Rename scaleup.Manager to scaleup.Orchestrator
* Remove factory and add Initialize function
* Rename the wrapper package to orchestrator
* Rename NewOrchestrator func to just New
* Simplify the ScaleUp* functions parameter list
* Introduce the ScaleUpManagerFactory to allow greater expandability
* Simplify helper functions in scale up wrapper
* Make the SkippedReasons public and move those to a dedicated file
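Roughly, the post-refactor shape looks something like the sketch below; the method signatures and the Options type are assumptions made for illustration, not the actual cluster-autoscaler API.

```go
package scaleup

// Orchestrator is the renamed scaleup.Manager.
type Orchestrator interface {
	// Initialize replaces the removed factory: dependencies are injected
	// after construction rather than at creation time.
	Initialize(opts Options)
	// ScaleUp stands in for the simplified ScaleUp* parameter lists; the
	// real signatures are not spelled out in this log.
	ScaleUp(unschedulablePods []string) error
}

// Options stands in for the real dependency set (autoscaling context,
// processors, estimators, ...).
type Options struct{}

// New replaces the old NewOrchestrator constructor.
func New() Orchestrator { return &orchestrator{} }

type orchestrator struct{ opts Options }

func (o *orchestrator) Initialize(opts Options) { o.opts = opts }

func (o *orchestrator) ScaleUp(unschedulablePods []string) error { return nil }
```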
Without this change, aggressive settings could cause scale-down to remove
registered upcoming nodes before they have a chance to become ready
(and the time that takes should be unrelated to the scale-down settings).
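A minimal sketch of the guard described above; the node type and fields are illustrative stand-ins, not the autoscaler's own structures.

```go
package main

import "fmt"

// Registered-but-not-yet-ready "upcoming" nodes are excluded from
// scale-down candidates, because the time they need to become ready has
// nothing to do with how aggressive the scale-down settings are.
type node struct {
	name  string
	ready bool
}

func scaleDownCandidates(nodes []node) []node {
	var candidates []node
	for _, n := range nodes {
		if !n.ready {
			// Upcoming node: skip it instead of letting aggressive
			// scale-down settings delete it before it becomes ready.
			continue
		}
		candidates = append(candidates, n)
	}
	return candidates
}

func main() {
	nodes := []node{
		{name: "ready-node", ready: true},
		{name: "upcoming-node", ready: false},
	}
	fmt.Println(scaleDownCandidates(nodes)) // only ready-node remains
}
```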
The scaledown:nodedeletion metric duration was incorrectly computed relative to the start of the RunOnce routine instead of from the actual start of the deletion. Work done early in the routine (like a long cloud provider refresh) would incorrectly skew the node deletion duration.
Signed-off-by: Domenic Bozzuto <dom.bozzuto@datadoghq.com>
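A minimal sketch of the fix described above; the function names are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// The node deletion duration is measured from the moment deletion actually
// starts, not from the start of the whole RunOnce iteration, so earlier
// work such as a slow cloud provider refresh no longer skews the metric.
func runOnce() {
	refreshCloudProvider() // may be slow; must not count towards deletion time

	deletionStart := time.Now()
	deleteNodes()
	fmt.Println("node deletion took", time.Since(deletionStart))
}

func refreshCloudProvider() { time.Sleep(10 * time.Millisecond) }
func deleteNodes()          { time.Sleep(5 * time.Millisecond) }

func main() { runOnce() }
```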
Node state is refreshed and checked again before deleting the node.
This gives kube-scheduler time to acknowledge that the nodes' state has
changed and to stop scheduling pods onto them.
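A rough sketch of the re-check described above; the helpers and the nodeState type are stand-ins, not real autoscaler functions.

```go
package main

import (
	"errors"
	"fmt"
)

// Node state is fetched again right before deletion so that, after
// tainting, kube-scheduler has had time to observe the change and stop
// placing pods on the node.
type nodeState struct {
	name        string
	tainted     bool
	newPodCount int
}

func refreshNodeState(name string) nodeState {
	// The real code would re-read the node and its pods from the cluster;
	// a fixed value is returned here purely for illustration.
	return nodeState{name: name, tainted: true, newPodCount: 0}
}

func deleteNodeSafely(name string) error {
	n := refreshNodeState(name) // refresh instead of trusting the earlier snapshot
	if !n.tainted || n.newPodCount > 0 {
		return errors.New("node state changed, aborting deletion")
	}
	fmt.Println("deleting", n.name)
	return nil
}

func main() {
	if err := deleteNodeSafely("node-1"); err != nil {
		fmt.Println(err)
	}
}
```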
Various cloud providers' `NodeGroupForNode()` implementations (including
aws, azure, and gce) can return a `nil` error _and_ a `nil` nodegroup.
E.g. we're seeing AWS return that on failed upscales on live clusters.
Checking that `deleteCreatedNodesWithErrors` doesn't return an error is
not enough to safely dereference the nodegroup (as returned by
`NodeGroupForNode()`) by calling `nodegroup.Id()`.
In that situation, logging and returning early seems the safest option,
giving the various caches (e.g. the clusterstateregistry's and the cloud
provider's) the opportunity to eventually converge.
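A minimal sketch of the defensive check described above; the interfaces and the fake provider are stand-ins for the real cloudprovider types.

```go
package main

import (
	"fmt"
	"log"
)

type NodeGroup interface{ Id() string }

type CloudProvider interface {
	NodeGroupForNode(nodeName string) (NodeGroup, error)
}

func deleteCreatedNodeWithErrors(p CloudProvider, nodeName string) error {
	nodeGroup, err := p.NodeGroupForNode(nodeName)
	if err != nil {
		return err
	}
	// Some implementations (aws, azure, gce, ...) can return a nil node
	// group together with a nil error; dereferencing it would panic, so
	// log and return early to give the caches a chance to converge.
	if nodeGroup == nil {
		log.Printf("no node group found for node %q, skipping", nodeName)
		return nil
	}
	fmt.Println("deleting node from node group", nodeGroup.Id())
	return nil
}

// fakeProvider mimics a provider that cannot map a node after a failed upscale.
type fakeProvider struct{}

func (fakeProvider) NodeGroupForNode(string) (NodeGroup, error) { return nil, nil }

func main() {
	_ = deleteCreatedNodeWithErrors(fakeProvider{}, "ip-10-0-0-1")
}
```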
NodeDeletionTracker is now incremented asynchronously
for drained nodes, instead of synchronously. This shouldn't
change actual behavior, but some tests depended on the previous
ordering, so they had to be adapted.
The switch aims to mostly be a semantic no-op, with
the following exceptions:
* Nodes that fail to be tainted won't be included in
NodeDeleteResults, since they are now tainted
synchronously.
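A minimal sketch of the change described above; the tracker here is a simplified stand-in for NodeDeletionTracker.

```go
package main

import (
	"fmt"
	"sync"
)

// For drained nodes the tracker is now updated from the asynchronous
// deletion goroutine rather than synchronously by the caller, which is why
// tests that observed the counter immediately after scheduling a deletion
// had to be adapted.
type deletionTracker struct {
	mu      sync.Mutex
	drained int
}

func (t *deletionTracker) StartDeletionWithDrain() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.drained++
}

func deleteDrainedNode(t *deletionTracker, wg *sync.WaitGroup) {
	defer wg.Done()
	t.StartDeletionWithDrain() // the increment now happens on the async path
}

func main() {
	t := &deletionTracker{}
	var wg sync.WaitGroup
	wg.Add(1)
	go deleteDrainedNode(t, &wg)
	wg.Wait()
	fmt.Println("drained deletions tracked:", t.drained)
}
```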
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting
nodes into the cache. One minute should be more than enough, except for
some pathological edge cases.
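A minimal sketch of the grace period described above; the constant and field names are illustrative, not the actual cache implementation.

```go
package main

import (
	"fmt"
	"time"
)

// A freshly registered node is only accepted into the per-node-group cache
// once it has existed long enough for DaemonSet pods to have been
// scheduled onto it.
const nodeCacheGracePeriod = time.Minute

type cachedNode struct {
	name         string
	registeredAt time.Time
}

func acceptIntoCache(n cachedNode, now time.Time) bool {
	return now.Sub(n.registeredAt) >= nodeCacheGracePeriod
}

func main() {
	now := time.Now()
	fresh := cachedNode{name: "new-node", registeredAt: now.Add(-10 * time.Second)}
	settled := cachedNode{name: "old-node", registeredAt: now.Add(-5 * time.Minute)}
	fmt.Println(fresh.name, acceptIntoCache(fresh, now))     // false: still within the grace period
	fmt.Println(settled.name, acceptIntoCache(settled, now)) // true: DaemonSet pods had time to land
}
```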