Commit Graph

300 Commits

Author SHA1 Message Date
Bartłomiej Wróblewski b5ead036a8 Merge taint utils into one package, make taint modifying methods public 2023-02-13 11:29:45 +00:00
Kuba Tużnik 7e6762535b CA: stop passing registered upcoming nodes as scale-down candidates
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
2023-02-10 14:46:19 +01:00
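A minimal Go sketch of the idea behind this change, with made-up types and names (the real candidate selection in cluster-autoscaler is more involved): nodes that are registered but still upcoming are skipped when building the scale-down candidate list.

```go
package main

import "fmt"

// nodeState is a simplified stand-in for what cluster-autoscaler tracks
// per node; only the "upcoming" flag matters for this illustration.
type nodeState struct {
	name     string
	upcoming bool // registered but not Ready yet
}

// filterScaleDownCandidates drops registered-but-upcoming nodes from the
// candidate list, so aggressive scale-down settings can't remove them
// before they ever become ready.
func filterScaleDownCandidates(nodes []nodeState) []string {
	var candidates []string
	for _, n := range nodes {
		if n.upcoming {
			continue
		}
		candidates = append(candidates, n.name)
	}
	return candidates
}

func main() {
	nodes := []nodeState{
		{name: "node-ready", upcoming: false},
		{name: "node-booting", upcoming: true},
	}
	fmt.Println(filterScaleDownCandidates(nodes)) // [node-ready]
}
```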
dom.bozzuto 1150fcd27a Fix scaledown:nodedeletion metric calculation
The scaledown:nodedeletion metric duration was incorrectly computed relative to the start of the RunOnce routine instead of the actual start of the deletion. Work done early in the routine (like a long cloud provider refresh) would incorrectly skew the node-deletion duration.

Signed-off-by: Domenic Bozzuto <dom.bozzuto@datadoghq.com>
2023-02-02 12:03:38 -05:00
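A hedged Go sketch of the timing fix described above, using hypothetical helper names: the metric observation is taken from the moment deletion starts, not from the start of the whole iteration.

```go
package main

import (
	"fmt"
	"time"
)

// recordNodeDeletionDuration illustrates the fix: the duration observed
// for the node-deletion metric is measured from the moment deletion
// actually starts, not from the start of the autoscaler iteration (which
// may include a slow cloud provider refresh before any deletion happens).
func recordNodeDeletionDuration(observe func(seconds float64), deleteNode func() error) error {
	start := time.Now() // start the clock only when deletion begins
	err := deleteNode()
	observe(time.Since(start).Seconds())
	return err
}

func main() {
	_ = recordNodeDeletionDuration(
		func(s float64) { fmt.Printf("nodedeletion duration: %.3fs\n", s) },
		func() error { time.Sleep(10 * time.Millisecond); return nil },
	)
}
```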
Yaroslava Serdiuk 97159df69b Add scale down candidates observer 2023-01-19 16:04:42 +00:00
yasin.lachiny 7a1668ef12 update prometheus metric min maxNodesCount and a.MaxNodesTotal
Signed-off-by: yasin.lachiny <yasin.lachiny@gmail.com>
2022-12-14 20:51:26 +01:00
yasin.lachiny 6d9fed5211 set cluster_autoscaler_max_nodes_count dynamically
Signed-off-by: yasin.lachiny <yasin.lachiny@gmail.com>
2022-12-11 00:18:03 +01:00
Yaroslava Serdiuk ae45571af9 Create a Planner object if --parallelDrain=true 2022-12-07 11:36:05 +00:00
Xintong Liu 524886fca5 Support scaling up node groups to the configured min size if needed 2022-11-02 21:47:00 -07:00
Bartłomiej Wróblewski 4373c467fe Add ScaleDown.Actuator to AutoscalingContext 2022-11-02 13:12:25 +00:00
Daniel Kłobuszewski 18f2e67c4f Split out code from simulator package 2022-10-18 11:51:44 +02:00
Daniel Kłobuszewski 95fd1ed645 Remove ScaleDown dependency on clusterStateRegistry 2022-10-17 21:11:44 +02:00
Kubernetes Prow Robot dc73ea9076
Merge pull request #5235 from UiPath/fix_node_delete
Add option to wait for a period of time after node tainting/cordoning
2022-10-17 04:29:07 -07:00
Kubernetes Prow Robot d022e260a1
Merge pull request #4956 from damirda/feature/scale-up-delay-annotations
Add podScaleUpDelay annotation support
2022-10-13 09:29:02 -07:00
Alexandru Matei 0ee2a359e7 Add option to wait for a period of time after node tainting/cordoning
Node state is refreshed and checked again before deleting the node.
It gives kube-scheduler time to acknowledge that the nodes' state has
changed and to stop scheduling pods on them.
2022-10-13 10:37:56 +03:00
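A rough Go sketch of the flow this option introduces, with illustrative names and callbacks standing in for the real taint/refresh/delete plumbing: taint the node, wait a configurable period, re-read its state, and delete only if it still looks as expected.

```go
package main

import (
	"fmt"
	"time"
)

// node is a simplified stand-in; in cluster-autoscaler this would be a
// Node object re-read from the lister.
type node struct {
	name    string
	tainted bool
}

// deleteAfterTaintDelay taints the node, waits so kube-scheduler can
// observe the change, then refreshes the node state and only proceeds
// with deletion if the node is still tainted.
func deleteAfterTaintDelay(taint func() error, refresh func() (node, error), del func() error, delay time.Duration) error {
	if err := taint(); err != nil {
		return err
	}
	time.Sleep(delay) // give the scheduler time to stop placing pods here

	n, err := refresh()
	if err != nil {
		return err
	}
	if !n.tainted {
		return fmt.Errorf("node %s is no longer tainted, aborting deletion", n.name)
	}
	return del()
}

func main() {
	err := deleteAfterTaintDelay(
		func() error { fmt.Println("tainting node"); return nil },
		func() (node, error) { return node{name: "node-a", tainted: true}, nil },
		func() error { fmt.Println("deleting node"); return nil },
		50*time.Millisecond,
	)
	fmt.Println("result:", err)
}
```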
Kubernetes Prow Robot b3c6b60e1c
Merge pull request #5060 from yaroslava-serdiuk/deleting-in-batch
Introduce NodeDeleterBatcher to ScaleDown actuator
2022-09-22 10:11:06 -07:00
Yaroslava Serdiuk 65b0d78e6e Introduce NodeDeleterBatcher to ScaleDown actuator 2022-09-22 16:19:45 +00:00
Clint Fooken 6edb3f26b8 Modifying taint removal logic on startup to consider all nodes instead of ready nodes. 2022-09-19 11:37:38 -07:00
Damir Markovic 11d150e920 Add podScaleUpDelay annotation support 2022-09-05 20:24:19 +02:00
Daniel Kłobuszewski 66bfe55077
Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes" 2022-07-13 10:08:03 +02:00
Kubernetes Prow Robot af5fb0722b
Merge pull request #4896 from fookenc/master
Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes
2022-07-04 05:13:24 -07:00
Benjamin Pineau a726944273 Don't deref nil nodegroup in deleteCreatedNodesWithErrors
Various cloud providers' `NodeGroupForNode()` implementations (including
aws, azure, and gce) can return a `nil` error _and_ a `nil` nodegroup.
E.g. we're seeing AWS return that on failed upscales on live clusters.
Checking that `deleteCreatedNodesWithErrors` doesn't return an error is
not enough to safely dereference the nodegroup (as returned by
`NodeGroupForNode()`) by calling nodegroup.Id().

In that situation, logging and returning early seems the safest option,
to give various caches (eg. clusterstateregistry's and cloud provider's)
the opportunity to eventually converge.
2022-05-30 18:47:14 +02:00
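A minimal Go sketch of the guard described above, using a stand-in interface rather than the real cloudprovider API: both the error and the node group must be checked before calling Id().

```go
package main

import (
	"fmt"
	"log"
)

// NodeGroup mirrors the minimal part of the cloud provider interface
// relevant here; names are illustrative, not the real API.
type NodeGroup interface {
	Id() string
}

// nodeGroupForNode stands in for a cloud provider's NodeGroupForNode():
// per the commit message, implementations may return a nil error AND a
// nil node group (e.g. AWS on failed upscales).
func nodeGroupForNode(name string) (NodeGroup, error) {
	return nil, nil // simulate the problematic case
}

func handleCreatedNodeWithErrors(nodeName string) {
	ng, err := nodeGroupForNode(nodeName)
	if err != nil {
		log.Printf("failed to look up node group for %s: %v", nodeName, err)
		return
	}
	// Checking the error alone is not enough: guard against a nil node
	// group before dereferencing it, and return early so the various
	// caches get a chance to converge on a later loop.
	if ng == nil {
		log.Printf("no node group found for %s, skipping", nodeName)
		return
	}
	fmt.Println("deleting failed node from group", ng.Id())
}

func main() {
	handleCreatedNodeWithErrors("ip-10-0-0-1")
}
```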
Kuba Tużnik 6bd2432894 CA: switch legacy ScaleDown to use the new Actuator
NodeDeletionTracker is now incremented asynchronously
for drained nodes, instead of synchronously. This shouldn't
change anything in actual behavior, but some tests
depended on that, so they had to be adapted.

The switch aims to mostly be a semantic no-op, with
the following exception:
* Nodes that fail to be tainted won't be included in
  NodeDeleteResults, since they are now tainted
  synchronously.
2022-05-27 15:13:44 +02:00
Kuba Tużnik bf89c74572 CA: Extract deletion utils out of legacy scale-down
Function signatures are simplified to take the whole
*AutoscalingContext object instead of its individual
fields.
2022-05-26 16:55:59 +02:00
Clint Fooken a278255519 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts. 2022-05-17 12:37:42 -07:00
Daniel Kłobuszewski d0f8cc7806 Move the condition for ScaleDownInProgress to legacy scaledown code 2022-05-04 09:24:10 +02:00
Daniel Kłobuszewski c550b77020 Make NodeDeletionTracker implement ActuationStatus interface 2022-04-28 17:08:10 +02:00
Daniel Kłobuszewski 7f8b2da9e3 Separate ScaleDown logic with a new interface 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 5a78f49bc2 Move soft tainting logic to a separate package 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 7686a1f326 Move existing ScaleDown code to a separate package 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski a55135fb47 Stop referencing unneededNodes in static_autoscaler 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 627284bdae Remove direct access to ScaleDown fields 2022-04-26 08:48:45 +02:00
Yaroslava Serdiuk 8a7b99c7eb Continue CA loop when unregistered nodes were removed 2022-04-12 07:49:42 +00:00
Kubernetes Prow Robot b64d2949a5
Merge pull request #4633 from jayantjain93/debugging-snapshot-1
CA: Debugging snapshot adding a new field for TemplateNode.
2022-01-27 03:02:25 -08:00
Daniel Kłobuszewski 9944137fae Don't cache NodeInfo for recently Ready nodes
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting nodes
into the cache. 1 minute should be more than enough, except for some pathological
edge cases.
2022-01-26 20:18:53 +01:00
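A small Go sketch of the grace-period check, assuming a hypothetical readySince timestamp taken from the node's Ready condition; the real caching logic lives elsewhere in cluster-autoscaler.

```go
package main

import (
	"fmt"
	"time"
)

// node is an illustrative structure; readySince would come from the
// node's Ready condition transition time.
type node struct {
	name       string
	readySince time.Time
}

// nodeInfoCacheGracePeriod mirrors the idea from the commit: don't cache
// a recently-Ready node, so DaemonSet pods have time to get scheduled on
// it before it is used as a template for future nodes.
const nodeInfoCacheGracePeriod = 1 * time.Minute

func shouldCacheNodeInfo(n node, now time.Time) bool {
	return now.Sub(n.readySince) >= nodeInfoCacheGracePeriod
}

func main() {
	now := time.Now()
	fresh := node{name: "node-a", readySince: now.Add(-20 * time.Second)}
	settled := node{name: "node-b", readySince: now.Add(-5 * time.Minute)}
	fmt.Println(fresh.name, shouldCacheNodeInfo(fresh, now))     // false: too recent
	fmt.Println(settled.name, shouldCacheNodeInfo(settled, now)) // true
}
```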
Jayant Jain 537e07fdb1 CA: Debugging snapshot adding a new field for TemplateNode. This captures all the templates for nodegroups present 2022-01-24 17:12:57 +00:00
Jayant Jain 729038ff2d Adding support for Debugging Snapshot 2021-12-30 09:08:05 +00:00
Jayant Jain da5ff3d971 Introduce Empty Cluster Processor
This refactors CA's handling of cases when the cluster is empty/not ready into a processor in empty_cluster_processor.go
2021-10-13 13:30:30 +00:00
Maciek Pytel a0109324a2 Change parameter order of TemplateNodeInfoProvider
Every other processor (and, I think, every function in CA?) that takes
AutoscalingContext has it as the first parameter. Changing the new processor
for consistency.
2021-09-13 15:08:14 +02:00
Benjamin Pineau 8485cf2052 Move GetNodeInfosForGroups to its own processor
Supports providing different NodeInfo sources (either upstream or in
local forks, e.g. to properly implement variants like in #4000).

This also moves a large and specialized code chunk out of core, and removes
the need to maintain and pass the GetNodeInfosForGroups() cache from the side,
as processors can hold their states themselves.

No functional changes to GetNodeInfosForGroups(), outside mechanical changes
due to the move: calling a few util functions that now live in the core/utils
package, picking context attributes (the processor takes the context as an arg
rather than ListerRegistry + PredicateChecker + CloudProvider), and using the
builtin cache rather than receiving it from arguments.
2021-08-16 19:43:10 +02:00
Kubernetes Prow Robot 9f84d391f6
Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics
[cluster-autoscaler] Publish node group min/max metrics
2021-07-05 07:38:54 -07:00
Bartłomiej Wróblewski 5076047bf8 Skip iteration loop if node creation failed 2021-06-16 14:40:15 +00:00
Benjamin Pineau 986fe3ae20 Metric for CloudProvider.Refresh() duration
This function can take a variable amount of time due to various
conditions (e.g. many nodegroup changes causing forced refreshes,
cache time-to-live expiries, ...).

Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (e.g. throttling from the cloud provider)
affecting cluster-autoscaler.
2021-05-31 15:55:28 +02:00
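A hedged sketch of how such a duration metric could be recorded with the Prometheus Go client; the metric name, buckets, and helper names here are illustrative, not necessarily what cluster-autoscaler registers.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// refreshDuration is an illustrative histogram; the real metric name and
// buckets used by cluster-autoscaler may differ.
var refreshDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cluster_autoscaler",
	Name:      "cloudprovider_refresh_duration_seconds",
	Help:      "Time spent in CloudProvider.Refresh().",
	Buckets:   prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(refreshDuration)
}

// refresh stands in for CloudProvider.Refresh(); timing it lets slow
// refreshes (e.g. cloud API throttling) show up in monitoring.
func refresh() error {
	time.Sleep(25 * time.Millisecond)
	return nil
}

func main() {
	start := time.Now()
	err := refresh()
	refreshDuration.Observe(time.Since(start).Seconds())
	fmt.Println("refresh done, err:", err)
}
```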
Kubernetes Prow Robot 02985973c6
Merge pull request #4104 from brett-elliott/stopcooldown
Don't start CA in cooldown mode.
2021-05-27 09:12:23 -07:00
Brett Elliott 1880fe6937 Don't start CA in cooldown mode. 2021-05-27 17:53:52 +02:00
Amr Hanafi (MAHDI)) 3ac32b817c Update node group min/max on cloud provider refresh 2021-05-20 17:36:51 -07:00
Benjamin Pineau 030a2152b0 Fix templated nodeinfo name collisions in BinpackingNodeEstimator
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great as that fixes a long-standing bug.

Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.

Later, when it looks for nodes fitting unschedulable pods (when upcoming
nodes don't satisfy those, e.g. FitsAnyNodeMatching failing due to node
capacity or pod anti-affinity), the binpacking estimator will also build
virtual nodes and place them in a snapshot fork to evaluate scheduler predicates.

Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes
names for nodegroups having upcoming nodes.

But adding nodes by the same name in an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.

Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.
2021-05-19 12:05:40 +02:00
Kubernetes Prow Robot 2beea02a29
Merge pull request #3983 from elmiko/cluster-resource-consumption-metrics
Cluster resource consumption metrics
2021-05-13 15:32:04 -07:00
Bartłomiej Wróblewski 1698e0e583 Separate and refactor custom resources logic 2021-04-07 10:31:11 +00:00
Michael McCune a24ea6c66b add cluster cores and memory bytes count metrics
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
2021-04-06 10:35:21 -04:00
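An illustrative Go sketch of how the four listed metrics could be declared with the Prometheus client; the label layout (a "direction" label distinguishing minimum/maximum), the help strings, and the example values are assumptions, not the real definitions.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Gauge definitions matching the four metric names from the commit
// message; the structure here is illustrative only.
var (
	cpuLimitsCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cpu_limits_cores",
		Help: "Minimum and maximum number of cores in the cluster.",
	}, []string{"direction"})

	cpuCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_cpu_current_cores",
		Help: "Current number of cores in the cluster.",
	})

	memoryLimitsBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_memory_limits_bytes",
		Help: "Minimum and maximum bytes of memory in the cluster.",
	}, []string{"direction"})

	memoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_memory_current_bytes",
		Help: "Current bytes of memory in the cluster.",
	})
)

func main() {
	prometheus.MustRegister(cpuLimitsCores, cpuCurrentCores, memoryLimitsBytes, memoryCurrentBytes)

	// Example observations: a cluster currently at 8 cores / 32 GiB,
	// with limits of 4..64 cores and 16..256 GiB.
	cpuLimitsCores.WithLabelValues("minimum").Set(4)
	cpuLimitsCores.WithLabelValues("maximum").Set(64)
	cpuCurrentCores.Set(8)
	memoryLimitsBytes.WithLabelValues("minimum").Set(16 * 1024 * 1024 * 1024)
	memoryLimitsBytes.WithLabelValues("maximum").Set(256 * 1024 * 1024 * 1024)
	memoryCurrentBytes.Set(32 * 1024 * 1024 * 1024)
}
```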
Kubernetes Prow Robot 43ab030969
Merge pull request #3888 from mrak/master
Allow name of cluster-autoscaler status ConfigMap to be specified
2021-03-11 03:22:24 -08:00
Michael McCune 7ecf933e7b add a metric for unregistered nodes removed by cluster autoscaler
This change adds a new metric which counts the number of nodes removed
by the cluster autoscaler due to being unregistered with kubernetes.

User Story

As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
2021-03-04 19:23:03 -05:00
Eric Mrak and Brett Kochendorfer 43dd34074e Allow name of cluster-autoscaler status ConfigMap to be specified
This allows us to run two instances of cluster-autoscaler in our
cluster, targeting two different types of autoscaling groups that
require different command-line settings to be passed.
2021-02-17 21:52:54 +00:00
Kubernetes Prow Robot 1fc6705724
Merge pull request #3690 from evgenii-petrov-arrival/master
Add unremovable_nodes_count metric
2021-02-17 04:13:06 -08:00
Maciek Pytel 9831623810 Set different hostname label for upcoming nodes
Function copying template node to use for upcoming nodes was
not changing the hostname label, meaning that features relying on
this label (e.g. pod anti-affinity on hostname topology) would
treat all upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods
using such features. The analogous function in binpacking didn't
have the same bug (but it didn't set unique UIDs or pod names).
I extracted the functionality to a util function used in both
places to avoid the two functions getting out of sync again.
2021-02-12 19:41:04 +01:00
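A simplified Go sketch of the fix, using a made-up templateNode type instead of the real NodeInfo/Node structures: each copy gets a unique name and a matching kubernetes.io/hostname label, so hostname-based scheduling features see distinct nodes.

```go
package main

import "fmt"

// templateNode is a simplified stand-in for a node template; only the
// labels matter for this illustration.
type templateNode struct {
	name   string
	labels map[string]string
}

// deepCopyTemplateNode sketches the shared util function described above:
// every copy made for an upcoming node gets a unique name AND a matching,
// unique kubernetes.io/hostname label.
func deepCopyTemplateNode(tmpl templateNode, index int) templateNode {
	name := fmt.Sprintf("%s-upcoming-%d", tmpl.name, index)
	labels := make(map[string]string, len(tmpl.labels))
	for k, v := range tmpl.labels {
		labels[k] = v
	}
	labels["kubernetes.io/hostname"] = name // unique per copy
	return templateNode{name: name, labels: labels}
}

func main() {
	tmpl := templateNode{
		name:   "ng-1-template",
		labels: map[string]string{"kubernetes.io/hostname": "ng-1-template"},
	}
	for i := 0; i < 3; i++ {
		n := deepCopyTemplateNode(tmpl, i)
		fmt.Println(n.name, n.labels["kubernetes.io/hostname"])
	}
}
```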
Evgenii Petrov b6f5d5567d Add unremovable_nodes_count metric 2021-02-12 15:47:34 +00:00
Maciek Pytel 3e42b26a22 Per NodeGroup config for scale-down options
This is the implementation of
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
2021-01-25 11:00:17 +01:00
Kubernetes Prow Robot 58be2b7505
Merge pull request #3649 from ClearTax/cordon-node-issue-3648
Adding functionality to cordon the node before destroying it.
2021-01-14 04:19:04 -08:00
atul 7670d7b6af Adding functionality to cordon the node before destroying it. This helps the load balancer remove the node from its healthy hosts (ALB does have this support).
This won't fix the 502 issue completely, as the node still has to live for some time after cordoning to serve in-flight requests, but the load balancer can be configured to remove cordoned nodes from the healthy host list.
This feature is enabled by the cordon-node-before-terminating flag, with a default value of false to retain existing behavior.
2021-01-14 17:21:37 +05:30
Bartłomiej Wróblewski 0fb897b839 Update imports after scheduler scheduler/framework/v1alpha1 removal 2020-11-30 10:48:52 +00:00
Jakub Tużnik bf18d57871 Remove ScaleDownNodeDeleted status since we no longer delete nodes synchronously 2020-10-01 11:12:45 +02:00
Jakub Tużnik 3958c6645d Add an annotation identifying upcoming nodes 2020-07-24 15:20:34 +02:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Jakub Tużnik 73a5cdf928 Address recent breaking changes in scheduler
The following things changed in scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
2020-04-24 17:54:47 +02:00
Jakub Tużnik 8f1efc9866 Add NodeInfoProcessor for processing nodeInfosForNodeGroups 2020-03-20 15:19:18 +01:00
Łukasz Osipiuk a6023265e7 Add clarifying comment regarding podDestination and scaleDownCandidates variables 2020-03-10 15:18:52 +01:00
Aleksandra Malinowska ce18f7119c change order of arguments for TryToScaleDown 2020-03-10 11:36:57 +01:00
Aleksandra Malinowska 0b7c45e88a stop passing scheduled pods around 2020-03-03 16:23:49 +01:00
Aleksandra Malinowska 572bad61ce use nodes from snapshot in scale down 2020-03-03 16:23:49 +01:00
Aleksandra Malinowska 9c6a0f9aab Filter out expendable pods before initializing snapshot 2020-03-03 12:05:58 +01:00
Kubernetes Prow Robot dbbd4572af
Merge pull request #2861 from aleksandra-malinowska/delta-snapshot-15
Cleanup todo
2020-03-02 05:52:44 -08:00
Aleksandra Malinowska 0c13ce7248 add pods from upcoming nodes to snapshot 2020-02-27 14:12:31 +01:00
Aleksandra Malinowska 7ac3d27cf7 cleanup todo - no op 2020-02-27 11:13:37 +01:00
Julien Balestra 628128f65e cluster-autoscaler/taints: refactor current taint logic into the same package
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2020-02-25 13:57:23 +01:00
Julien Balestra af270b05f6 cluster-autoscaler/taints: ignore taints on existing nodes
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2020-02-25 13:55:17 +01:00
Kubernetes Prow Robot bbeead26ac
Merge pull request #2853 from aleksandra-malinowska/fix-ifs
Cleanup ifs in static autoscaler
2020-02-21 06:23:34 -08:00
Aleksandra Malinowska c4d376b9c2 Cleanup ifs in static autoscaler 2020-02-21 15:03:01 +01:00
Aleksandra Malinowska 468061dcfc move initializing snapshot after empty cluster check and API calls 2020-02-21 14:50:27 +01:00
Kubernetes Prow Robot af1dd84305
Merge pull request #2799 from aleksandra-malinowska/delta-snapshot-4
Add delta snapshot implementation
2020-02-14 09:20:17 -08:00
Jakub Tużnik 7a188ab50d Provide ScaleDownStatusProcessor with info about unremovable nodes 2020-02-11 15:27:33 +01:00
Aleksandra Malinowska 9c018ddb7a Cleanup cluster snapshot interface 2020-02-05 13:33:03 +01:00
Łukasz Osipiuk 4b30a6f499 Rename propagateClusterSnapshot to initializeClusterSnapshot 2020-02-04 20:52:08 +01:00
Łukasz Osipiuk 6ed2636f10 Drop PredicateChecker.SnapshotClusterState 2020-02-04 20:51:52 +01:00
Łukasz Osipiuk 98efd05b4b Do not add Pods pointing to nonexistent nodes to snapshot 2020-02-04 20:51:49 +01:00
Łukasz Osipiuk d7770e3044 Use ClusterSnapshot in ScaleDown 2020-02-04 20:51:48 +01:00
Łukasz Osipiuk 9bb2fd15d7 Add TODO 2020-02-04 20:51:42 +01:00
Łukasz Osipiuk 69800ab176 Simulate scheduling of pods waiting for preemption in ClusterSnapshot 2020-02-04 20:51:37 +01:00
Łukasz Osipiuk d9891ae3ad Simplify PodListProcessor interface 2020-02-04 20:51:35 +01:00
Łukasz Osipiuk 7e62105cb9 Add upcoming nodes to ClusterSnapshot 2020-02-04 20:51:31 +01:00
Łukasz Osipiuk 83d1c4ff8a Add GetAllPods and GetAllNodes to ClusterSnapshot 2020-02-04 20:51:30 +01:00
Łukasz Osipiuk fa2c6e4d9e Propagate cluster state to ClusterSnapshot 2020-02-04 20:51:27 +01:00
Łukasz Osipiuk 036103c553 Add ClusterSnapshot to AutoscalingContext 2020-02-04 20:51:26 +01:00
Łukasz Osipiuk 373c558303 Extract PredicateChecker interface 2020-02-04 20:51:18 +01:00
Łukasz Osipiuk b01f2fca8f Drop ConfigurePredicateCheckerForLoop 2020-02-04 20:51:14 +01:00
dasydong 68433abb7c Remove duplicate comments 2019-12-28 01:06:22 +08:00
Kubernetes Prow Robot f6ed9c114a
Merge pull request #2588 from losipiuk/lo/snapshot
Snapshot cluster state for scheduler every loop
2019-11-28 05:25:03 -08:00
Łukasz Osipiuk b67854e800 Snapshot cluster state for scheduler every loop
Change-Id: If9d162b83ccc914fe1b02e4689bfe1f4b264407f
2019-11-28 14:02:08 +01:00
Łukasz Osipiuk 17a7bc5164 Ignore NominatedNodeName on Pod if node is gone
Change-Id: I4a119f46e55ca2223f9f0fdd3e75ce3f279e293b
2019-11-27 20:26:00 +01:00
Vivek Bagade 910e75365c remove temporary nodes logic 2019-11-12 11:58:29 +01:00
Jarvis-Zhou 7c9d6e3518 Do not assign return values to variables when not needed 2019-10-25 19:28:00 +08:00
Łukasz Osipiuk 7f083d2393 Move core/utils.go to separate package and split into multiple files 2019-10-22 14:23:40 +02:00