There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by adding a grace period before accepting nodes into
the cache. One minute should be more than enough, except for some pathological
edge cases.
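A minimal sketch of how such a grace period could look (the one-minute
threshold is from above; the function name and placement are illustrative
assumptions, not the actual CA code):

```go
import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// nodeInfoCacheGracePeriod gives DaemonSet pods time to get scheduled on a
// fresh node before that node is cached as a template for its node group.
const nodeInfoCacheGracePeriod = time.Minute

// shouldCacheNode reports whether a node is old enough for caching.
func shouldCacheNode(node *apiv1.Node, now time.Time) bool {
	return now.Sub(node.CreationTimestamp.Time) >= nodeInfoCacheGracePeriod
}
```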
Every other processor (and, I think, every function in CA?) that takes
AutoscalingContext has it as its first parameter. Change the new processor
to match, for consistency.
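For illustration only (the processor and its parameters are hypothetical),
the resulting shape is:

```go
import (
	"k8s.io/autoscaler/cluster-autoscaler/context"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

type nodeInfoProcessor struct{}

// AutoscalingContext comes first, matching every other processor in CA.
func (p *nodeInfoProcessor) Process(
	ctx *context.AutoscalingContext,
	nodeInfos map[string]*schedulerframework.NodeInfo,
) (map[string]*schedulerframework.NodeInfo, error) {
	return nodeInfos, nil
}
```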
Supports providing different NodeInfo sources (either upstream or in
local forks, e.g. to properly implement variants like in #4000).
This also moves a large and specialized chunk of code out of core, and removes
the need to maintain and pass the GetNodeInfosForGroups() cache from the side,
as processors can hold their own state.
No functional changes to GetNodeInfosForGroups(), apart from mechanical changes
due to the move: it now calls a few util functions that stayed in the
core/utils package, picks attributes off the context (the processor takes the
context as an argument rather than ListerRegistry + PredicateChecker +
CloudProvider), and uses the builtin cache rather than receiving it through
arguments.
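A rough sketch of the resulting shape (interface, method set, and field
names are assumptions for illustration, not the exact upstream API):

```go
import (
	"k8s.io/autoscaler/cluster-autoscaler/context"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// NodeInfoProcessor is the pluggable source: upstream ships a default
// implementation, and forks can substitute their own.
type NodeInfoProcessor interface {
	// Process returns a NodeInfo template per node group id.
	Process(ctx *context.AutoscalingContext) (map[string]*schedulerframework.NodeInfo, error)
	CleanUp()
}

// The default implementation owns its cache instead of having core
// maintain it and pass it in from the side.
type defaultNodeInfoProcessor struct {
	nodeInfoCache map[string]*schedulerframework.NodeInfo
}
```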
This function can take a variable amount of time due to various
conditions (e.g. many node group changes causing forced refreshes,
cache TTL expiries, ...).
Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (e.g. throttling from the cloud provider)
affecting cluster-autoscaler.
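A minimal sketch of observing that duration with the Prometheus Go client
(metric name and bucket choice are illustrative assumptions):

```go
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var nodeInfosDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cluster_autoscaler",
	Name:      "node_infos_duration_seconds", // assumed name
	Help:      "Time taken to build node infos for node groups.",
	Buckets:   prometheus.ExponentialBuckets(0.01, 2, 14),
})

// Typical usage: defer observeNodeInfosDuration(time.Now())
func observeNodeInfosDuration(start time.Time) {
	nodeInfosDuration.Observe(time.Since(start).Seconds())
}
```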
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great, as that fixed a long-standing bug.
Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshot with
generated nodeinfos and nodes having predictable names (template name
plus an incremental ordinal starting at 0) for upcoming nodes.
Later, when it looks for nodes fitting unschedulable pods (because the upcoming
nodes don't satisfy them: FitsAnyNodeMatching failing due to node capacity,
pod anti-affinity, ...), the binpacking estimator will also build virtual
nodes and place them in a snapshot fork to evaluate scheduler predicates.
Those temporary virtual nodes are built using the same pattern (template name
plus an index ordinal, also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfo/node
names for node groups having upcoming nodes.
But adding nodes with the same name to an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.
In practice this blocks re-upscales for node groups having upcoming nodes,
which can cause significant delays.
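One way out (an illustrative sketch, not necessarily the exact fix applied
here) is to give the estimator's temporary nodes names that cannot match
the upcoming nodes' template-name-plus-ordinal pattern:

```go
import (
	"fmt"
	"math/rand"
)

// virtualNodeName builds a name for a temporary binpacking node that cannot
// collide with an upcoming node named "<template>-<ordinal>". The random
// discriminator is an illustrative choice.
func virtualNodeName(templateName string, index int) string {
	return fmt.Sprintf("%s-binpacking-%d-%d", templateName, rand.Int63(), index)
}
```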
This change adds four metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.
The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`
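A hedged sketch of how the CPU pair might be declared with the Prometheus
Go client (the memory pair is analogous; the `direction` label is an
assumption of this sketch):

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Minimum and maximum core limits configured for the cluster.
	cpuLimitsCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cpu_limits_cores",
		Help:      "Minimum and maximum number of cores in the cluster.",
	}, []string{"direction"})

	// Current number of cores in the cluster.
	cpuCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_cpu_current_cores",
		Help:      "Current number of cores in the cluster.",
	})
)
```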
This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not documented there.
User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
This change adds a new metric that counts the number of nodes removed
by the cluster autoscaler due to being unregistered with Kubernetes.
User Story
As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with Kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues that may go
unnoticed elsewhere.
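A minimal sketch of such a counter (the exact metric name is an
assumption following the naming pattern above):

```go
import "github.com/prometheus/client_golang/prometheus"

// Incremented once per node deleted for failing to register in time:
// unregisteredNodesRemoved.Inc()
var unregisteredNodesRemoved = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cluster_autoscaler",
	Name:      "unregistered_nodes_removed_count", // assumed name
	Help:      "Number of nodes removed for failing to register with Kubernetes.",
})
```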
This allows us to run two instances of cluster-autoscaler in our
cluster, targeting two different types of autoscaling groups that
require different command-line settings to be passed.
The function copying the template node to use for upcoming nodes was
not changing the hostname label, meaning that features relying on
this label (e.g. pod anti-affinity on hostname topology) would
treat all upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods
using such features. The analogous function in binpacking didn't
have the same bug (but it didn't set a unique UID or pod names).
I extracted the functionality into a util function used in both
places to avoid the two functions getting out of sync again.
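A sketch of the shared helper's key steps (names are illustrative; the
point is the per-copy unique name, unique UID and, crucially, hostname
label):

```go
import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// deepCopyTemplateNode copies a template node, giving each copy a unique
// name, UID, and hostname label, so hostname-topology features
// (e.g. pod anti-affinity) see distinct nodes.
func deepCopyTemplateNode(template *apiv1.Node, index int) *apiv1.Node {
	node := template.DeepCopy()
	node.Name = fmt.Sprintf("%s-%d", template.Name, index)
	node.UID = types.UID(fmt.Sprintf("%s-%d", template.UID, index))
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels["kubernetes.io/hostname"] = node.Name // the missing piece
	return node
}
```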
This won't completely fix the 502 issue, as a node still has to live for
some time after cordoning to serve in-flight requests, but the load balancer
can be configured to remove cordoned nodes from its healthy host list.
This feature is enabled by the `cordon-node-before-terminating` flag, with a
default value of false to retain the existing behavior.
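For illustration, cordoning amounts to marking the node unschedulable
before termination (a minimal client-go sketch, not the CA code itself):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordonNode marks a node unschedulable so a load balancer configured to
// drop cordoned nodes stops routing new requests to it before termination.
func cordonNode(ctx context.Context, client kubernetes.Interface, name string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```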
The following things changed in scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
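For example, code that used to read pods straight off a NodeInfo now goes
through the PodInfo wrapper (a sketch assuming the post-move import path):

```go
import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// podsOf shows the adjusted access pattern: NodeInfo.Pods is a plain field
// now (not a getter) and holds *schedulerframework.PodInfo wrappers.
func podsOf(nodeInfo *schedulerframework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods))
	for _, podInfo := range nodeInfo.Pods {
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```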