Commit Graph

225 Commits

Author SHA1 Message Date
Daniel Kłobuszewski c550b77020 Make NodeDeletionTracker implement ActuationStatus interface 2022-04-28 17:08:10 +02:00
Daniel Kłobuszewski 7f8b2da9e3 Separate ScaleDown logic with a new interface 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 5a78f49bc2 Move soft tainting logic to a separate package 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 7686a1f326 Move existing ScaleDown code to a separate package 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski a55135fb47 Stop referencing unneededNodes in static_autoscaler 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 627284bdae Remove direct access to ScaleDown fields 2022-04-26 08:48:45 +02:00
Yaroslava Serdiuk 8a7b99c7eb Continue CA loop when unregistered nodes were removed 2022-04-12 07:49:42 +00:00
Kubernetes Prow Robot b64d2949a5
Merge pull request #4633 from jayantjain93/debugging-snapshot-1
CA: Debugging snapshot adding a new field for TemplateNode.
2022-01-27 03:02:25 -08:00
Daniel Kłobuszewski 9944137fae Don't cache NodeInfo for recently Ready nodes
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting nodes
into the cache. 1 minute should be more than enough, except for some
pathological edge cases.
2022-01-26 20:18:53 +01:00
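A minimal sketch of the grace-period idea from the commit above, assuming a hypothetical helper name rather than the actual Cluster Autoscaler code:

```go
package example

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// readyLongEnough reports whether the node has been Ready for at least
// gracePeriod, giving DaemonSet pods time to schedule onto it before its
// NodeInfo is cached as a template for future nodes in the node group.
// (Illustrative helper; the real check in Cluster Autoscaler may differ.)
func readyLongEnough(node *apiv1.Node, now time.Time, gracePeriod time.Duration) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady && cond.Status == apiv1.ConditionTrue {
			return now.Sub(cond.LastTransitionTime.Time) >= gracePeriod
		}
	}
	return false
}
```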
Jayant Jain 537e07fdb1 CA: Debugging snapshot adding a new field for TemplateNode. This captures the templates for all node groups present 2022-01-24 17:12:57 +00:00
Jayant Jain 729038ff2d Adding support for Debugging Snapshot 2021-12-30 09:08:05 +00:00
Jayant Jain da5ff3d971 Introduce Empty Cluster Processor
This refactors CA's handling of cases when the cluster is empty or not ready into a processor in empty_cluster_processor.go
2021-10-13 13:30:30 +00:00
Maciek Pytel a0109324a2 Change parameter order of TemplateNodeInfoProvider
Every other processor (and, I think, every function in CA) that takes
AutoscalingContext has it as the first parameter. Changing the new processor
for consistency.
2021-09-13 15:08:14 +02:00
Benjamin Pineau 8485cf2052 Move GetNodeInfosForGroups to its own processor
Supports providing different NodeInfo sources (either upstream or in
local forks, e.g. to properly implement variants like in #4000).

This also moves a large and specialized code chunk out of core, and removes
the need to maintain and pass the GetNodeInfosForGroups() cache from the side,
as processors can hold their states themselves.

No functional changes to GetNodeInfosForGroups(), outside mechanical changes
due to the move: call a few utility functions from the core/utils package,
pick attributes off the context (the processor takes the context as an argument
rather than ListerRegistry + PredicateChecker + CloudProvider), and use the
built-in cache rather than receiving it through arguments.
2021-08-16 19:43:10 +02:00
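A rough sketch of the processor shape this commit describes, with the context passed as an argument and the processor owning its own cache; the type and method names here are illustrative assumptions, not the exact upstream definitions:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// AutoscalingContext stands in for the real autoscaling context; what matters
// here is only that it bundles listers, the predicate checker and the cloud
// provider, so the processor no longer receives them individually.
type AutoscalingContext struct{}

// TemplateNodeInfoProvider is an illustrative interface for the processor:
// it returns a template NodeInfo per node group id and keeps any cache
// internally between calls.
type TemplateNodeInfoProvider interface {
	Process(ctx *AutoscalingContext, nodes []*apiv1.Node) (map[string]*schedulerframework.NodeInfo, error)
	CleanUp()
}
```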
Kubernetes Prow Robot 9f84d391f6
Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics
[cluster-autoscaler] Publish node group min/max metrics
2021-07-05 07:38:54 -07:00
Bartłomiej Wróblewski 5076047bf8 Skip iteration loop if node creation failed 2021-06-16 14:40:15 +00:00
Benjamin Pineau 986fe3ae20 Metric for CloudProvider.Refresh() duration
This function can take a variable amount of time due to various
conditions (i.e. many node group changes causing forced refreshes,
cache time-to-live expiries, ...).

Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (i.e. throttling from the cloud provider)
affecting cluster-autoscaler.
2021-05-31 15:55:28 +02:00
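A hedged sketch of this kind of duration metric with the Prometheus Go client; the metric name and the default buckets are assumptions, not necessarily what cluster-autoscaler registers:

```go
package example

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var cloudProviderRefreshDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cluster_autoscaler",
	Name:      "cloud_provider_refresh_duration_seconds",
	Help:      "Time taken by CloudProvider.Refresh().",
})

func init() {
	prometheus.MustRegister(cloudProviderRefreshDuration)
}

// timedRefresh wraps a Refresh call and records how long it took.
func timedRefresh(refresh func() error) error {
	start := time.Now()
	err := refresh()
	cloudProviderRefreshDuration.Observe(time.Since(start).Seconds())
	return err
}
```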
Kubernetes Prow Robot 02985973c6
Merge pull request #4104 from brett-elliott/stopcooldown
Don't start CA in cooldown mode.
2021-05-27 09:12:23 -07:00
Brett Elliott 1880fe6937 Don't start CA in cooldown mode. 2021-05-27 17:53:52 +02:00
Amr Hanafi (MAHDI) 3ac32b817c Update node group min/max on cloud provider refresh 2021-05-20 17:36:51 -07:00
Benjamin Pineau 030a2152b0 Fix templated nodeinfo names collisions in BinpackingNodeEstimator
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great as it fixes a long-standing bug.

Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.

Later, when it looks for fitting nodes for unschedulable pods (when upcoming
nodes don't satisfy those, e.g. FitsAnyNodeMatching failing due to node capacity
or pod anti-affinity), the binpacking estimator will also build virtual
nodes and place them in a snapshot fork to evaluate scheduler predicates.

Those temporary virtual nodes are built using the same pattern (template name
plus an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfo/node
names for node groups having upcoming nodes.

But adding nodes with the same name to an existing cluster snapshot isn't
allowed, so the evaluation attempt will fail.

Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.
2021-05-19 12:05:40 +02:00
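To make the collision concrete, a simplified sketch of the predictable-name pattern and one possible way to disambiguate it; the function names and the random-suffix fix are illustrative assumptions:

```go
package example

import (
	"fmt"
	"math/rand"
)

// virtualNodeName is the predictable pattern: template name plus an ordinal.
// If both the upcoming-nodes path and the binpacking estimator start the
// ordinal at 0 for the same node group, they produce colliding node names,
// and adding the second node with the same name to the snapshot fails.
func virtualNodeName(templateName string, index int) string {
	return fmt.Sprintf("template-node-for-%s-%d", templateName, index)
}

// uniqueVirtualNodeName shows one way around it: append a per-node random
// suffix so generated names stay unique across both code paths.
func uniqueVirtualNodeName(templateName string, index int) string {
	return fmt.Sprintf("template-node-for-%s-%d-%d", templateName, index, rand.Int63())
}
```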
Kubernetes Prow Robot 2beea02a29
Merge pull request #3983 from elmiko/cluster-resource-consumption-metrics
Cluster resource consumption metrics
2021-05-13 15:32:04 -07:00
Bartłomiej Wróblewski 1698e0e583 Separate and refactor custom resources logic 2021-04-07 10:31:11 +00:00
Michael McCune a24ea6c66b add cluster cores and memory bytes count metrics
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
2021-04-06 10:35:21 -04:00
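A sketch of how the CPU pair of these metrics could be recorded with the Prometheus Go client; the label name and help strings are assumptions, and the memory metrics would follow the same shape:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

var (
	cpuLimitsCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cpu_limits_cores",
		Help:      "Minimum and maximum number of cores in the cluster.",
	}, []string{"direction"})

	cpuCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_cpu_current_cores",
		Help:      "Current number of cores in the cluster.",
	})
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, cpuCurrentCores)
}

// recordCPU updates the limit and current-count gauges after each loop.
func recordCPU(minCores, maxCores, currentCores float64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(minCores)
	cpuLimitsCores.WithLabelValues("maximum").Set(maxCores)
	cpuCurrentCores.Set(currentCores)
}
```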
Kubernetes Prow Robot 43ab030969
Merge pull request #3888 from mrak/master
Allow name of cluster-autoscaler status ConfigMap to be specified
2021-03-11 03:22:24 -08:00
Michael McCune 7ecf933e7b add a metric for unregistered nodes removed by cluster autoscaler
This change adds a new metric which counts the number of nodes removed
by the cluster autoscaler due to being unregistered with kubernetes.

User Story

As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
2021-03-04 19:23:03 -05:00
Eric Mrak and Brett Kochendorfer 43dd34074e Allow name of cluster-autoscaler status ConfigMap to be specified
This allows us to run two instances of cluster-autoscaler in our
cluster, targeting two different types of autoscaling groups that
require different command-line settings to be passed.
2021-02-17 21:52:54 +00:00
Kubernetes Prow Robot 1fc6705724
Merge pull request #3690 from evgenii-petrov-arrival/master
Add unremovable_nodes_count metric
2021-02-17 04:13:06 -08:00
Maciek Pytel 9831623810 Set different hostname label for upcoming nodes
The function copying the template node to use for upcoming nodes was
not changing the hostname label, meaning that features relying on
this label (e.g. pod anti-affinity on hostname topology) would
treat all upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods
using such features. The analogous function in binpacking didn't
have the same bug (but it didn't set a unique UID or pod names).
I extracted the functionality to a util function used in both
places to avoid the two functions getting out of sync again.
2021-02-12 19:41:04 +01:00
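A minimal sketch of the shared-helper idea from this commit: each upcoming node copied from the template gets its own name and a matching kubernetes.io/hostname label, so hostname-topology anti-affinity treats the copies as distinct nodes. The helper name and suffix scheme are assumptions:

```go
package example

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
)

func upcomingNodeFromTemplate(template *apiv1.Node, nodeGroup string, index int) *apiv1.Node {
	node := template.DeepCopy()
	node.Name = fmt.Sprintf("%s-upcoming-%d", nodeGroup, index)
	node.UID = "" // the real helper would assign a fresh, unique UID
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	// Without this, every copy keeps the template's hostname label and pod
	// anti-affinity on hostname topology sees all the copies as one node.
	node.Labels["kubernetes.io/hostname"] = node.Name
	return node
}
```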
Evgenii Petrov b6f5d5567d Add unremovable_nodes_count metric 2021-02-12 15:47:34 +00:00
Maciek Pytel 3e42b26a22 Per NodeGroup config for scale-down options
This is the implementation of
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
2021-01-25 11:00:17 +01:00
Kubernetes Prow Robot 58be2b7505
Merge pull request #3649 from ClearTax/cordon-node-issue-3648
Adding functionality to cordon the node before destroying it.
2021-01-14 04:19:04 -08:00
atul 7670d7b6af Adding functionality to cordon the node before destroying it. This helps the load balancer remove the node from its healthy hosts (ALB does have this support).
This won't completely fix the issue of 502s, as the node still has to live for some time after cordoning to serve in-flight requests, but the load balancer can be configured to remove cordoned nodes from its healthy host list.
This feature is enabled by the cordon-node-before-terminating flag, with a default value of false to retain existing behavior.
2021-01-14 17:21:37 +05:30
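A sketch of what cordoning before termination amounts to with client-go; the real code path in cluster-autoscaler differs, this only shows marking the node unschedulable:

```go
package example

import (
	"context"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordonNode marks the node unschedulable before it is terminated, giving the
// load balancer a chance to drop it from the healthy-host list first.
func cordonNode(ctx context.Context, client kubernetes.Interface, node *apiv1.Node) error {
	updated := node.DeepCopy()
	updated.Spec.Unschedulable = true
	_, err := client.CoreV1().Nodes().Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```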
Bartłomiej Wróblewski 0fb897b839 Update imports after scheduler scheduler/framework/v1alpha1 removal 2020-11-30 10:48:52 +00:00
Jakub Tużnik bf18d57871 Remove ScaleDownNodeDeleted status since we no longer delete nodes synchronously 2020-10-01 11:12:45 +02:00
Jakub Tużnik 3958c6645d Add an annotation identifying upcoming nodes 2020-07-24 15:20:34 +02:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Jakub Tużnik 73a5cdf928 Address recent breaking changes in scheduler
The following things changed in scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
2020-04-24 17:54:47 +02:00
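For example, code that previously ranged over pods directly now goes through the PodInfo wrapper; this is a sketch against the post-change scheduler framework API, and the exact import path depends on the Kubernetes version:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// podsOnNode extracts the plain pod objects from a NodeInfo now that
// NodeInfo.Pods is a list of *PodInfo rather than *apiv1.Pod.
func podsOnNode(nodeInfo *schedulerframework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods))
	for _, podInfo := range nodeInfo.Pods {
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```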
Jakub Tużnik 8f1efc9866 Add NodeInfoProcessor for processing nodeInfosForNodeGroups 2020-03-20 15:19:18 +01:00
Łukasz Osipiuk a6023265e7 Add clarifying comment regarding podDestination and scaleDownCandidates variables 2020-03-10 15:18:52 +01:00
Aleksandra Malinowska ce18f7119c change order of arguments for TryToScaleDown 2020-03-10 11:36:57 +01:00
Aleksandra Malinowska 0b7c45e88a stop passing scheduled pods around 2020-03-03 16:23:49 +01:00
Aleksandra Malinowska 572bad61ce use nodes from snapshot in scale down 2020-03-03 16:23:49 +01:00
Aleksandra Malinowska 9c6a0f9aab Filter out expendable pods before initializing snapshot 2020-03-03 12:05:58 +01:00
Kubernetes Prow Robot dbbd4572af
Merge pull request #2861 from aleksandra-malinowska/delta-snapshot-15
Cleanup todo
2020-03-02 05:52:44 -08:00
Aleksandra Malinowska 0c13ce7248 add pods from upcoming nodes to snapshot 2020-02-27 14:12:31 +01:00
Aleksandra Malinowska 7ac3d27cf7 cleanup todo - no op 2020-02-27 11:13:37 +01:00
Julien Balestra 628128f65e cluster-autoscaler/taints: refactor current taint logic into the same package
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2020-02-25 13:57:23 +01:00
Julien Balestra af270b05f6 cluster-autoscaler/taints: ignore taints on existing nodes
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2020-02-25 13:55:17 +01:00
Kubernetes Prow Robot bbeead26ac
Merge pull request #2853 from aleksandra-malinowska/fix-ifs
Cleanup ifs in static autoscaler
2020-02-21 06:23:34 -08:00