Commit Graph

166 Commits

Author SHA1 Message Date
Damir Markovic 11d150e920 Add podScaleUpDelay annotation support 2022-09-05 20:24:19 +02:00
James Ravn 1b98b3823a
Allow balancing by labels exclusively
Adds a new flag `--balance-label` which allows users to balance between
node groups exclusively via labels.

This gives users the flexibility to specify the similarity logic
themselves when --balance-similar-node-groups is in use.
2022-07-06 10:34:18 +01:00
Maciek Pytel ab891418f6 Limit binpacking based on #new_nodes or time
The binpacking algorithm is O(#pending_pods * #new_nodes) and
calculating a very large scale-up can get stuck for minutes or even
hours, leading to CA failing its health check and going down.
The new limiting prevents this scenario by stopping binpacking after
reaching the specified threshold. Any pods that remain pending as a result
of shorter binpacking will be processed in the next autoscaler loop.

The thresholds used can be controlled with newly introduced flags:
--max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The
limiting can be disabled by setting both flags to 0 (not recommended,
especially for --max-nodegroup-binpacking-duration).
2022-06-20 17:02:51 +02:00
Michael McCune 8c27f76933 add a flag to allow event duplication
this change brings in a new command line flag,
`--record-duplicated-events`, which allows a user to enable the
duplication of events bypassing the 5 minute de-duplication window.
2022-06-03 14:26:38 -04:00
Yaroslava Serdiuk d919ce3fbf Define AnnotationNodeInfoProvider processor 2022-06-03 16:12:16 +00:00
Yaroslava Serdiuk 7fe27ddf99 GCE: Add --gce-expander-ephemeral-storage-support flag 2022-06-03 16:12:09 +00:00
Kuba Tużnik 7dc0d4f57c CA: implement Actuator boilerplate + cropping nodes to parallelism budgets 2022-05-27 14:24:10 +02:00
weidongcai 03a0475502 Expose backoff time parameters 2022-05-12 15:34:28 +08:00
Grigoris Thanasoulas 719a53e8d7 cluster-autoscaler: Add --max-pod-eviction-time flag
Add a flag to allow the user to configure MaxPodEvictionTime to values
other than the default 2m. This is needed in cases where a pod takes more
than 2 minutes to be evicted.

Signed-off-by: Grigoris Thanasoulas <gregth@arrikto.com>
2022-04-30 08:52:41 +03:00
Daniel Kłobuszewski e07fd1e130 Move filter_out_schedulable to a separate package 2022-04-26 08:48:45 +02:00
Kubernetes Prow Robot 0123869b7a
Merge pull request #4452 from airbnb/es--grpc-expander-plugin
Add gRPC expander plugin
2022-02-21 06:54:14 -08:00
Evan Sheng 4504f55485 Add grpc expander and tests 2022-02-16 12:34:06 -08:00
Yaroslava Serdiuk a9a7d98f2c Add expire time for nodeInfo cache items 2022-02-09 09:38:32 +00:00
Jayant Jain 729038ff2d Adding support for Debugging Snapshot 2021-12-30 09:08:05 +00:00
ialidzhikov 986d62fb96 Add `--feature-gates` flag to support scale up on volume limits (CSI migration enabled)
Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>
2021-12-19 15:38:17 +02:00
Diego Bonfigli 1b4fcf6bf7 Re-add default expander 2021-12-09 18:27:46 +01:00
Michael McCune 99a242a9e6 add ClusterAPI nodegroupset processor
This allows the ClusterAPI provider to ignore the
`topology.ebs.csi.aws.com/zone` label by adding a custom nodegroupset
processor. It also adds unit tests to exercise the new processor.
2021-11-10 17:01:27 -05:00
Ryan McNamara 068af5bf7e Allow specification of multiple expanders
Multiple expanders can now be specified. Expanders now "filter to the
tied for best" instead of "selecting the best", so the output of one
expander is fed to the input of the next. Each expander may only
be used once to disallow bad configuration. This should not be a change
in functionality, as in the event of a tie the random expander is still
used.
2021-09-23 14:31:39 -06:00
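
The "filter to the tied for best" chaining from commit 068af5bf7e can be sketched in Go as follows. This is an illustration under assumptions, not the project's implementation: the `option` struct, `keepMin`, and `chain` are hypothetical names, and the final random pick among remaining ties is left out.

```go
package main

import "fmt"

// option is a hypothetical scale-up candidate used only for this sketch.
type option struct {
	name  string
	price int
	waste int
}

// filter narrows a list of options to those it considers tied for best.
type filter func([]option) []option

// keepMin builds a filter that keeps every option tied for the lowest score.
func keepMin(score func(option) int) filter {
	return func(opts []option) []option {
		if len(opts) == 0 {
			return nil
		}
		best := score(opts[0])
		for _, o := range opts[1:] {
			if s := score(o); s < best {
				best = s
			}
		}
		var out []option
		for _, o := range opts {
			if score(o) == best {
				out = append(out, o)
			}
		}
		return out
	}
}

// chain feeds the output of each filter into the next one and stops early
// once at most one option is left.
func chain(opts []option, filters ...filter) []option {
	for _, f := range filters {
		if len(opts) <= 1 {
			break
		}
		opts = f(opts)
	}
	return opts
}

func main() {
	opts := []option{{"a", 1, 5}, {"b", 1, 2}, {"c", 3, 0}}
	leastPrice := keepMin(func(o option) int { return o.price })
	leastWaste := keepMin(func(o option) int { return o.waste })
	fmt.Println(chain(opts, leastPrice, leastWaste))
}
```

Here the price filter leaves `a` and `b` tied, and the waste filter then narrows the result to `b`. If several options survived every filter, a final random choice would be needed, as the commit message notes.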
Kubernetes Prow Robot 9f84d391f6
Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics
[cluster-autoscaler] Publish node group min/max metrics
2021-07-05 07:38:54 -07:00
Daniel Kłobuszewski 081c4664d3 Add a flag to control DaemonSet eviction on non-empty nodes 2021-06-25 11:06:10 +02:00
Amr Hanafi (MAHDI)) f5c2ab7328 Emit the node group metrics behind a flag 2021-05-20 16:49:39 -07:00
Kubernetes Prow Robot 2beea02a29
Merge pull request #3983 from elmiko/cluster-resource-consumption-metrics
Cluster resource consumption metrics
2021-05-13 15:32:04 -07:00
Kubernetes Prow Robot 200415e990
Merge pull request #3940 from mcristina422/patch-1
Release leader election lock on shutdown
2021-05-04 07:21:11 -07:00
Brett Elliott 3b48a3193f Set cluster autoscaler-specific user agent.
Refactored mocks to remove redundancy.
2021-04-06 17:49:35 +02:00
Michael McCune a24ea6c66b add cluster cores and memory bytes count metrics
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
2021-04-06 10:35:21 -04:00
Michael Cristina 4cf9a98679
Release leader election lock on shutdown 2021-03-12 12:51:03 -06:00
Eric Mrak and Brett Kochendorfer 43dd34074e Allow name of cluster-autoscaler status ConfigMap to be specified
This allows us to run two instances of cluster-autoscaler in our
cluster, targeting two different types of autoscaling groups that
require different command-line settings to be passed.
2021-02-17 21:52:54 +00:00
Kubernetes Prow Robot b470c62bfa
Merge pull request #3630 from marc-sensenich/configurable-leader-election-resource-lock-name
Allow for the leader election resourcelock to have a configurable name
2021-01-27 04:59:40 -08:00
Maciek Pytel 65b3c8d3cc Rename default options to NodeGroupDefaults 2021-01-25 13:21:30 +01:00
Maciek Pytel 3e42b26a22 Per NodeGroup config for scale-down options
This is the implementation of
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
2021-01-25 11:00:17 +01:00
Maciek Pytel 08d18a7bd0 Define interfaces for per NodeGroup config.
This is the first step of implementing
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
New method was added to cloudprovider interface. All existing providers
were updated with a no-op stub implementation that will result in no
behavior change.
The config values specified per NodeGroup are not yet applied.
2021-01-25 11:00:16 +01:00
Yaroslava Serdiuk 7068bc48f6 add DaemonSet eviction option for empty nodes 2021-01-20 18:58:16 +00:00
Kubernetes Prow Robot bb3977764b
Merge pull request #3704 from DataDog/gcp-faster-startup
gcp: faster startup and refreshes with many MIGs
2021-01-15 06:11:51 -08:00
atul 7670d7b6af Adding functionality to cordon the node before destroying it. This helps the load balancer remove the node from its healthy hosts (ALB supports this).
This won't fix the 502 issue completely, since the node has to stay alive for some time after cordoning to serve in-flight requests, but the load balancer can be configured to remove cordoned nodes from its healthy host list.
This feature is enabled by the cordon-node-before-terminating flag, which defaults to false to retain existing behavior.
2021-01-14 17:21:37 +05:30
Benjamin Pineau 087df8951d gcp: faster startup and refreshes with many MIGs
With 1.5k MIGs attached to a cluster, cluster-autoscaler needs about
40 min to start. Refreshing MIGs+ITs concurrently brings that down to
about 5 min.

While bulk GCE API calls (triggered at startup and on Refresh() calls)
and a few stateless functions (called by GetMigInstanceTemplate) become
concurrent, cache accesses remain lock-protected. To that effect:
* Set RegenerateInstancesCache to run RegenerateInstanceCacheForMig in parallel
  (slightly adapted so the slow GceService.FetchMigInstances call isn't locked)
* Set fetchAutoMigs to run registerMig in parallel: rework GetMigInstanceTemplate
  so the slow InstanceGroupManagers.Get+InstanceTemplates.Get calls aren't locked

Tested on a large k8s cluster (> 1k MIGs) with intense scaling activity,
and tested on live clusters with "go build -race" cluster-autoscaler builds.
2020-12-24 11:33:49 +01:00
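
The pattern described in commit 087df8951d (slow fetches run concurrently, cache writes stay behind a lock) might look roughly like the Go sketch below. All names here (`migCache`, `fetchInstances`, `regenerate`) are hypothetical stand-ins, and `fetchInstances` fakes the slow GCE API call. This is not the provider's real code.

```go
package main

import (
	"fmt"
	"sync"
)

// migCache is a hypothetical cache keyed by MIG name.
type migCache struct {
	mu        sync.Mutex
	instances map[string][]string
}

// fetchInstances stands in for a slow cloud API call; it returns fake data.
func fetchInstances(mig string) []string {
	return []string{mig + "-0", mig + "-1"}
}

// regenerate refreshes every MIG concurrently. The slow fetch runs without
// holding the lock; only the cache write is lock-protected.
func (c *migCache) regenerate(migs []string) {
	var wg sync.WaitGroup
	for _, mig := range migs {
		wg.Add(1)
		go func(mig string) {
			defer wg.Done()
			found := fetchInstances(mig) // slow call, lock not held
			c.mu.Lock()
			c.instances[mig] = found
			c.mu.Unlock()
		}(mig)
	}
	wg.Wait()
}

func main() {
	c := &migCache{instances: map[string][]string{}}
	c.regenerate([]string{"mig-a", "mig-b", "mig-c"})
	fmt.Println(len(c.instances))
}
```

Keeping the lock scope to the map write is what lets the fetches overlap. Building the sketch with `go build -race`, as the commit message does for the real binary, is a reasonable way to check that no access escapes the lock.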
Marc Sensenich 2b402b5670 Allow for the leader election resourcelock to have a configurable name 2020-10-19 08:53:29 -04:00
Benjamin Pineau bfd6fe7fed Ignore topology.gke.io/zone when comparing groups
Commit bb2eed1cff introduced a new `topology.gke.io/zone` label to
GCE node templates, for CSI needs.

That label holds the zone name, making nodeInfo templates dissimilar
for groups belonging to different zones. The CA otherwise tries to
ignore those zonal labels (i.e. it ignores the standard LabelZoneRegion
and LabelZoneFailureDomain) when it looks for nodegroup similarities.
2020-10-12 15:14:21 +02:00
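
To illustrate why ignoring a zonal label matters in commit bfd6fe7fed, here is a small Go sketch of a label comparison that skips a set of ignored keys. The function `labelsSimilar` and the sample `pool` label are hypothetical. The autoscaler's real comparator also checks other node properties, which this sketch leaves out.

```go
package main

import "fmt"

// labelsSimilar reports whether two label sets match once the ignored keys
// are left out of the comparison. Hypothetical helper for illustration.
func labelsSimilar(a, b map[string]string, ignored map[string]bool) bool {
	for k, v := range a {
		if ignored[k] {
			continue
		}
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	for k := range b {
		if ignored[k] {
			continue
		}
		if _, ok := a[k]; !ok {
			return false
		}
	}
	return true
}

func main() {
	a := map[string]string{"pool": "workers", "topology.gke.io/zone": "europe-west1-b"}
	b := map[string]string{"pool": "workers", "topology.gke.io/zone": "europe-west1-c"}
	ignored := map[string]bool{"topology.gke.io/zone": true}

	fmt.Println(labelsSimilar(a, b, ignored)) // zone label skipped
	fmt.Println(labelsSimilar(a, b, nil))     // zone label compared
}
```

With the zone label ignored, the two groups compare as similar. Without it, they differ only because they sit in different zones, which is the situation the commit addresses.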
Jason DeTiberus 150dbdeb64
[cluster-autoscaler] Support using --cloud-config for clusterapi provider
- Leverage --cloud-config to allow for providing a separate kubeconfig for Cluster API management and workload cluster resources
- Allow for fallback to previous behavior when --cloud-config is not specified for backward compatibility
- Provides a --clusterapi-cloud-config-authoritative flag to disable the above fallback behavior and allow for both the management and workload cluster clients to use the in-cluster config
2020-09-21 10:38:06 -04:00
M. Habib Rosyad b7e02047f7 expose max-nodes-total as a metric 2020-08-19 17:43:39 +07:00
Maciek Pytel 3c7727a603 Fixes after vendor update 2020-06-05 17:22:26 +02:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Jakub Tużnik c65a6bb4ea Scalability: Switch to deltaClusterSnapshot
This massively speeds up checking predicates.
2020-04-14 15:27:55 +02:00
Adam Malcontenti-Wilson 8313e969c7 Add support for passing in custom ignore labels 2020-03-17 14:30:03 +11:00
Adam Malcontenti-Wilson 5476125063 Use builder methods to create NodeInfoComparator functions 2020-03-17 13:51:15 +11:00
Andrew McDermott 1efc258b3c config/options: add KubeConfigPath
Access to this is required by cloudprovider/clusterapi.
2020-03-10 10:27:34 +00:00
Aleksandra Malinowska 70ef92a12a Fixes in CA for vendor update 2020-02-13 15:28:29 +01:00
Łukasz Osipiuk dd9fe48f46 Remove filterOutSchedulableSimple 2020-01-29 13:11:38 +01:00
Łukasz Osipiuk a5abeaf94c Add /debug/pprof http handler 2020-01-28 11:13:29 +01:00
Łukasz Osipiuk b4c8bbb12c Fixes around metrics/ handler 2019-11-22 14:07:10 +01:00
Colin Murphy 7f0a42b023 Add additional AWS labels.
Whitelist additional node labels for AWS CNI custom networking and
EC2 lifecycle.

Move AWS ignored node labels to AWS specific file.
2019-10-25 17:17:02 -04:00