autoscaler/cluster-autoscaler

Latest commit 6bd2432894 by Kuba Tużnik (2022-05-27 15:13:44 +02:00):

CA: switch legacy ScaleDown to use the new Actuator

NodeDeletionTracker is now incremented asynchronously for drained nodes, instead of synchronously. This shouldn't change anything in actual behavior, but some tests depended on that, so they had to be adapted.

The switch aims to mostly be a semantic no-op, with the following exceptions:

  • Nodes that fail to be tainted won't be included in NodeDeleteResults, since they are now tainted synchronously.
Name | Last commit | Date
cloudprovider | Merge pull request #4906 from jayantjain93/arch-label | 2022-05-24 06:06:06 -07:00
clusterstate | Expose backoff time parameters | 2022-05-12 15:34:28 +08:00
config | CA: implement Actuator boilerplate + cropping nodes to paralellism budgets | 2022-05-27 14:24:10 +02:00
context | Adding support for Debugging Snapshot | 2021-12-30 09:08:05 +00:00
core | CA: switch legacy ScaleDown to use the new Actuator | 2022-05-27 15:13:44 +02:00
debuggingsnapshot | CA: Debugging snapshotter locking optimisation for better transactions | 2022-01-27 11:36:19 +00:00
estimator | Fix templated nodeinfo names collisions in BinpackingNodeEstimator | 2021-05-19 12:05:40 +02:00
expander | add starter code and readme for grpc expander usage | 2022-02-16 12:36:37 -08:00
hack | Minor bugfix to update-vendor script | 2022-04-07 18:35:34 +02:00
metrics | Limit caching pods per owner reference | 2022-03-15 10:03:04 +01:00
processors | CA: implement the final part of node deletion in Actuator | 2022-05-27 15:13:01 +02:00
proposals | Design proposal for parallel drain | 2022-04-12 15:16:30 +02:00
simulator | Merge pull request #4865 from ahaysx/4745 | 2022-05-09 06:47:19 -07:00
utils | Make NodeDeletionTracker implement ActuationStatus interface | 2022-04-28 17:08:10 +02:00
vendor | CA: implement Actuator boilerplate + cropping nodes to paralellism budgets | 2022-05-27 14:24:10 +02:00
version | Vendor Update to K8s v1.25.0-alpha.0 | 2022-05-05 12:54:55 +00:00
.gitignore | add arch specific cluster-autoscaler targets to gitignore | 2021-03-03 16:05:40 -05:00
Dockerfile.amd64 | Add build support for ARM64 | 2020-11-26 13:20:28 +02:00
Dockerfile.arm64 | Add build support for ARM64 | 2020-11-26 13:20:28 +02:00
FAQ.md | Update FAQ.md | 2022-05-03 00:01:28 +04:00
Makefile | Merge pull request #3863 from elmiko/remove-extra-build-cmd | 2021-03-11 06:24:24 -08:00
OWNERS | Cluster Autoscaler: remove vivekbagade, add towca as an approver in OWNERS | 2021-04-27 16:00:09 +02:00
README.md | Merge pull request #4843 from deitch/cherry-cluster-autoscaler | 2022-05-05 12:00:42 -07:00
cloudbuild.yaml | |
go.mod | CA: implement Actuator boilerplate + cropping nodes to paralellism budgets | 2022-05-27 14:24:10 +02:00
go.sum | bump cloud-provider-azure version in CA | 2022-05-11 12:03:33 -07:00
main.go | CA: implement Actuator boilerplate + cropping nodes to paralellism budgets | 2022-05-27 14:24:10 +02:00
main_test.go | Fix error format strings according to best practices from CodeReviewComments | 2019-01-11 09:10:31 +13:00
push_image.sh | |
update_toc.py | Migrate CA off python2 to python3 | 2022-03-14 12:52:32 +00:00

README.md

Cluster Autoscaler

Introduction

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

  • there are pods that failed to run in the cluster due to insufficient resources.
  • there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.
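
For example, a pod whose resource requests cannot fit on any existing node stays Pending, which triggers a scale-up. A minimal sketch of such a pod (the name, image, and request sizes below are arbitrary illustrations, not taken from this repo):

```yaml
# Hypothetical pod for illustration: if no node has 4 free CPUs,
# it stays Pending and Cluster Autoscaler tries to add a node
# (provided some node group's nodes could fit it).
apiVersion: v1
kind: Pod
metadata:
  name: cpu-hungry-pod        # arbitrary example name
spec:
  containers:
  - name: worker
    image: busybox            # arbitrary example image
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "4"              # deliberately large request
        memory: 2Gi
```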

FAQ/Documentation

An FAQ is available in FAQ.md.

You should also take a look at the notes and "gotchas" for your specific cloud provider, documented in the per-provider READMEs under the cloudprovider directory.

Releases

We recommend using Cluster Autoscaler with the Kubernetes control plane (previously referred to as master) version it was built for. The combinations below have been tested on GCP. We don't do cross-version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters; however, there is always a chance that it won't work as expected.

Starting with Kubernetes 1.12, the Cluster Autoscaler versioning scheme was changed to match Kubernetes minor releases exactly.

Kubernetes Version | CA Version
1.22.X | 1.22.X
1.21.X | 1.21.X
1.20.X | 1.20.X
1.19.X | 1.19.X
1.18.X | 1.18.X
1.17.X | 1.17.X
1.16.X | 1.16.X
1.15.X | 1.15.X
1.14.X | 1.14.X
1.13.X | 1.13.X
1.12.X | 1.12.X
1.11.X | 1.3.X
1.10.X | 1.2.X
1.9.X | 1.1.X
1.8.X | 1.0.X
1.7.X | 0.6.X
1.6.X | 0.5.X, 0.6.X*
1.5.X | 0.4.X
1.4.X | 0.3.X

*Cluster Autoscaler 0.5.X is the official version shipped with k8s 1.6. We've done some basic tests using k8s 1.6 / CA 0.6 and we're not aware of any problems with this setup. However, Cluster Autoscaler internally simulates Kubernetes' scheduler and using different versions of scheduler code can lead to subtle issues.

Notable changes

For CA 1.1.2 and later, please check release notes.

CA version 1.1.1:

  • Fixes around metrics in configurations with multiple kube-apiservers.
  • Fixes for unready-node issues when quota is overrun.

CA version 1.1.0:

CA version 1.0.3:

  • Adds support for the safe-to-evict annotation on pods (see the sketch after this list). Pods with this annotation can be evicted even if they don't meet the other requirements for eviction.
  • Fixes an issue where too many nodes with GPUs could be added during scale-up (https://github.com/kubernetes/kubernetes/issues/54959).
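
A minimal sketch of the safe-to-evict annotation on a pod (the annotation key is the real one; the pod name and image are arbitrary examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: evictable-pod         # arbitrary example name
  annotations:
    # Marks the pod as safe for Cluster Autoscaler to evict during scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
  - name: app
    image: nginx              # arbitrary example image
```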

CA Version 1.0.2:

CA Version 1.0.1:

CA Version 1.0:

With this release we graduated Cluster Autoscaler to GA.

  • Support for 1000 nodes running 30 pods each. See: Scalability testing report
  • Support for 10-minute graceful termination.
  • Improved eventing and monitoring.
  • Node allocatable support.
  • Removed Azure support. See: PR removing support with reasoning behind this decision
  • cluster-autoscaler.kubernetes.io/scale-down-disabled annotation for marking nodes that should not be scaled down (see the sketch after this list).
  • scale-down-delay-after-delete and scale-down-delay-after-failure flags replaced scale-down-trial-interval.
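
For reference, the scale-down-disabled annotation is set on the Node object; a minimal sketch (the node name is an arbitrary example):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-node-1        # arbitrary example name
  annotations:
    # Tells Cluster Autoscaler never to remove this node during scale-down.
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
```

In practice the annotation is usually applied to an existing node, e.g. with kubectl annotate nodes example-node-1 cluster-autoscaler.kubernetes.io/scale-down-disabled=true.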

CA Version 0.6:

CA Version 0.5.4:

  • Fixes problems with node drain when pods are ignoring SIGTERM.

CA Version 0.5.3:

CA Version 0.5.2:

CA Version 0.5.1:

CA Version 0.5:

  • CA continues to operate even if some nodes are unready and is able to scale them down.
  • CA exports its status to kube-system/cluster-autoscaler-status config map.
  • CA respects PodDisruptionBudgets.
  • Azure support.
  • Alpha support for dynamic config changes.
  • Multiple expanders to decide which node group to scale up.

CA Version 0.4:

  • Bulk empty node deletions.
  • Better scale-up estimator based on binpacking.
  • Improved logging.

CA Version 0.3:

  • AWS support.
  • Performance improvements around scale down.

Deployment

Cluster Autoscaler is designed to run on a Kubernetes control plane (previously referred to as master) node. This is the default deployment strategy on GCP. It is possible to run a customized deployment of Cluster Autoscaler on worker nodes, but extra care needs to be taken to ensure that Cluster Autoscaler remains up and running. Users can put it in the kube-system namespace (Cluster Autoscaler doesn't scale down nodes with non-mirrored kube-system pods running on them) and set the priorityClassName: system-cluster-critical property on the pod spec (to prevent the pod from being evicted).
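
A rough sketch of what such a deployment can look like, assuming a GCE cluster; the image tag, labels, and flag values below are placeholders rather than a recommended configuration (--cloud-provider and --expander are real Cluster Autoscaler flags):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system                 # see the note above about kube-system pods and scale-down
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler            # placeholder label
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      priorityClassName: system-cluster-critical   # prevents the pod from being evicted
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.0   # placeholder version tag
        command:
        - ./cluster-autoscaler
        - --cloud-provider=gce           # assumption: GCE; set this for your provider
        - --expander=least-waste         # optional: pick a node-group expander
```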

Supported cloud providers: