Commit Graph

112 Commits

Author SHA1 Message Date
Bartłomiej Wróblewski 0470fdfc35 Clean up DS utils: remove unused cluster snapshot and predicate checker 2023-01-23 14:14:53 +00:00
Kubernetes Prow Robot f507519916
Merge pull request #5423 from yaroslava-serdiuk/sd-sorting
Add scale down candidates observer
2023-01-19 10:14:16 -08:00
Yaroslava Serdiuk 541ce04e4b Add previous scale down candidate sorting 2023-01-19 16:04:50 +00:00
Yaroslava Serdiuk 97159df69b Add scale down candidates observer 2023-01-19 16:04:42 +00:00
michael mccune 955396e857 remove clusterapi nodegroupset processor
As discussed with the Cluster API community[0], the nodegroupset
processor is being removed from the clusterapi provider implementation
in favor of instructing our community on the use of the
`--balancing-ignore-label` flag. Due to the wide variety of provider
infrastructures that clusterapi can be deployed on, we would prefer
not to encode all of these labels in the autoscaler itself. See the
linked recording for more information.

[0] https://www.youtube.com/watch?v=jbhca_9oPuQ
2023-01-12 15:05:37 -05:00
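
For readers landing here from that deprecation: the flag works by stripping the listed labels from both node templates before comparing them. A minimal sketch of that idea, with hypothetical names rather than the autoscaler's actual comparator API:

```go
// Label comparison with an ignore list, in the spirit of
// --balancing-ignore-label. Illustrative only.
package main

import "fmt"

// labelsMatch reports whether two label sets are identical once the
// ignored keys are removed from both sides.
func labelsMatch(a, b map[string]string, ignored map[string]bool) bool {
	filter := func(in map[string]string) map[string]string {
		out := map[string]string{}
		for k, v := range in {
			if !ignored[k] {
				out[k] = v
			}
		}
		return out
	}
	fa, fb := filter(a), filter(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if fb[k] != v {
			return false
		}
	}
	return true
}

func main() {
	ignored := map[string]bool{"topology.ebs.csi.aws.com/zone": true}
	a := map[string]string{"node.kubernetes.io/instance-type": "m5.xlarge", "topology.ebs.csi.aws.com/zone": "us-east-1a"}
	b := map[string]string{"node.kubernetes.io/instance-type": "m5.xlarge", "topology.ebs.csi.aws.com/zone": "us-east-1b"}
	fmt.Println(labelsMatch(a, b, ignored)) // true: only the ignored zonal label differs
}
```
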
bsoghigian 0f8ed0b81f Configurable difference ratios 2023-01-09 22:40:16 -08:00
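
This commit turns previously hard-coded similarity tolerances into tunables. A sketch of what wiring such ratios to flags could look like — the flag names, field names, and defaults below are illustrative placeholders, not the autoscaler's real ones:

```go
// Sketch: configurable difference ratios instead of hard-coded constants.
package main

import (
	"flag"
	"fmt"
)

// DifferenceRatios collects the per-resource tolerances used when
// deciding whether two node groups are similar enough to balance.
type DifferenceRatios struct {
	MaxCapacityMemoryRatio float64
	MaxAllocatableRatio    float64
	MaxFreeRatio           float64
}

func main() {
	ratios := DifferenceRatios{}
	flag.Float64Var(&ratios.MaxCapacityMemoryRatio, "memory-difference-ratio", 0.015,
		"Maximum relative difference in memory capacity between similar node groups")
	flag.Float64Var(&ratios.MaxAllocatableRatio, "allocatable-difference-ratio", 0.05,
		"Maximum relative difference in allocatable resources")
	flag.Float64Var(&ratios.MaxFreeRatio, "free-difference-ratio", 0.05,
		"Maximum relative difference in free resources")
	flag.Parse()
	fmt.Printf("%+v\n", ratios)
}
```
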
Kubernetes Prow Robot d9ffb8f5ce
Merge pull request #5317 from grosser/grosser/ref2
cluster-autoscaler: refactor BalanceScaleUpBetweenGroups
2022-12-19 00:49:44 -08:00
Kubernetes Prow Robot bc483274e4
Merge pull request #5325 from x13n/master
Log node group min and current size when skipping scale down
2022-11-24 02:24:03 -08:00
Daniel Kłobuszewski d9100cd707 Log node group min and current size when skipping scale down 2022-11-23 13:23:07 +01:00
Michael Grosser 62f29d23af
cluster-autoscaler: refactor BalanceScaleUpBetweenGroups 2022-11-15 13:21:29 -08:00
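
The balancing logic referenced above spreads a scale-up across similar node groups so their sizes end up as even as possible. A self-contained sketch of that strategy, under assumed types (`group` and `balanceScaleUp` are illustrative, not the refactored code):

```go
// Sketch of balancing a scale-up: repeatedly grow the currently
// smallest group until the requested nodes are placed or every
// group hits its max size.
package main

import "fmt"

type group struct {
	name    string
	size    int
	maxSize int
}

func balanceScaleUp(groups []group, newNodes int) []group {
	for newNodes > 0 {
		smallest := -1
		for i, g := range groups {
			if g.size >= g.maxSize {
				continue // group is at capacity
			}
			if smallest == -1 || g.size < groups[smallest].size {
				smallest = i
			}
		}
		if smallest == -1 {
			break // nowhere left to scale
		}
		groups[smallest].size++
		newNodes--
	}
	return groups
}

func main() {
	groups := []group{{"a", 3, 10}, {"b", 1, 10}, {"c", 2, 4}}
	fmt.Println(balanceScaleUp(groups, 5)) // sizes even out: [{a 4 10} {b 4 10} {c 3 4}]
}
```
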
Bartłomiej Wróblewski 4373c467fe Add ScaleDown.Actuator to AutoscalingContext 2022-11-02 13:12:25 +00:00
Daniel Kłobuszewski 18f2e67c4f Split out code from simulator package 2022-10-18 11:51:44 +02:00
Flavian f1b6d4ded6 handle DirectX nodes the same as GPU nodes 2022-09-23 09:55:14 +02:00
Michael McCune ba9c164463 update clusterapi nodegroups processor
This change adds labels that are used on Alibaba Cloud and IBM Cloud for
CSI and CCM.
2022-08-18 15:55:35 -04:00
Yaroslava Serdiuk 887e16c3fc CA: Iterate through existing node groups in AnnotationNodeInfoProvider 2022-08-09 12:28:28 +00:00
James Ravn 1b98b3823a
Allow balancing by labels exclusively
Adds a new flag `--balance-label` which allows users to balance between
node groups exclusively via labels.

This gives users the flexibility to specify the similarity logic
themselves when `--balance-similar-node-groups` is in use.
2022-07-06 10:34:18 +01:00
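
A minimal sketch of label-exclusive similarity as described: two groups count as similar iff they agree on every user-chosen balance label, and nothing else is compared. The `similarByLabels` helper is hypothetical:

```go
package main

import "fmt"

// similarByLabels ignores all other node properties and compares only
// the user-supplied balance labels.
func similarByLabels(a, b map[string]string, balanceLabels []string) bool {
	for _, l := range balanceLabels {
		if a[l] != b[l] {
			return false
		}
	}
	return true
}

func main() {
	a := map[string]string{"pool": "workers", "zone": "us-east-1a"}
	b := map[string]string{"pool": "workers", "zone": "us-east-1b"}
	// Only the "pool" label decides similarity; the zone difference doesn't matter.
	fmt.Println(similarByLabels(a, b, []string{"pool"})) // true
}
```
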
Yaroslava Serdiuk 466052aeb4 Add nodeTemplate annotations to node annotations 2022-06-03 16:12:16 +00:00
Yaroslava Serdiuk d919ce3fbf Define AnnotationNodeInfoProvider processor 2022-06-03 16:12:16 +00:00
Kuba Tużnik b228f789dd CA: implement the final part of node deletion in Actuator 2022-05-27 15:13:01 +02:00
Daniel Kłobuszewski c550b77020 Make NodeDeletionTracker implement ActuationStatus interface 2022-04-28 17:08:10 +02:00
Daniel Kłobuszewski 627284bdae Remove direct access to ScaleDown fields 2022-04-26 08:48:45 +02:00
Daniel Kłobuszewski 358f3a9218 Extract utilization info to a separate package 2022-04-26 08:48:45 +02:00
Kubernetes Prow Robot 3e53cc4b8d
Merge pull request #4674 from x13n/nodestatus
Expose nodes with unready GPU in CA status
2022-03-03 06:17:48 -08:00
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as the Kubernetes API is concerned, but CA will still
report some of them as unready if they are missing GPU resources.
Explicitly calling them out in the status ConfigMap will point in the
right direction.
2022-03-03 14:59:31 +01:00
Yaroslava Serdiuk a9a7d98f2c Add expire time for nodeInfo cache items 2022-02-09 09:38:32 +00:00
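
A sketch of the caching idea above: entries carry an insertion timestamp and are treated as misses once a TTL elapses. The structure and names are illustrative, not the actual nodeInfo cache:

```go
package main

import (
	"fmt"
	"time"
)

type cacheItem struct {
	value   string
	addedAt time.Time
}

type ttlCache struct {
	ttl   time.Duration
	items map[string]cacheItem
}

// get returns a cached value only while it is younger than the TTL,
// dropping stale entries on access.
func (c *ttlCache) get(key string, now time.Time) (string, bool) {
	item, ok := c.items[key]
	if !ok || now.Sub(item.addedAt) > c.ttl {
		delete(c.items, key)
		return "", false
	}
	return item.value, true
}

func main() {
	c := &ttlCache{ttl: 10 * time.Minute, items: map[string]cacheItem{}}
	c.items["ng-1"] = cacheItem{value: "nodeInfo-template", addedAt: time.Now()}
	v, ok := c.get("ng-1", time.Now())
	fmt.Println(v, ok) // fresh entry: hit
	_, ok = c.get("ng-1", time.Now().Add(11*time.Minute))
	fmt.Println(ok) // expired entry: miss
}
```
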
Daniel Kłobuszewski 9944137fae Don't cache NodeInfo for recently Ready nodes
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting
nodes into the cache. One minute should be more than enough, except for
some pathological edge cases.
2022-01-26 20:18:53 +01:00
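
The grace-period check itself is simple; a sketch assuming a hypothetical `eligibleForCache` helper (the one-minute constant follows the commit message, the rest is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

const cacheGracePeriod = time.Minute

// eligibleForCache reports whether a node has been Ready long enough
// for DaemonSet pods to have landed on it.
func eligibleForCache(readySince, now time.Time) bool {
	return now.Sub(readySince) >= cacheGracePeriod
}

func main() {
	readySince := time.Now().Add(-30 * time.Second)
	fmt.Println(eligibleForCache(readySince, time.Now()))                  // false: too recent
	fmt.Println(eligibleForCache(readySince, time.Now().Add(time.Minute))) // true
}
```
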
Daniel Gutowski a230b47fec Add AutoscalingContext to the scale-down post-processor 2022-01-18 07:58:53 +00:00
Daniel Gutowski 8064d6d1fd Introduce the scale down processor that picks the final scale down candidates. 2022-01-03 16:05:36 +00:00
Marwan Ahmed 26569925db ignore Azure CSI topology label for similarity checks and populate it for scale from zero 2021-12-21 20:44:49 +02:00
Michael McCune 99a242a9e6 add ClusterAPI nodegroupset processor
This allows the ClusterAPI provider to ignore the
`topology.ebs.csi.aws.com/zone` label by adding a custom nodegroupset
processor. It also adds unit tests to exercise the new processor.
2021-11-10 17:01:27 -05:00
Michael McCune 828663e97a add topology.ebs.csi.aws.com/zone label to aws nodegroupset processor
This change adds the aforementioned label to the list of ignored labels
in the AWS nodegroupset processor, in response to the addition of the
label by the aws-ebs-csi-driver. The label will eventually be deprecated
by the driver, but in the meantime its presence prevents AWS users from
properly balancing similar nodes. Also adds a unit test for the AWS
processor.

ref: https://github.com/kubernetes/autoscaler/issues/3230
ref: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/729
2021-11-10 17:01:08 -05:00
Marwan Ahmed f318400c9e add recent AKS agentpool label to the ignored labels for similarity checks 2021-10-25 14:18:06 -07:00
Kubernetes Prow Robot c7c14381f5
Merge pull request #4391 from jayantjain93/scale-from-0-processer
Introduce Empty Cluster Processor
2021-10-13 06:59:51 -07:00
Jayant Jain da5ff3d971 Introduce Empty Cluster Processor
This refactors CA's handling of cases where the cluster is empty or not ready into a processor in empty_cluster_processor.go
2021-10-13 13:30:30 +00:00
Aleksandra Gacek b5677acc80 Extend ScaleUpStatus with node groups that failed scale up. 2021-10-13 12:53:43 +02:00
Yaroslava Serdiuk 511d47a6f2 Add descriptive log for pre_filtering_processor 2021-10-06 14:41:43 +00:00
Maciek Pytel a0109324a2 Change parameter order of TemplateNodeInfoProvider
Every other processor (and, I think, every function in CA?) that takes
AutoscalingContext has it as its first parameter. Changing the new
processor for consistency.
2021-09-13 15:08:14 +02:00
Benjamin Pineau 8485cf2052 Move GetNodeInfosForGroups to its own processor
Supports providing different NodeInfos sources (either upstream or in
local forks, e.g. to properly implement variants like in #4000).

This also moves a large and specialized chunk of code out of core, and removes
the need to maintain and pass the GetNodeInfosForGroups() cache from the side,
as processors can hold their state themselves.

No functional changes to GetNodeInfosForGroups() beyond mechanical changes
due to the move: calling a few util functions from the core/utils package,
picking attributes off the context (the processor takes the context as an
argument rather than ListerRegistry + PredicateChecker + CloudProvider), and
using the built-in cache rather than receiving it through arguments.
2021-08-16 19:43:10 +02:00
Aleksandra Gacek b194c6f252 Extend ScaleUpStatus structure with ScaleUpError field. 2021-08-12 10:40:58 +02:00
Brett Elliott 5cf64a2b3c Update vendor to v1.22.0-alpha.1 2021-05-20 22:02:41 +02:00
Bartłomiej Wróblewski 1698e0e583 Separate and refactor custom resources logic 2021-04-07 10:31:11 +00:00
Maciek Pytel 65b3c8d3cc Rename default options to NodeGroupDefaults 2021-01-25 13:21:30 +01:00
Maciek Pytel 3e42b26a22 Per NodeGroup config for scale-down options
This is the implementation of
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
2021-01-25 11:00:17 +01:00
Maciek Pytel 08d18a7bd0 Define interfaces for per NodeGroup config.
This is the first step of implementing
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
A new method was added to the cloudprovider interface. All existing
providers were updated with a no-op stub implementation that results in
no behavior change.
The config values specified per NodeGroup are not yet applied.
2021-01-25 11:00:16 +01:00
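
A sketch of the interface shape described above: node groups may return per-group options overriding the global defaults, and a nil result keeps the defaults — matching the no-op stubs mentioned in the message. Types and names here are illustrative, not the actual cloudprovider API:

```go
package main

import "fmt"

type NodeGroupOptions struct {
	ScaleDownUtilizationThreshold float64
	ScaleDownUnneededMinutes      int
}

// A NodeGroup can override global defaults; returning nil means
// "use the defaults unchanged".
type NodeGroup interface {
	GetOptions(defaults NodeGroupOptions) (*NodeGroupOptions, error)
}

type stubGroup struct{}

// The no-op stub every provider started with: no per-group overrides.
func (stubGroup) GetOptions(defaults NodeGroupOptions) (*NodeGroupOptions, error) {
	return nil, nil
}

// optionsFor resolves the effective options for a node group.
func optionsFor(ng NodeGroup, defaults NodeGroupOptions) NodeGroupOptions {
	if opts, err := ng.GetOptions(defaults); err == nil && opts != nil {
		return *opts
	}
	return defaults
}

func main() {
	defaults := NodeGroupOptions{ScaleDownUtilizationThreshold: 0.5, ScaleDownUnneededMinutes: 10}
	fmt.Printf("%+v\n", optionsFor(stubGroup{}, defaults))
}
```
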
Bartłomiej Wróblewski 0fb897b839 Update imports after scheduler/framework/v1alpha1 removal 2020-11-30 10:48:52 +00:00
AutoExtractor 20d7bc36d2 Improve error message by removing a confusing statement 2020-11-25 22:26:55 +01:00
Benjamin Pineau bfd6fe7fed Ignore topology.gke.io/zone when comparing groups
Commit bb2eed1cff introduced a new `topology.gke.io/zone` label on
GCE node templates, for CSI needs.

That label holds the zone name, making nodeInfo templates dissimilar
for groups belonging to different zones. The CA otherwise tries to
ignore such zonal labels (i.e. it ignores the standard LabelZoneRegion
and LabelZoneFailureDomain labels) when it looks for node group similarities.
2020-10-12 15:14:21 +02:00
Jakub Tużnik bf18d57871 Remove ScaleDownNodeDeleted status since we no longer delete nodes synchronously 2020-10-01 11:12:45 +02:00
Kubernetes Prow Robot 67dce2e824
Merge pull request #3124 from JoelSpeed/memory-tolerance-quantity
Allow small tolerance on memory capacity when comparing nodegroups
2020-06-24 04:25:17 -07:00
Joel Speed be1d9cb8d6
Allow 1.5% tolerance in memory capacity when comparing nodegroups
In testing, AWS M5 instances can on occasion display approximately a 1% difference
in memory capacity between availability zones, even when deployed with the same
launch configuration and the same AMI.
Allow a 1.5% tolerance to give some buffer over the actual memory discrepancy,
since in testing some examples were just over 1% (e.g. 1.05%, 1.1%).
Tests are included with capacity values taken from real instances to prevent
future regression.
2020-06-10 12:00:39 +01:00
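
The tolerance check reduces to comparing the relative difference against a ratio. A sketch with memory capacities of the same order as real M5 instances (the values and helper below are illustrative, not the committed test data):

```go
package main

import "fmt"

const maxMemoryDifferenceRatio = 0.015 // the 1.5% tolerance from the commit

// withinMemoryTolerance reports whether two capacities differ by at
// most 1.5% of the larger value.
func withinMemoryTolerance(capacityA, capacityB float64) bool {
	larger, smaller := capacityA, capacityB
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	return (larger-smaller)/larger <= maxMemoryDifferenceRatio
}

func main() {
	// ~1.1% apart, as observed across availability zones: similar.
	fmt.Println(withinMemoryTolerance(16116152, 15944120)) // true
	// 5% apart: not similar.
	fmt.Println(withinMemoryTolerance(16000000, 15200000)) // false
}
```
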