Commit Graph

2408 Commits

Author SHA1 Message Date
jesse.millan 3fd510bb5a
Upgrade OCI provider SDK to v65.90.0. Required for Go 1.24. 2025-05-10 22:57:16 -07:00
Kubernetes Prow Robot 9cdcc284ea
Merge pull request #8047 from raykrueger/aws-eks-hybrid-nodes-fix
fix: AWSCloudProvider should ignore unrecognized provider IDs
2025-05-04 15:25:56 -07:00
Kubernetes Prow Robot 41630404f3
Merge pull request #7817 from karsten42/feature/hetzner-config-from-configmap
added possibility to retrieve hcloud cluster config from file
2025-05-02 03:55:54 -07:00
Karsten van Baal ea764b4ef7 chore: refactored config parsing 2025-05-02 12:25:25 +02:00
Kubernetes Prow Robot 24494f3c06
Merge pull request #7804 from ttsuuubasa/capi-scale-from-0-nodes
cluster-api: node template in scale-from-0-nodes scenario with DRA
2025-05-01 16:17:54 -07:00
Tsubasa Watanabe 2291b74a2d Make InstanceResourceSlices func more efficient and make comments about DRA annotation in capi more recognizable
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
2025-05-01 12:09:48 +09:00
Piotr Betkier ac1c7b5463 use k8s.io/component-helpers/resource for pod request calculations 2025-04-22 17:36:17 +02:00
Ray Krueger 3a1973872f fix: AWSCloudProvider ignores unrecognized provider IDs
The AWSCloudProvider only supports aws://zone/name ProviderIDs. It
should ignore ProviderIDs it does not recognize. Prior to this fix, an
unrecognized ProviderID, such as eks-hybrid://zone/cluster/my-node which
is used by EKS Hybrid Nodes, would break the Autoscaler loop.

This fix logs a warning and returns nil, nil instead of
returning the error.
2025-04-17 17:27:46 +00:00
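The behavior this commit message describes can be sketched as follows. This is a minimal illustration, not the provider's actual code: `instanceRef` and `parseProviderID` are hypothetical names, and the simple prefix check stands in for whatever matching the real provider uses. An ID with an unrecognized scheme produces a warning and a `nil, nil` result rather than an error, so the caller's loop can continue.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceRef is a hypothetical stand-in for the provider's instance reference.
type instanceRef struct {
	Zone string
	Name string
}

// parseProviderID (hypothetical) returns nil, nil for provider IDs that do not
// use the aws:// scheme, and an error only for malformed aws:// IDs.
func parseProviderID(id string) (*instanceRef, error) {
	const scheme = "aws://"
	if !strings.HasPrefix(id, scheme) {
		// Unrecognized scheme: warn and skip instead of failing the caller.
		fmt.Printf("warning: ignoring unrecognized provider ID %q\n", id)
		return nil, nil
	}
	// Accept both aws://zone/name and aws:///zone/name.
	rest := strings.TrimLeft(strings.TrimPrefix(id, scheme), "/")
	parts := strings.Split(rest, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return nil, fmt.Errorf("malformed AWS provider ID %q", id)
	}
	return &instanceRef{Zone: parts[0], Name: parts[1]}, nil
}

func main() {
	ids := []string{"aws://us-east-1a/i-0123", "eks-hybrid://zone/cluster/my-node"}
	for _, id := range ids {
		ref, err := parseProviderID(id)
		fmt.Println(id, ref, err)
	}
}
```

Callers in this sketch must treat a nil reference with a nil error as "not mine, skip it".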
Pierre Ozoux e51dcfb60b
Update cluster-autoscaler/cloudprovider/clusterapi/README.md 2025-04-17 09:18:28 +02:00
Pierre Ozoux 6eebb82f0d
Update cluster-autoscaler/cloudprovider/clusterapi/README.md 2025-04-17 09:17:44 +02:00
Maksym Fuhol 99584890b4 Clean instance templates for untracked migs. 2025-04-15 12:29:19 +00:00
karsten 22dc4e06f6 chore: added paragraph to readme for new HCLOUD_CLUSTER_CONFIG_FILE 2025-04-11 07:29:01 +02:00
Daniel Kłobuszewski f1a44d89cf
Remove outdated GCE cloudprovider owners 2025-04-08 13:24:20 +02:00
Kubernetes Prow Robot 4bc861d097
Merge pull request #7923 from Uladzislau97/nap-resilience
Improve resilience of diskTypes requests.
2025-04-08 04:22:40 -07:00
Kubernetes Prow Robot 7c28f52f93
Merge pull request #7854 from AppliedIntuition/master
Fix 2 bugs in the OCI integration
2025-04-07 09:14:42 -07:00
Vlad Vasilyeu 93e21d05e2 Replace diskTypes.aggregatedList request with diskTypes.list in FetchAvailableDiskTypes. 2025-04-07 07:50:29 +00:00
Kubernetes Prow Robot 1de2160986
Merge pull request #7908 from Preisschild/fix/capi-patch-instead-update
CA: Use Patch to Scale clusterapi nodepools
2025-04-03 07:16:48 -07:00
Kubernetes Prow Robot dc91330f6a
Merge pull request #7989 from loick111/feature/clusterapi-instances-status
ClusterAPI: Report machine phases to improve cluster-autoscaler decisions
2025-04-01 07:44:38 -07:00
Florian Ströger ecb572a945 Use Patch to Scale clusterapi nodepools to avoid modification conflicts
Issue: https://github.com/kubernetes/autoscaler/issues/7872
Signed-off-by: Florian Ströger <stroeger@youniqx.com>
2025-04-01 08:26:45 +02:00
Pierre Ozoux 8a954bc021
docs(autoscaler): add details about flags
It is currently slightly confusing if you skim through the documentation.

For instance, see the discussion here:
https://github.com/kubernetes/autoscaler/pull/7974

I hope that adding these 2 Important sections will warn the reader about the key difference between, and the need for, these 2 options.
2025-03-28 15:47:14 +01:00
Loick MAHIEUX 005a42b9af feat(cluster-autoscaler): improve nodes listing in ClusterAPI provider
Add improved error handling for machines phase in the ClusterAPI node group
implementation. When a machine is in Deleting/Failed/Pending phase, mark the cloudprovider.Instance
with a status for cluster-autoscaler recovery actions.

The changes:
- Enhance Nodes listing to allow reporting the machine phase in Instance status
- Add error status reporting for failed machines

This change helps identify and manage failed machines more effectively,
allowing the autoscaler to make better scaling decisions.
2025-03-28 15:07:34 +01:00
Kubernetes Prow Robot 7b6996469b
Merge pull request #7973 from jincong8973/master
feat: add ignoreDaemonSetsUtilization and zeroOrMaxNodeScaling to NodeGroupAutoscalingOptions
2025-03-27 00:00:35 -07:00
KrJin e713b51bd6 feat: add missing fields zeroOrMaxNodeScaling and ignoreDaemonSetsUtilization to NodeGroupAutoscalingOptions
[squashed] Add fields IgnoreDaemonSetsUtilization and zeroOrMaxNodeScaling that are missing in externalgrpc proto
2025-03-27 11:28:12 +08:00
Kubernetes Prow Robot 2ca5b44652
Merge pull request #7977 from elmiko/refactor-findscalableproviderids
refactor findScalableResourceProviderIDs in clusterapi
2025-03-26 10:22:43 -07:00
elmiko 5e1fc195a3 refactor findScalableResourceProviderIDs in clusterapi
this change refactors the function so that each distinct machine
state can be filtered more easily. the unit tests have been
supplemented, but not changed, to ensure that the functionality continues
to work as expected. these changes are to help better detect edge cases
where machines can be transitioning through the pending phase and might be
removed by the autoscaler.
2025-03-26 12:41:09 -04:00
Kubernetes Prow Robot 63309979ba
Merge pull request #7826 from Azure/rakechill/update-skewer-version-master
Update skewer version to v0.0.19 (master)
2025-03-26 01:30:34 -07:00
Veer Singh a226478f53 pricing changes: updated z3 pricing information 2025-03-24 04:06:26 +00:00
eric-higgins-ai 8da9a7b4af add log messages 2025-03-21 14:02:10 -07:00
eric-higgins-ai 370c8eb78e Revert "Address comment"
This reverts commit 233d5c6e4d.
2025-03-21 13:58:56 -07:00
Jack Francis 7b5e10156e s/nodeHasValidProviderID/isProviderIDNormalized
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2025-03-19 12:30:33 -07:00
Jack Francis 4aa465764c capi: node and provider ID accounting funcs
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2025-03-19 11:40:19 -07:00
elmiko 71d3595cb7 improve failed machine detection in clusterapi
This change makes it so that when a failed machine is found during
`findScalableResourceProviderIDs`, it will always gain a normalized
provider ID with the failure guard prepended. This is to ensure that
machines which have gained a provider ID from the infrastructure and
then later go into a failed state can be properly removed by the
autoscaler when it wants to correct the size of a node group.
2025-03-19 12:34:29 -04:00
elmiko 003e6cd67c make DecreaseTargetSize more accurate for clusterapi
this change ensures that when DecreaseTargetSize counts the nodes,
it does not include any instances which are considered to be
pending (i.e. not having a node ref), deleting, or failed. this change will
allow the core autoscaler to then decrease the size of the node group
accordingly, instead of raising an error.

This change also adds some code to the unit tests to make detection of
this condition easier.
2025-03-17 19:34:07 -04:00
Joel Smith bef1f89a76 Update to golang.org/x/oauth2@v0.27 to fix CVE-2025-22868
Signed-off-by: Joel Smith <joelsmith@redhat.com>
2025-03-11 16:56:12 -06:00
eric-higgins-ai 233d5c6e4d Address comment 2025-03-05 20:34:24 -08:00
Kubernetes Prow Robot a58d346c09
Merge pull request #7767 from Kamatera/tag-fixes-add-support-for-filter-name-prefix
Kamatera cluster autoscaler fixes
2025-02-27 11:36:30 -08:00
Jack Francis 0fd973a45e azure: increase UT coverage in azure_vms_pool
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2025-02-25 16:08:22 -08:00
eric-higgins-ai 91d20d533e unit test coverage 2025-02-24 11:17:00 -08:00
eric-higgins-ai cc430980d2 fixes 2025-02-21 18:39:48 -08:00
eric-higgins-ai 5735b8ae19 get all node shapes 2025-02-21 14:02:31 -08:00
eric-higgins-ai 9c0357a6f2 fix scale up bug 2025-02-21 13:57:03 -08:00
Rachel Gregory ed621282b5 Update only skewer with go get dep@ver 2025-02-20 15:12:40 -08:00
Rachel Gregory 72665b3d1c Undo previous changes made by go mod vendor
This reverts commit b66b44621e.
2025-02-20 14:57:51 -08:00
Muhammad Soliman 4f13cabcb4
Fixes based on code review
change last character in extended resources prefix to be `.` instead of `-`.
Add a warning if the extended resource already exists.
2025-02-12 10:35:24 +01:00
Muhammad Soliman ad6d6c9871
Merge branch 'kubernetes:master' into prefixed_extended_resources 2025-02-12 10:11:33 +01:00
Tsubasa Watanabe 3fbacf0d0f cluster-api: node template in scale-from-0-nodes scenario with DRA
Modify TemplateNodeInfo() to return the template of ResourceSlice.
This is to address the DRA expansion of Cluster Autoscaler, allowing users to set the number of GPUs and DRA driver name by specifying
the annotation to NodeGroup provided by cluster-api.

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
2025-02-12 11:56:04 +09:00
Rachel Gregory b66b44621e Update skewer version on master branch 2025-02-11 14:31:35 -08:00
Karsten van Baal 65c14d5526 added possibility to retrieve hcloud cluster config from file 2025-02-10 16:41:39 +01:00
Jeremy L. Morris c87e68c01d add emeritus approvers section 2025-02-07 13:06:57 -05:00
Jeremy L. Morris d8f60fe19e Removes stale owners from DO cluster autoscaler owners file and updates to current DO employee 2025-01-30 13:10:27 -05:00
Ori Hoch 83ce5dd13c fix tag name attribute 2025-01-24 15:24:26 +02:00
Ori Hoch 3c187264cd add retry mechanism 2025-01-24 15:15:59 +02:00
Ori Hoch 1d42eb55ea shorten the uuids 2025-01-24 13:20:03 +02:00
Ori Hoch cdb90ec4d5 tags fixes and add support for filter name prefix 2025-01-24 12:47:23 +02:00
Robin D. 64ca097c1e
fix: undefined instance state on provisioning state failed (#7750)
* fix: undefined instance state on provisioning state failed

* test: add unit tests for provisioning state failed + fast delete

* test: support both fast/not fast delete on an affected test
2025-01-22 23:08:37 -08:00
Robin D. 9559204f61
test: add additional assertion for dynamic SKU list test (#7737) 2025-01-22 13:22:38 -08:00
Robin D. 03e6b2797d
chore: remove unnecessary logs on fast delete and add a relevant note (#7736) 2025-01-22 13:20:38 -08:00
Robin D. 7e8c41d175
test: clean up environments properly before/after each unit test in azure_manager_test.go (#7735)
* test: clean up environments properly before/after each unit test in azure_manager_test.go

* test: use testing.Cleanup() to ensure loadEnv()
2025-01-22 12:18:37 -08:00
Kubernetes Prow Robot 082e230b92
Merge pull request #7391 from jackfrancis/ca-cloudprovider-build-tags-hygiene
add test-build-tags make target
2025-01-17 02:14:06 -08:00
Kubernetes Prow Robot 027795a97c
Merge pull request #7339 from justinmir/kwok-provider-metrics-annotation
Add metrics-server annotation for kwok-provider managed nodes
2025-01-17 01:50:07 -08:00
Robin Deeboonchai 97dd5fe4ee fix: don't crash when vmss not present or has no nodes 2025-01-16 14:57:20 -08:00
Kubernetes Prow Robot ea52310b69
Merge pull request #6890 from b0e/implement-templateNodeInfo-for-cloudprovider-magnum
Implement TemplateNodeInfo for magnum cloudprovider
2025-01-16 03:00:33 -08:00
Kubernetes Prow Robot 03e2795c9f
Merge pull request #7405 from ctrox/rancher-clarify-docs
docs(rancher): clarify single RKE2 target
2024-12-30 02:14:14 +01:00
Kubernetes Prow Robot 38facfc3dd
Merge pull request #7633 from PerforMance308/master
remove contact information for huaweicloud cluster autoscaler provider
2024-12-27 14:12:12 +01:00
Kubernetes Prow Robot 50c65906fd
Merge pull request #7530 from towca/jtuznik/dra-actual
CA: DRA integration MVP
2024-12-20 16:30:08 +01:00
Kuba Tużnik a45e6b7003 CA: implement DRA integration tests for StaticAutoscaler 2024-12-20 13:30:36 +01:00
Shiqi Wang 11740d1398
remove contact information 2024-12-19 09:23:05 -05:00
Muhammad Soliman dd6f11b10e
Merge branch 'kubernetes:master' into prefixed_extended_resources 2024-12-18 10:17:16 +01:00
Kubernetes Prow Robot da31dff7a6
Merge pull request #7614 from DataDog/update-azure-instance-types
update azure static sku list
2024-12-17 20:54:52 +01:00
Rahul Rangith 6ab0eb94f7
update azure static sku list 2024-12-16 15:01:28 -05:00
Walid Ghallab 720f5946fd Refactor NewAutoscalerError function.
We will have two functions instead of one:
1. One that doesn't do formatting, like klog.Error
2. One that accepts formatting, like klog.Errorf

The main reason behind this is to avoid go vet errors and to have clear
interfaces, so that go vet (or go test in go 1.24, which treats them as
errors) catches accidental bugs.
2024-12-16 17:46:40 +00:00
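The split described in this commit message can be sketched as below. The names `newError` and `newErrorf` and the `autoscalerError` type are hypothetical and chosen for illustration only; they are not the identifiers used in the repository. The point is that a message containing a literal `%` passes through the non-formatting constructor untouched, while only the `f` variant interprets format verbs.

```go
package main

import "fmt"

// autoscalerError is a hypothetical error type carrying a category and a message.
type autoscalerError struct {
	errorType string
	msg       string
}

func (e autoscalerError) Error() string { return e.msg }

// newError (hypothetical) stores the message verbatim, with no formatting.
func newError(errorType, msg string) error {
	return autoscalerError{errorType: errorType, msg: msg}
}

// newErrorf (hypothetical) formats the message with fmt.Sprintf.
func newErrorf(errorType, format string, args ...any) error {
	return autoscalerError{errorType: errorType, msg: fmt.Sprintf(format, args...)}
}

func main() {
	// A literal percent sign is safe with the non-formatting variant.
	fmt.Println(newError("internal", "disk is 100% full"))
	// Format verbs are only interpreted by the f-variant.
	fmt.Println(newErrorf("cloudProvider", "node %s failed after %d retries", "n1", 3))
}
```

Keeping the two entry points separate means a printf-style checker only has to reason about the `f` variant, which is the design motivation the commit message gives.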
Kubernetes Prow Robot 148ffa345b
Merge pull request #7520 from hetznercloud/refactor-placement-groups
refactor(hetzner): refactored placement group code
2024-12-16 13:36:51 +01:00
Muhammad Soliman 2b62a7d6df
Add option for passing extended resources in node labels in GCE
On GCE, Cluster Autoscaler reads extended resource information from kubenv->AUTOSCALER_ENV_VARS->extended_resources in the managed scaling group template definition.

However, users have no way to add a variable to extended resources; they are controlled from the GKE side. This results in Cluster Autoscaler not supporting scale-up from zero for all node pools that have extended resources (like GPU) on GCE.

However, node labels are passed from the node pool to the managed scaling group template through kubenv->AUTOSCALER_ENV_VARS->node_labels.

This commit introduces the ability to pass extended resources as node labels with a defined prefix on GCE, similar to how Cluster Autoscaler expects extended resources on AWS. This allows scaling from zero for node pools with extended resources.
2024-12-13 13:39:12 +01:00
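The label-based mechanism described above might look roughly like the following sketch. The prefix value `example.io/extended-resource.` is a placeholder, since the commit message does not state the real prefix, and `extendedResourcesFromLabels` is a hypothetical helper. Keeping an already-present resource and printing a warning is also an assumption about how duplicates are handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// extendedResourcePrefix is a placeholder; the real prefix is not given in the
// commit message.
const extendedResourcePrefix = "example.io/extended-resource."

// extendedResourcesFromLabels (hypothetical) copies prefixed node labels into a
// resource map. Resources already in existing are kept and a warning is printed.
func extendedResourcesFromLabels(labels map[string]string, existing map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(existing))
	for name, qty := range existing {
		out[name] = qty
	}
	for key, value := range labels {
		if !strings.HasPrefix(key, extendedResourcePrefix) {
			continue // ordinary label, not an extended resource
		}
		name := strings.TrimPrefix(key, extendedResourcePrefix)
		qty, err := strconv.ParseInt(value, 10, 64)
		if err != nil || name == "" {
			fmt.Printf("warning: skipping label %q with value %q\n", key, value)
			continue
		}
		if _, ok := out[name]; ok {
			fmt.Printf("warning: extended resource %q already exists\n", name)
			continue
		}
		out[name] = qty
	}
	return out
}

func main() {
	labels := map[string]string{
		"example.io/extended-resource.gpu": "2",
		"topology.kubernetes.io/zone":      "us-central1-a",
	}
	fmt.Println(extendedResourcesFromLabels(labels, nil))
}
```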
lukasmetzner d68a1f26b1 refactor: moved error checking with exiting to callsite 2024-12-13 11:57:51 +01:00
Alex Leites 61c8cdeff7 fix: corresponding test 2024-12-08 02:22:02 +00:00
Alex Leites 5e7ceee507 fix: setting getVmssSizeRefreshPeriod 2024-12-08 01:23:04 +00:00
Kubernetes Prow Robot bd7156e837
Merge pull request #7557 from gvnc/handle-ooh-capacity-nodes
Avoid making delete api calls for nodes that don't have an instance id
2024-12-06 22:48:01 +00:00
“gkazanci” 660f1aa6cd added more logs 2024-12-03 17:03:56 +00:00
willie-yao 064d48f36c
Add toggle for fast delete 2024-11-26 00:25:04 +00:00
Kubernetes Prow Robot 86a80c6823
Merge pull request #7526 from willie-yao/cse-fast-delete
Set node state to InstanceCreating to delete on CSE error
2024-11-26 00:20:57 +00:00
willie-yao 49a1ad4ad2
Set node state to InstanceCreating to delete on CSE error 2024-11-23 00:25:12 +00:00
Jack Francis f1a1bab379 add test-build-tags make target
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2024-11-22 09:16:23 -08:00
lukasmetzner 64495d95a0 refactor(hetzner): refactored placement group code 2024-11-22 13:28:52 +01:00
Kubernetes Prow Robot 5458e1c208
Merge pull request #7436 from maximrub/fr-7435-alibaba-cloud-rrsa-new-env-vars
7435 Support New Alibaba Cloud ENV Variables names for RRSA Authorization
2024-11-22 10:30:54 +00:00
Kubernetes Prow Robot 4c37ff38ce
Merge pull request #6999 from dominic-p/iss-5919-placement-groups
Add support for node pool placement group config
2024-11-20 13:04:53 +00:00
Kubernetes Prow Robot a01276ef14
Merge pull request #7493 from BigDarkClown/remove-unneeded
Add flag to force remove long unregistered nodes
2024-11-19 10:00:55 +00:00
Kubernetes Prow Robot 2d37aeefe8
Merge pull request #7385 from jlamillan/jlamillan/oci_sdk_65.75.2-2
Upgrade OCI providers SDK to v65.75.2.
2024-11-18 23:54:54 +00:00
Bartłomiej Wróblewski c5f13bb02d Add ForceDeleteNodes implementation for GCE cloud provider 2024-11-18 13:55:09 +00:00
Bartłomiej Wróblewski 3b47908e51 Add ForceDeleteNodes method to NodeGroup interface 2024-11-18 13:55:07 +00:00
Maxim Rubchinsky dcd6d6ab36
7435 Support New Alibaba Cloud ENV Variables names for RRSA Authorization in Cluster Autoscaler
Signed-off-by: Maxim Rubchinsky <maxim@rubchinsky.com>
2024-11-16 11:58:54 +02:00
Kubernetes Prow Robot b01bff1640
Merge pull request #7453 from gvnc/oci-self-managed-nodes-fix
exclude self-managed nodes from being processed
2024-11-15 23:32:53 +00:00
Kubernetes Prow Robot 009f2b8b16
Merge pull request #7438 from maximrub/bug-7437-alibaba-cloud-endpoint-reloving-logging
7437 Add logging for endpoint resolving errors
2024-11-15 10:10:52 +00:00
Kubernetes Prow Robot 267a0d8a98
Merge pull request #7459 from damikag/update-bootdisk-logs
Change log level of boot disk type and size defaulting in gce_price
2024-11-15 09:54:53 +00:00
Kubernetes Prow Robot 59aefbcd5e
Merge pull request #7379 from ionos-cloud/remove-obsolete-upper-bound-check
Remove obsolete upper bound check
2024-11-12 19:28:46 +00:00
Kubernetes Prow Robot 93f74c0948
Merge pull request #7481 from jackfrancis/vmss-proactive-deleting
azure: StrictCacheUpdates to disable proactive vmss cache updates
2024-11-11 18:52:46 +00:00
Kubernetes Prow Robot c9970a48ec
Merge pull request #7383 from DataDog/fix-instance-requirements-caching
AWS: only cache instance requirements when needed
2024-11-11 14:22:46 +00:00
Jack Francis 1e5ed185d7 restore original behavior
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2024-11-10 20:47:22 -08:00
Jack Francis c20971357f azure: don’t eagerly update vmss cache before delete success
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2024-11-08 16:54:38 -08:00
Achim Ledermüller a249ca9290 Implement TemplateNodeInfo for magnum cloudprovider 2024-11-07 17:04:37 +01:00
Kubernetes Prow Robot 0e8545325a
Merge pull request #7113 from IrisIris/feature/compatible-with-alicloud-desire-size
add support to scaling group desired size for alicloud
2024-11-07 10:43:29 +00:00