autoscaler

Commit Graph

Author	SHA1	Message	Date
Alessandro Di Stefano	3aedfc9929	Merge `16cfe50486` into `a9292351c3`	2025-09-18 04:45:32 -07:00
Bartłomiej Wróblewski	a366e629ce	Fix unit-tests	2025-09-15 16:16:46 +00:00
aleskandro	16cfe50486	[Tests] Update cluster-api provider to use machineTemplate.status.nodeInfo for architecture-aware autoscale from zero kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run. Previously, we have been implementing this behavior in the cluster autoscaler by leveraging the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time. With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order: - Labels set in existing nodes for not-autoscale-from-zero cases - Labels set in the labels capacity annotation of machine template, machine set, and machine deployment. - Values in the status.nodeSystemInfo of MachineTemplates - Generic/default labels set in the environment of the cluster autoscaler	2025-09-05 06:58:57 +01:00
Kubernetes Prow Robot	24494f3c06	Merge pull request #7804 from ttsuuubasa/capi-scale-from-0-nodes cluster-api: node template in scale-from-0-nodes scenario with DRA	2025-05-01 16:17:54 -07:00
Kubernetes Prow Robot	dc91330f6a	Merge pull request #7989 from loick111/feature/clusterapi-instances-status ClusterAPI: Report machine phases to improve cluster-autoscaler decisions	2025-04-01 07:44:38 -07:00
Loick MAHIEUX	005a42b9af	feat(cluster-autoscaler): improve nodes listing in ClusterAPI provider Add improved error handling for machines phase in the ClusterAPI node group implementation. When a machine is in Deleting/Failed/Pending phase, mark the cloudprovider.Instance with a status for cluster-autoscaler recovery actions. The changes: - Enhance Nodes listing to allow reporting the machine phase in Instance status - Add error status reporting for failed machines This change helps identify and manage failed machines more effectively, allowing the autoscaler to make better scaling decisions.	2025-03-28 15:07:34 +01:00
elmiko	5e1fc195a3	refactor findScalableResourceProviderIDs in clusterapi this change refactors the function so that each distinct machine state can be filtered more easily. the unit tests have been supplemented, but not changed, to ensure that the functionality continues to work as expected. these changes are to help better detect edge cases where machines can be transitioning through the pending phase and might be removed by the autoscaler.	2025-03-26 12:41:09 -04:00
elmiko	71d3595cb7	improve failed machine detection in clusterapi This change makes it so that when a failed machine is found during the `findScalableResourceProviderIDs` it will always gain a normalized provider ID with failure guard prepended. This is to ensure that machines which have gained a provider ID from the infrastructure and then later go into a failed state can be properly removed by the autoscaler when it wants to correct the size of a node group.	2025-03-19 12:34:29 -04:00
elmiko	003e6cd67c	make DecreaseTargetSize more accurate for clusterapi this change ensures that when DecreaseTargetSize is counting the nodes, it does not include any instances which are considered to be pending (i.e. not having a node ref), deleting, or failed. this change will allow the core autoscaler to then decrease the size of the node group accordingly, instead of raising an error. This change also adds some code to the unit tests to make detection of this condition easier.	2025-03-17 19:34:07 -04:00
Tsubasa Watanabe	3fbacf0d0f	cluster-api: node template in scale-from-0-nodes scenario with DRA Modify TemplateNodeInfo() to return the template of ResourceSlice. This is to address the DRA expansion of Cluster Autoscaler, allowing users to set the number of GPUs and DRA driver name by specifying the annotation to NodeGroup provided by cluster-api. Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>	2025-02-12 11:56:04 +09:00
Kubernetes Prow Robot	4e3cc27898	Merge pull request #6743 from helio/capi-nodegroup-options feat(clusterapi): per nodeGroup autoscaling options	2024-06-27 03:58:03 -07:00
enxebre	31fdc397fd	Avoid expensive pointer copy in capi nodegroup	2024-05-07 15:47:23 +02:00
Michael Weibel	824c108853	feat(clusterapi): per nodeGroup autoscaling options	2024-04-22 15:42:08 +02:00
Max Fedotov	6c65baa1c6	[clusterapi] Update tests for nodegroups with minSize=maxSize	2024-03-13 18:38:12 +01:00
aleskandro	398ffaf82f	Fix the buildTemplateLabels method for the ClusterApi provider The joinStringMaps call in the buildTemplateLabels method of the clusterApi provider should not overwrite any custom labels with the generic ones returned by buildGenericLabels()	2023-04-26 10:37:13 +02:00
Matt Boersma	6040d29252	[cluster-api] Handle ignored errors	2023-02-28 13:56:29 -07:00
michael mccune	5bbfcd32e0	remove dead code in clusterapi provider tests This change removes an `if` statement that was left behind after a refactor. The test in question has the same logic embedded into a previous conditional and the removed statement has no effect on the tests.	2023-02-17 13:52:54 -05:00
Paco Xu	8dec2025f8	Stop applying the beta.kubernetes.io/os and arch	2022-10-27 12:20:04 +08:00
Michael McCune	bb015b26a1	remove unsupported functionality from cluster-api provider this change removes the code for the `Labels` and `Taints` interface functions of the clusterapi provider when scaling from zero. The body of these functions was added erroneously and the Cluster API community is still deciding on how these values will be exposed to the autoscaler. also updates the tests and readme to be more clear about the usage of labels and taints when scaling from zero.	2022-10-14 14:06:57 -04:00
Michael McCune	1a65fde540	cleanup clusterapi scale from zero implementation This commit is a combination of several commits. Significant details are preserved below. * update functions for resource annotations This change converts some of the functions that look at annotation for resource usage to indicate their usage in the function name. This helps to make room for allowing the infrastructure reference as an alternate source for the capacity information. * migrate capacity logic into a single function This change moves the logic to collect the instance capacity from the TemplateNodeInfo function into a method of the unstructuredScalableResource named InstanceCapacity. This new function is created to house the logic that will decide between annotations and the infrastructure reference when calculating the capacity for the node. * add ability to lookup infrastructure references This change supplements the annotation lookups by adding the logic to read the infrastructure reference if it exists. This is done to determine if the machine template exposes a capacity field in its status. For more information on how this mechanism works, please see the cluster-api enhancement[0]. * add documentation for capi scaling from zero * improve tests for clusterapi scale from zero this change adds functionality to test the dynamic client behavior of getting the infrastructure machine templates. * update README with information about rbac changes this adds more information about the rbac changes necessary for the scale from zero support to work. * remove extra check for scaling from zero since the CanScaleFromZero function checks to see if both CPU and memory are present, there is no need to check a second time. This also adds some documentation to the CanScaleFromZero function to make it clearer what is happening. * update unit test for capi scale from zero adding a few more cases and details to the scale from zero unit tests, including ensuring that the int based annotations do not accept other unit types. [0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md	2022-07-22 20:21:32 -04:00
Andrew McDermott	de90a462c7	Implement scale from zero for clusterapi This allows a Machine{Set,Deployment} to scale up/down from 0, providing the following annotations are set: ```yaml apiVersion: v1 items: - apiVersion: machine.openshift.io/v1beta1 kind: MachineSet metadata: annotations: machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0" machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6" machine.openshift.io/vCPU: "2" machine.openshift.io/memoryMb: 8G machine.openshift.io/GPU: "1" machine.openshift.io/maxPods: "100" ``` Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods` are optional. For autoscaling from zero, the autoscaler should convert the mem value received in the appropriate annotation to bytes using powers of two consistently with other providers and fail if the format received is not expected. This gives robust behaviour consistent with cloud providers APIs and providers implementations. https://cloud.google.com/compute/all-pricing https://www.iec.ch/si/binary.htm https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366 Co-authored-by: Enxebre <alberto.garcial@hotmail.com> Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk> Co-authored-by: Michael McCune <elmiko@redhat.com>	2022-07-18 13:50:25 -04:00
enxebre	b2f1823c91	Get capi targetsize from cache This ensured that access to replicas during scale down operations were never stale by accessing the API server https://github.com/kubernetes/autoscaler/issues/3104. This honoured that behaviour while moving to unstructured client https://github.com/kubernetes/autoscaler/pull/3312. This regressed that behaviour while trying to reduce the API server load https://github.com/kubernetes/autoscaler/pull/4443. This put back the never stale replicas behaviour at the cost of loading back the API server https://github.com/kubernetes/autoscaler/pull/4634. Currently, on e.g. a 48-minute cluster, it makes 1.4k get requests to the scale subresource. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that, it lets targetSize, which is called on every autoscaling cluster state loop, come from the cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.	2022-07-13 20:26:44 +02:00
Joel Speed	9f670d4ea8	Ensure ClusterAPI DeleteNodes accounts for out of band changes scale Because the autoscaler assumes it can delete nodes in parallel, it fetches nodegroups for each node in separate go routines and then instructs each nodegroup to delete a single node. Because we don't share the nodegroup across go routines, the cached replica count in the scalableresource can become stale and as such, if the autoscaler attempts to scale down multiple nodes at a time, the cluster api provider only actually removes a single node. To prevent this, we must ensure we have a fresh replica count for every scale down attempt.	2022-01-21 16:08:00 +00:00
Kubernetes Prow Robot	12efcce4c7	Merge pull request #4443 from codablock/fix-rate-limitting [clusterapi] Rely on replica count found in unstructuredScalableResource	2021-12-14 10:45:30 -08:00
Clinton Yeboah	ecfaa6d700	removes deprecated CAPI annotations	2021-11-11 18:56:53 -05:00
Alexander Block	897c208ed1	Fix tests	2021-11-04 14:40:10 +01:00
Jason DeTiberus	06e5f6a0ed	Update group identifier to use for Cluster API annotations - Also add backwards compatibility for the previously used deprecated annotations	2020-09-21 10:42:46 -04:00
Jason DeTiberus	75b850718f	Add node autodiscovery to cluster-autoscaler clusterapi provider	2020-08-20 16:08:49 -04:00
Jason DeTiberus	63f9e40d82	Improve Cluster API tests to work better with constrained resources	2020-08-19 13:31:32 -04:00
Jason DeTiberus	18d44fc532	Convert clusterapi provider to use unstructured Remove internal types for Cluster API and replace with unstructured access	2020-07-21 15:49:03 -04:00
Joel Speed	be6edb4a3e	Rewrite DeleteNodesTwice test to check API not TargetSize for cluster-autoscaler CAPI provider	2020-06-02 14:51:11 -04:00
Enxebre	9c8b78aa79	Get replicas always from API server for cluster-autoscaler CAPI provider When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we always want to get fresh replicas. This is one in a series of PRs to mitigate kubernetes#3104	2020-06-02 14:45:58 -04:00
Joel Speed	5e0126ada5	Do not normalize Node IDs outside of CAPI provider	2020-04-16 10:32:27 +01:00
Joel Speed	d23d3a1dd5	Add testing for fake provider IDs	2020-04-02 15:24:57 +01:00
Andrew McDermott	d9e3197daa	Normalize providerID values We index on providerID but it turns out that those values on node and machine are not always consistent. Some encode region, some do not, for example. This commit normalizes all values through the normalizedProviderString(). To ensure that we catch all places I've introduced a new type and made the find() functions take this new type in lieu of a string. Unit tests have also been adjusted to introduce a 'test:///' prefix on the providerID value to further validate the change. This change allows CAPI to work out-of-the-box, assuming v1alpha2. It's also reasonable to assert that this consistency should be enforced elsewhere and to make this behaviour easily revertable I'm leaving this as a separate commit in this patch series.	2020-03-10 10:59:05 +00:00
Joel Speed	eae1579100	Ensure DeleteNodes doesn't delete a node twice	2020-03-10 10:59:05 +00:00
Enxebre	699c0b83b4	Let Nodes() return the list of all machines The autoscaler expects the nodeGroups of provider implementations to implement the Nodes() function to return the number of instances belonging to the group, regardless of whether they have become a kubernetes node or not. This information is then used, for instance, to detect unregistered nodes `bf3a9fb52e/cluster-autoscaler/clusterstate/clusterstate.go (L307-L311)`	2020-03-10 10:59:05 +00:00
Andrew McDermott	46bb9b4f29	cloudprovider/clusterapi: new provider This adds a new cloudprovider based on the cluster-api project: https://github.com/kubernetes-sigs/cluster-api	2020-03-10 10:59:04 +00:00
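
The priority order described in commit 16cfe50486 above (existing node labels first, then the labels capacity annotation, then the MachineTemplate status, then environment defaults) can be pictured as a layered map merge. The following is only a minimal sketch of that idea, not the provider's actual code: the function name `mergeByPriority` and all sample values are hypothetical.

```go
package main

import "fmt"

// mergeByPriority is a hypothetical helper. It merges label sources that are
// passed from lowest to highest priority, so later sources overwrite earlier ones.
func mergeByPriority(sources ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, src := range sources {
		for k, v := range src {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	// Illustrative inputs only; the real provider reads these from the
	// environment, the MachineTemplate status, annotations, and live nodes.
	envDefaults := map[string]string{
		"kubernetes.io/arch": "amd64",
		"kubernetes.io/os":   "linux",
	}
	templateStatus := map[string]string{"kubernetes.io/arch": "arm64"}
	capacityAnnotation := map[string]string{} // nothing set in this example
	existingNode := map[string]string{}       // scale-from-zero: no node yet

	labels := mergeByPriority(envDefaults, templateStatus, capacityAnnotation, existingNode)
	fmt.Println(labels["kubernetes.io/arch"], labels["kubernetes.io/os"]) // prints "arm64 linux"
}
```

Passing the sources from lowest to highest priority keeps the merge itself trivial; a higher-priority source wins only for the keys it actually sets.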

38 Commits
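
Commit de90a462c7 above states that the memory annotation should be converted to bytes using powers of two, and that an unexpected format should fail. A rough sketch of such a conversion, using only the standard library, might look like the following. The function name `memoryToBytes`, the accepted suffixes, and the choice to reject a bare number are assumptions made for illustration; they are not taken from the provider's implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// binaryMultipliers is an assumed suffix table; every unit is a power of two.
var binaryMultipliers = map[string]int64{
	"K": 1 << 10, "Ki": 1 << 10,
	"M": 1 << 20, "Mi": 1 << 20,
	"G": 1 << 30, "Gi": 1 << 30,
	"T": 1 << 40, "Ti": 1 << 40,
}

// memoryToBytes is a hypothetical helper that converts a value such as "8G"
// to bytes. It returns an error for anything that is not digits followed by
// a known suffix. Overflow is not checked in this sketch.
func memoryToBytes(value string) (int64, error) {
	// Find the first character that is not an ASCII digit.
	i := strings.IndexFunc(value, func(r rune) bool { return r < '0' || r > '9' })
	if i <= 0 {
		return 0, fmt.Errorf("memory value %q must be digits followed by a unit", value)
	}
	n, err := strconv.ParseInt(value[:i], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("memory value %q: %w", value, err)
	}
	mult, ok := binaryMultipliers[value[i:]]
	if !ok {
		return 0, fmt.Errorf("memory value %q has unexpected unit %q", value, value[i:])
	}
	return n * mult, nil
}

func main() {
	b, err := memoryToBytes("8G")
	fmt.Println(b, err) // prints "8589934592 <nil>"

	_, err = memoryToBytes("8 gigs")
	fmt.Println(err != nil) // prints "true"
}
```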