kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.
Previously, the cluster autoscaler implemented this behavior through the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time.
With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order:
- Labels set on existing nodes, when the node group is not scaling from zero
- Labels set in the labels capacity annotation of the machine template, machine set, and machine deployment
- Values in the status.nodeSystemInfo of MachineTemplates
- Generic/default labels set in the environment of the cluster autoscaler
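A minimal sketch of the priority order above, assuming each source yields either a value or an empty string. The `archSources` type, its field names, and `resolveArchitecture` are hypothetical and only illustrate the fallback chain; they are not the identifiers used in the provider.

```go
package main

import "fmt"

// archSources holds the candidate values for a node group's architecture.
// All field names are hypothetical; an empty string means "not available".
type archSources struct {
	existingNodeLabel  string // from a live node (not scaling from zero)
	capacityAnnotation string // from the labels capacity annotation
	templateNodeInfo   string // from the MachineTemplate status
	environmentDefault string // from the autoscaler's environment
}

// resolveArchitecture returns the first non-empty value in priority order,
// and false when no source provides one.
func resolveArchitecture(s archSources) (string, bool) {
	for _, candidate := range []string{
		s.existingNodeLabel,
		s.capacityAnnotation,
		s.templateNodeInfo,
		s.environmentDefault,
	} {
		if candidate != "" {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	// No live node and no annotation: the template status wins over the default.
	arch, ok := resolveArchitecture(archSources{
		templateNodeInfo:   "arm64",
		environmentDefault: "amd64",
	})
	fmt.Println(arch, ok)
}
```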
Add improved error handling for machine phases in the ClusterAPI node group
implementation. When a machine is in the Deleting, Failed, or Pending phase, mark the
cloudprovider.Instance with a status that lets the cluster-autoscaler take recovery actions.
The changes:
- Enhance Nodes listing to allow reporting the machine phase in Instance status
- Add error status reporting for failed machines
This change helps identify and manage failed machines more effectively,
allowing the autoscaler to make better scaling decisions.
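The phase handling described above can be sketched as follows. The types here are local stand-ins, not the real cloudprovider types, and reporting a Failed machine as "creating with an error" is an assumption made for illustration.

```go
package main

import "fmt"

// instanceState is a local stand-in for the autoscaler's instance state;
// it is not the real cloudprovider type.
type instanceState string

const (
	stateCreating instanceState = "Creating"
	stateDeleting instanceState = "Deleting"
)

// instanceStatus is a simplified, hypothetical status record.
type instanceStatus struct {
	State    instanceState
	ErrorMsg string // non-empty only when the machine has failed
}

// statusForPhase maps a machine phase to an instance status.
// It returns nil for phases that need no special handling.
func statusForPhase(phase string) *instanceStatus {
	switch phase {
	case "Pending":
		return &instanceStatus{State: stateCreating}
	case "Deleting":
		return &instanceStatus{State: stateDeleting}
	case "Failed":
		// Assumption: a failed machine is surfaced as a creation error.
		return &instanceStatus{State: stateCreating, ErrorMsg: "machine is in Failed phase"}
	default:
		return nil
	}
}

func main() {
	for _, phase := range []string{"Pending", "Deleting", "Failed", "Running"} {
		fmt.Printf("%s -> %+v\n", phase, statusForPhase(phase))
	}
}
```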
This change refactors the function so that each distinct machine
state can be filtered more easily. The unit tests have been
supplemented, but not changed, to ensure that the functionality continues
to work as expected. These changes help to better detect edge cases
where machines are transitioning through the Pending phase and might be
removed by the autoscaler.
This change makes it so that when a failed machine is found during the
`findScalableResourceProviderIDs` it will always gain a normalized
provider ID with failure guard prepended. This is to ensure that
machines which have gained a provider ID from the infrastructure and
then later go into a failed state can be properly removed by the
autoscaler when it wants to correct the size of a node group.
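A rough sketch of the guarded ID, under stated assumptions: the `failedMachineGuard` value, the `indexKey` helper, and the choice to build the guarded ID from the machine's namespace and name are all hypothetical and may differ from the provider's code.

```go
package main

import (
	"fmt"
	"strings"
)

// failedMachineGuard is a hypothetical marker; the real prefix may differ.
const failedMachineGuard = "failed-machine-"

// normalize keeps only the last path segment of a provider ID.
// This normalization rule is an assumption for the sketch.
func normalize(id string) string {
	parts := strings.Split(id, "/")
	return parts[len(parts)-1]
}

// indexKey returns the ID a machine is tracked under. A failed machine
// always gets the guarded form, even if the infrastructure already
// assigned it a provider ID.
func indexKey(namespace, name, providerID string, failed bool) string {
	if failed {
		return failedMachineGuard + namespace + "_" + name
	}
	return normalize(providerID)
}

// isFailedKey reports whether an ID carries the failure guard.
func isFailedKey(id string) bool {
	return strings.HasPrefix(id, failedMachineGuard)
}

func main() {
	healthy := indexKey("default", "m1", "aws:///us-east-1a/i-0abc", false)
	failed := indexKey("default", "m1", "aws:///us-east-1a/i-0abc", true)
	fmt.Println(healthy, isFailedKey(healthy))
	fmt.Println(failed, isFailedKey(failed))
}
```

Because the guard is applied regardless of whether a provider ID exists, a machine that fails after provisioning is still recognizable as failed when the node group is resized.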
This change ensures that when DecreaseTargetSize counts the nodes,
it does not include any instances that are considered to be
pending (i.e. not having a node ref), deleting, or failed. This
allows the core autoscaler to decrease the size of the node group
accordingly, instead of raising an error.
This change also adds some code to the unit tests to make detection of
this condition easier.
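The counting rule can be sketched like this. The `machine` type and the `countRegistered` and `checkDecrease` helpers are hypothetical simplifications, not the provider's DecreaseTargetSize implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// machine is a simplified, hypothetical view of a Cluster API machine.
type machine struct {
	Name       string
	Phase      string
	HasNodeRef bool
}

// countRegistered counts machines that back a real node, skipping any
// that are pending (no node ref), deleting, or failed.
func countRegistered(machines []machine) int {
	n := 0
	for _, m := range machines {
		if !m.HasNodeRef || m.Phase == "Deleting" || m.Phase == "Failed" {
			continue
		}
		n++
	}
	return n
}

// checkDecrease rejects a decrease only if it would remove registered nodes.
func checkDecrease(replicas, delta int, machines []machine) error {
	if delta >= 0 {
		return errors.New("size decrease must be negative")
	}
	if registered := countRegistered(machines); replicas+delta < registered {
		return fmt.Errorf("new size %d is below %d registered nodes", replicas+delta, registered)
	}
	return nil
}

func main() {
	machines := []machine{
		{Name: "m1", Phase: "Running", HasNodeRef: true},
		{Name: "m2", Phase: "Pending"},
		{Name: "m3", Phase: "Failed"},
	}
	fmt.Println(checkDecrease(3, -2, machines)) // only unregistered machines go away
	fmt.Println(checkDecrease(3, -3, machines)) // would remove a registered node
}
```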
Modify TemplateNodeInfo() to return the template of ResourceSlice.
This addresses the DRA expansion of Cluster Autoscaler, allowing users to set the number of GPUs and the DRA driver name by
annotating the NodeGroup provided by cluster-api.
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
The joinStringMaps call in the buildTemplateLabels method of the clusterapi provider should not overwrite any custom labels with the generic ones returned by buildGenericLabels().
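To illustrate the intended precedence, here is a small sketch of a map merge in which later arguments win. `mergeLabels` is a hypothetical helper, not the provider's joinStringMaps, whose exact semantics are not restated here.

```go
package main

import "fmt"

// mergeLabels copies the given maps into a new map in order, so that
// later maps override earlier ones on key conflicts.
func mergeLabels(maps ...map[string]string) map[string]string {
	out := make(map[string]string)
	for _, m := range maps {
		for k, v := range m {
			out[k] = v
		}
	}
	return out
}

func main() {
	generic := map[string]string{
		"kubernetes.io/arch": "amd64",
		"kubernetes.io/os":   "linux",
	}
	custom := map[string]string{
		"kubernetes.io/arch": "arm64",
	}
	// Generic labels go first so that custom labels take precedence.
	fmt.Println(mergeLabels(generic, custom))
}
```

Passing the generic labels first and the custom labels last keeps user-provided values intact while still filling in defaults for missing keys.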
This change removes an `if` statement that was left behind after a
refactor. The test in question has the same logic embedded into a
previous conditional and the removed statement has no effect on the
tests.
This change removes the code for the `Labels` and `Taints` interface
functions of the clusterapi provider when scaling from zero. The body
of these functions was added erroneously, and the Cluster API community
is still deciding how these values will be exposed to the autoscaler.
It also updates the tests and readme to be clearer about the usage of
labels and taints when scaling from zero.
This commit is a combination of several commits. Significant details are
preserved below.
* update functions for resource annotations
This change converts some of the functions that look at annotation for
resource usage to indicate their usage in the function name. This helps
to make room for allowing the infrastructure reference as an alternate
source for the capacity information.
* migrate capacity logic into a single function
This change moves the logic to collect the instance capacity from the
TemplateNodeInfo function into a method of the
unstructuredScalableResource named InstanceCapacity. This new function
is created to house the logic that will decide between annotations and
the infrastructure reference when calculating the capacity for the node.
* add ability to look up infrastructure references
This change supplements the annotation lookups by adding the logic to
read the infrastructure reference if it exists. This is done to
determine if the machine template exposes a capacity field in its
status. For more information on how this mechanism works, please see the
cluster-api enhancement[0].
* add documentation for capi scaling from zero
* improve tests for clusterapi scale from zero
this change adds functionality to test the dynamic client behavior of
getting the infrastructure machine templates.
* update README with information about rbac changes
this adds more information about the rbac changes necessary for the
scale from zero support to work.
* remove extra check for scaling from zero
since the CanScaleFromZero function checks to see if both CPU and
memory are present, there is no need to check a second time. This also
adds some documentation to the CanScaleFromZero function to make it
clearer what is happening.
* update unit test for capi scale from zero
adding a few more cases and details to the scale from zero unit tests,
including ensuring that the int based annotations do not accept other
unit types.
[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
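A sketch of two of the checks described above, limited to the annotation path: scale from zero requires both CPU and memory, and integer annotations reject values with unit suffixes. The annotation keys and helper names below are assumptions for illustration and should be checked against the provider README.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation keys assumed for this sketch; verify them against the README.
const (
	cpuKey      = "capacity.cluster-autoscaler.kubernetes.io/cpu"
	memoryKey   = "capacity.cluster-autoscaler.kubernetes.io/memory"
	gpuCountKey = "capacity.cluster-autoscaler.kubernetes.io/gpu-count"
)

// canScaleFromZero reports whether both CPU and memory are present.
// The infrastructure reference lookup is deliberately left out.
func canScaleFromZero(annotations map[string]string) bool {
	_, hasCPU := annotations[cpuKey]
	_, hasMemory := annotations[memoryKey]
	return hasCPU && hasMemory
}

// parseIntAnnotation parses an integer annotation. It returns found=false
// when the key is absent and an error when the value is not a plain
// base-10 integer (for example "1Gi").
func parseIntAnnotation(annotations map[string]string, key string) (value int64, found bool, err error) {
	raw, ok := annotations[key]
	if !ok {
		return 0, false, nil
	}
	value, err = strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, true, fmt.Errorf("annotation %q: %w", key, err)
	}
	return value, true, nil
}

func main() {
	annotations := map[string]string{cpuKey: "2", memoryKey: "8G", gpuCountKey: "1Gi"}
	fmt.Println(canScaleFromZero(annotations))
	_, _, err := parseIntAnnotation(annotations, gpuCountKey)
	fmt.Println(err != nil)
}
```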
This allows a Machine{Set,Deployment} to scale up/down from 0,
provided the following annotations are set:
```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```
Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.
For autoscaling from zero, the autoscaler should convert the mem value
received in the appropriate annotation to bytes using powers of two
consistently with other providers and fail if the format received is not
expected. This gives robust behaviour consistent with cloud providers APIs
and providers implementations.
https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366
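The memory conversion can be sketched as follows, assuming the annotation carries a plain integer count of mebibytes. `memoryMbToBytes` is a hypothetical helper; the accepted input format in the provider may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// bytesPerMiB uses a power of two, consistent with the binary prefixes
// referenced above.
const bytesPerMiB = 1024 * 1024

// memoryMbToBytes converts a memory annotation value to bytes.
// Assumption: the value is a non-negative base-10 integer; anything
// else is rejected instead of being guessed at.
func memoryMbToBytes(value string) (int64, error) {
	mb, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory value %q: %w", value, err)
	}
	if mb < 0 {
		return 0, fmt.Errorf("invalid memory value %q: must not be negative", value)
	}
	return mb * bytesPerMiB, nil
}

func main() {
	for _, v := range []string{"8192", "8G", "-1"} {
		bytes, err := memoryMbToBytes(v)
		fmt.Println(v, bytes, err)
	}
}
```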
Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Michael McCune <elmiko@redhat.com>
Because the autoscaler assumes it can delete nodes in parallel, it
fetches nodegroups for each node in separate goroutines and then
instructs each nodegroup to delete a single node.
Because we don't share the nodegroup across goroutines, the cached
replica count in the scalableresource can become stale and as such, if
the autoscaler attempts to scale down multiple nodes at a time, the
cluster api provider only actually removes a single node.
To prevent this, we must ensure we have a fresh replica count for every
scale down attempt.
When calling Replicas(), the local struct in the scalable resource might be stale. To mitigate possible side effects, we always want to fetch a fresh replica count.
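The stale-count problem can be reproduced in a few lines. This is a sequential sketch with hypothetical types (`backend` stands in for the API server); it ignores real concurrency and only shows why a cached value loses a deletion.

```go
package main

import "fmt"

// backend stands in for the authoritative replica count on the API server.
type backend struct {
	replicas int
}

// scalableResource is a hypothetical wrapper that caches the replica
// count at construction time, as one node group copy per node would.
type scalableResource struct {
	api            *backend
	cachedReplicas int
}

func newScalableResource(api *backend) *scalableResource {
	return &scalableResource{api: api, cachedReplicas: api.replicas}
}

// scaleDownStale decrements based on the cached value.
func (r *scalableResource) scaleDownStale() {
	r.api.replicas = r.cachedReplicas - 1
}

// scaleDownFresh re-reads the current value before decrementing.
func (r *scalableResource) scaleDownFresh() {
	r.api.replicas = r.api.replicas - 1
}

func main() {
	api := &backend{replicas: 3}
	a, b := newScalableResource(api), newScalableResource(api)
	a.scaleDownStale()
	b.scaleDownStale()
	fmt.Println("stale:", api.replicas) // both copies wrote the same value

	api.replicas = 3
	a, b = newScalableResource(api), newScalableResource(api)
	a.scaleDownFresh()
	b.scaleDownFresh()
	fmt.Println("fresh:", api.replicas)
}
```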
This is one in a series of PRs to mitigate kubernetes#3104
We index on providerID but it turns out that those values on node and
machine are not always consistent. Some encode region, some do not,
for example.
This commit normalizes all values through the normalizedProviderString().
To ensure that we catch all places I've introduced a new type and made
the find() functions take this new type in lieu of a string. Unit
tests have also been adjusted to introduce a 'test:///' prefix on the
providerID value to further validate the change.
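A sketch of the dedicated type, for illustration. The names `normalizedProviderID` and `machineIndex` are hypothetical, and the normalization rule shown (keep the last path segment) is an assumption rather than a statement of what normalizedProviderString() does.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizedProviderID is a distinct type, so a plain string variable
// cannot be passed where a normalized value is required.
type normalizedProviderID string

// normalizedProviderString strips everything up to the last "/".
// Assumption: the last path segment identifies the instance.
func normalizedProviderString(s string) normalizedProviderID {
	parts := strings.Split(s, "/")
	return normalizedProviderID(parts[len(parts)-1])
}

// machineIndex maps normalized provider IDs to machine names.
type machineIndex map[normalizedProviderID]string

// find only accepts the normalized type, which forces callers to
// normalize before looking anything up.
func (idx machineIndex) find(id normalizedProviderID) (string, bool) {
	name, ok := idx[id]
	return name, ok
}

func main() {
	idx := machineIndex{normalizedProviderString("aws:///us-east-1a/i-0abc"): "machine-1"}
	// The node reports the same instance without the zone segment.
	name, ok := idx.find(normalizedProviderString("aws:///i-0abc"))
	fmt.Println(name, ok)
}
```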
This change allows CAPI to work out-of-the-box, assuming v1alpha2.
It's also reasonable to assert that this consistency should be
enforced elsewhere and to make this behaviour easily revertable I'm
leaving this as a separate commit in this patch series.
The autoscaler expects provider nodeGroup implementations to implement the Nodes() function so that it returns all instances belonging to the group, regardless of whether they have become Kubernetes nodes.
This information is then used, for instance, to detect unregistered nodes: bf3a9fb52e/cluster-autoscaler/clusterstate/clusterstate.go (L307-L311)
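As a rough illustration of why Nodes() must include machines without a node, the sketch below derives the unregistered set by comparing instances against registered nodes. The `instance` type and both helpers are hypothetical and much simpler than the cluster state logic referenced above.

```go
package main

import "fmt"

// instance is a simplified, hypothetical view of a machine in a node group.
type instance struct {
	ProviderID string
	NodeName   string // empty until a Kubernetes node exists
}

// nodes returns the ID of every instance in the group, registered or not.
func nodes(group []instance) []string {
	ids := make([]string, 0, len(group))
	for _, i := range group {
		ids = append(ids, i.ProviderID)
	}
	return ids
}

// unregistered returns the instance IDs with no matching registered node.
func unregistered(instanceIDs, registeredNodeIDs []string) []string {
	seen := make(map[string]bool, len(registeredNodeIDs))
	for _, id := range registeredNodeIDs {
		seen[id] = true
	}
	var missing []string
	for _, id := range instanceIDs {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	group := []instance{
		{ProviderID: "test:///a", NodeName: "node-a"},
		{ProviderID: "test:///b"}, // provisioned, but no node yet
	}
	fmt.Println(unregistered(nodes(group), []string{"test:///a"}))
}
```

If nodes() skipped the second machine, the comparison would have nothing to flag and the instance would never be reported as unregistered.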