Commit Graph

36 Commits

Author SHA1 Message Date
aleskandro 16cfe50486
[Tests] Update cluster-api provider to use machineTemplate.status.nodeInfo for architecture-aware autoscale from zero
kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.

Previously, we implemented this behavior in the cluster autoscaler by leveraging the labels capacity annotation and, as a fallback, default values set in environment variables at cluster-autoscaler deployment time.

With this commit, the cluster autoscaler computes the future architecture of a node with the following priority order:

- Labels set in existing nodes for not-autoscale-from-zero cases
- Labels set in the labels capacity annotation of the machine template, machine set, and machine deployment
- Values in the status.nodeInfo of MachineTemplates
- Generic/default labels set in the environment of the cluster autoscaler
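
The priority order above can be sketched as a chain of fallbacks. This is a minimal illustration, not the provider's code: the `archSources` struct, its field names, and `resolveArch` are hypothetical, and only the `kubernetes.io/arch` label is a real Kubernetes name.

```go
package main

import "fmt"

// archLabel is the well-known Kubernetes node label for CPU architecture.
const archLabel = "kubernetes.io/arch"

// archSources gathers the four places an architecture can come from.
// The struct and its field names are hypothetical.
type archSources struct {
	existingNodeLabels map[string]string // labels of a live node; nil when scaling from zero
	capacityAnnotation map[string]string // labels parsed from the labels capacity annotation
	templateNodeInfo   string            // architecture reported in the MachineTemplate status
	envDefault         string            // default configured on the autoscaler deployment
}

// resolveArch returns the first non-empty architecture in priority order.
func resolveArch(s archSources) string {
	// Reading from a nil map is safe in Go and yields the empty string.
	if arch := s.existingNodeLabels[archLabel]; arch != "" {
		return arch
	}
	if arch := s.capacityAnnotation[archLabel]; arch != "" {
		return arch
	}
	if s.templateNodeInfo != "" {
		return s.templateNodeInfo
	}
	return s.envDefault
}

func main() {
	// Scale from zero: no live node, no annotation, so the template status wins.
	fromZero := archSources{templateNodeInfo: "arm64", envDefault: "amd64"}
	fmt.Println(resolveArch(fromZero))

	// An annotation, when present, takes precedence over the template status.
	annotated := fromZero
	annotated.capacityAnnotation = map[string]string{archLabel: "s390x"}
	fmt.Println(resolveArch(annotated))
}
```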
2025-09-05 06:58:57 +01:00
Jun Wang 21ca04ae26 Replace capi v1alpha3 with v1beta2 in test cases 2025-08-08 15:16:41 +08:00
Jun Wang f4c2fdebeb Add process with apiGroup in capi provider 2025-08-08 15:16:26 +08:00
Kubernetes Prow Robot 1de2160986
Merge pull request #7908 from Preisschild/fix/capi-patch-instead-update
CA: Use Patch to Scale clusterapi nodepools
2025-04-03 07:16:48 -07:00
Florian Ströger ecb572a945 Use Patch to Scale clusterapi nodepools to avoid modification conflicts
Issue: https://github.com/kubernetes/autoscaler/issues/7872
Signed-off-by: Florian Ströger <stroeger@youniqx.com>
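
A full update sends the whole object back together with its resourceVersion, so it is rejected with a conflict when another controller has modified the object in the meantime. A patch carries only the fields being changed. The sketch below builds such a patch body for the replica count; the helper name `replicasMergePatch` is hypothetical, and the call that would submit it to the API server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replicasMergePatch builds a JSON merge patch that touches only
// spec.replicas. The helper name is hypothetical.
func replicasMergePatch(replicas int32) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{"replicas": replicas},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := replicasMergePatch(3)
	if err != nil {
		panic(err)
	}
	// The body contains no metadata.resourceVersion, so there is
	// nothing in it that can go stale between read and write.
	fmt.Println(string(body))
}
```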
2025-04-01 08:26:45 +02:00
elmiko 5e1fc195a3 refactor findScalableResourceProviderIDs in clusterapi
this change refactors the function so that each distinct machine
state can be filtered more easily. the unit tests have been
supplemented, but not changed, to ensure that the functionality continues
to work as expected. these changes are to help better detect edge cases
where machines can be transitioning through the pending phase and might be
removed by the autoscaler.
2025-03-26 12:41:09 -04:00
Jack Francis 7b5e10156e s/nodeHasValidProviderID/isProviderIDNormalized
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2025-03-19 12:30:33 -07:00
Jack Francis 4aa465764c capi: node and provider ID accounting funcs
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2025-03-19 11:40:19 -07:00
elmiko 71d3595cb7 improve failed machine detection in clusterapi
This change makes it so that when a failed machine is found during
`findScalableResourceProviderIDs`, it will always gain a normalized
provider ID with the failure guard prepended. This is to ensure that
machines which have gained a provider ID from the infrastructure and
then later go into a failed state can be properly removed by the
autoscaler when it wants to correct the size of a node group.
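
As a rough illustration of the guard, the sketch below prepends a marker to the ID of any failed machine, whether or not the infrastructure already assigned a provider ID. The prefix value, the `machine` struct, and the fallback to the machine name are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// failedGuard is a hypothetical prefix; the provider's constant may differ.
const failedGuard = "failed-machine-"

// machine is a simplified, hypothetical stand-in for a Cluster API Machine.
type machine struct {
	name       string
	providerID string // normalized provider ID; empty if never assigned
	failed     bool
}

// guardedProviderID returns the ID under which the machine is tracked.
// Failed machines always get the guard, even if they had a provider ID.
func guardedProviderID(m machine) string {
	if !m.failed {
		return m.providerID
	}
	id := m.providerID
	if id == "" {
		id = m.name // assumption: fall back to the machine name
	}
	return failedGuard + id
}

// isFailedID reports whether an ID carries the failure guard.
func isFailedID(id string) bool {
	return strings.HasPrefix(id, failedGuard)
}

func main() {
	healthy := machine{name: "m-0", providerID: "i-0abc"}
	broken := machine{name: "m-1", providerID: "i-0def", failed: true}
	for _, m := range []machine{healthy, broken} {
		id := guardedProviderID(m)
		fmt.Println(id, isFailedID(id))
	}
}
```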
2025-03-19 12:34:29 -04:00
elmiko 003e6cd67c make DecreaseTargetSize more accurate for clusterapi
this change ensures that when DecreaseTargetSize counts the nodes,
it does not include any instances which are considered to be
pending (i.e. not having a node ref), deleting, or failed. this change will
allow the core autoscaler to then decrease the size of the node group
accordingly, instead of raising an error.
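
A minimal sketch of the counting rule, assuming a simplified `machineState` enum and helper names that are not taken from the provider: only machines with a node ref count as existing nodes, so a decrease that only drops pending, deleting, or failed instances is accepted.

```go
package main

import (
	"errors"
	"fmt"
)

// machineState is a hypothetical simplification of a machine's lifecycle.
type machineState int

const (
	stateRunning  machineState = iota // has a node ref
	statePending                      // no node ref yet
	stateDeleting
	stateFailed
)

var errWouldDeleteNodes = errors.New("attempt to delete existing nodes")

// countRegisteredNodes ignores pending, deleting, and failed machines.
func countRegisteredNodes(states []machineState) int {
	n := 0
	for _, s := range states {
		if s == stateRunning {
			n++
		}
	}
	return n
}

// decreaseTargetSize applies a negative delta and refuses to go below
// the number of registered nodes.
func decreaseTargetSize(targetSize, delta int, states []machineState) (int, error) {
	if delta >= 0 {
		return targetSize, errors.New("size decrease must be negative")
	}
	newSize := targetSize + delta
	if newSize < countRegisteredNodes(states) {
		return targetSize, errWouldDeleteNodes
	}
	return newSize, nil
}

func main() {
	states := []machineState{stateRunning, stateRunning, statePending, stateFailed}
	// Counting all four machines would reject this decrease; counting
	// only the two registered nodes lets it through.
	fmt.Println(decreaseTargetSize(4, -2, states))
}
```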

This change also adds some code to the unit tests to make detection of
this condition easier.
2025-03-17 19:34:07 -04:00
Michael Weibel 98f948969a
fix(clusterapi): HasInstance/FindMachine with namespace prefix 2024-07-01 13:37:56 +02:00
enxebre 8cfe11c80d Add benchmark for capi nodeGroup 2024-05-09 16:42:06 +02:00
enxebre 31fdc397fd Avoid expensive pointer copy in capi nodegroup 2024-05-07 15:47:23 +02:00
Max Fedotov 6c65baa1c6 [clusterapi] Update tests for nodegroups with minSize=maxSize 2024-03-13 18:38:12 +01:00
Matt Boersma 17d2bd968e
[clusterapi] Add support for MachinePools
squash
2023-04-12 09:22:49 -06:00
Matt Boersma 6040d29252
[cluster-api] Handle ignored errors 2023-02-28 13:56:29 -07:00
Cameron McAvoy 5713408a4c clusterapi: track upcoming unprovisioned machines with a temporary providerID to enable detection of exhausted nodegroups 2023-01-12 15:16:32 -06:00
Michael McCune 5c9cc27f75 cleanup unused constants in clusterapi provider
this change removes some unused values and adjusts the names in the unit
tests to better reflect usage.
2022-09-29 14:22:05 -04:00
Eng Zer Jun 66805969de
test: use `T.Setenv` to set env vars in tests
This commit replaces `os.Setenv` with `t.Setenv` in tests. The
environment variable is automatically restored to its original value
when the test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.Setenv
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-08-18 21:28:18 +08:00
Michael McCune f02c9972eb add more caching to clusterapi provider
this change adds logic to create informers for the infrastructure
machine templates that are discovered during the scale from zero checks.
it also adds tests and a slight change to the controller structure to
account for the dynamic informer creation.
2022-08-17 16:25:16 -04:00
Michael McCune 1a65fde540 cleanup clusterapi scale from zero implementation
This commit is a combination of several commits. Significant details are
preserved below.

* update functions for resource annotations
  This change converts some of the functions that look at annotations for
  resource usage to indicate their usage in the function name. This helps
  to make room for allowing the infrastructure reference as an alternate
  source for the capacity information.

* migrate capacity logic into a single function
  This change moves the logic to collect the instance capacity from the
  TemplateNodeInfo function into a method of the
  unstructuredScalableResource named InstanceCapacity. This new function
  is created to house the logic that will decide between annotations and
  the infrastructure reference when calculating the capacity for the node.

* add ability to lookup infrastructure references
  This change supplements the annotation lookups by adding the logic to
  read the infrastructure reference if it exists. This is done to
  determine if the machine template exposes a capacity field in its
  status. For more information on how this mechanism works, please see the
  cluster-api enhancement[0].

* add documentation for capi scaling from zero

* improve tests for clusterapi scale from zero
  this change adds functionality to test the dynamic client behavior of
  getting the infrastructure machine templates.

* update README with information about rbac changes
  this adds more information about the rbac changes necessary for the
  scale from zero support to work.

* remove extra check for scaling from zero
  since the CanScaleFromZero function checks to see if both CPU and
  memory are present, there is no need to check a second time. This also
  adds some documentation to the CanScaleFromZero function to make it
  clearer what is happening.

* update unit test for capi scale from zero
  adding a few more cases and details to the scale from zero unit tests,
  including ensuring that the int based annotations do not accept other
  unit types.
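
The capacity lookup and the scale-from-zero check described in the list above might be sketched as follows. The function names echo the commit message, but the bodies are illustrative. In particular, letting annotations override values from the infrastructure template is an assumption made here; the commit message only says the function decides between the two sources.

```go
package main

import "fmt"

// instanceCapacity merges capacity from the infrastructure template
// status with capacity from annotations. Assumption: annotation values
// override template values for the same resource name.
func instanceCapacity(infraStatus, annotations map[string]string) map[string]string {
	merged := make(map[string]string, len(infraStatus)+len(annotations))
	for name, quantity := range infraStatus {
		merged[name] = quantity
	}
	for name, quantity := range annotations {
		merged[name] = quantity
	}
	return merged
}

// canScaleFromZero requires both CPU and memory to be known, which is
// why a second check elsewhere is unnecessary.
func canScaleFromZero(capacity map[string]string) bool {
	_, hasCPU := capacity["cpu"]
	_, hasMemory := capacity["memory"]
	return hasCPU && hasMemory
}

func main() {
	infra := map[string]string{"cpu": "2", "memory": "8Gi"}
	annotations := map[string]string{"memory": "16Gi"}

	capacity := instanceCapacity(infra, annotations)
	fmt.Println(capacity["cpu"], capacity["memory"], canScaleFromZero(capacity))

	// CPU alone is not enough to build a node template.
	fmt.Println(canScaleFromZero(map[string]string{"cpu": "2"}))
}
```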

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
2022-07-22 20:21:32 -04:00
Michael McCune 1d5e0f155a add user configurable cluster api version
This change introduces an environment variable, `CAPI_VERSION`, through
which a user can set the API version for the group they are using. This
change is being added to address situations where a user might have
multiple API versions for the cluster api group and wishes to be
explicit about which version is selected.

Also adds unit tests and documentation for the new behavior. This change
does not break the existing behavior.
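
One way to read the behavior: an explicit `CAPI_VERSION` wins, and otherwise the previously discovered version is kept. The sketch below assumes that reading; `resolveVersion` and its injectable `getenv` parameter are hypothetical and exist only to keep the example testable.

```go
package main

import (
	"fmt"
	"os"
)

// versionEnvVar is the variable named in the commit message.
const versionEnvVar = "CAPI_VERSION"

// resolveVersion prefers the user-supplied version and falls back to
// the discovered one, so behavior is unchanged when the variable is unset.
func resolveVersion(getenv func(string) string, discovered string) string {
	if v := getenv(versionEnvVar); v != "" {
		return v
	}
	return discovered
}

func main() {
	// "v1beta1" stands in for whatever API discovery returned.
	fmt.Println(resolveVersion(os.Getenv, "v1beta1"))
}
```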
2022-02-25 09:46:34 -05:00
Clinton Yeboah ecfaa6d700 removes deprecated CAPI annotations 2021-11-11 18:56:53 -05:00
Bartłomiej Wróblewski 4550bfe300 Register resources for fake dynamic client in tests 2020-11-30 10:50:27 +00:00
Michael McCune d8d064f6bc refactor CAPI controller unit test to use PollImmediate
This change removes the `PollImmediateInfinite` calls in the cluster-api
controller unit tests in favor of `PollImmediate`. It is being proposed
to prevent an edge case where the polling calls would become blocked
indefinitely. As we are using fake clients within the unit tests there
should be no delay getting a return value, but just in case there is a
miss on the poll function the new `PollImmediate` will time out after 15
seconds.
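
To show the difference a bounded poll makes, here is a standard-library stand-in with the same general shape: check the condition immediately, then retry at an interval until a timeout expires. It is not the Kubernetes `wait` package implementation, and the short demo durations are chosen only so the example finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Short durations for the demo; the commit uses a 15 second timeout.
const (
	demoInterval = 5 * time.Millisecond
	demoTimeout  = 50 * time.Millisecond
)

var errTimeout = errors.New("timed out waiting for the condition")

// pollImmediate runs cond right away and then every interval until it
// reports done, returns an error, or the timeout would be exceeded.
func pollImmediate(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Give up rather than sleep past the deadline.
		if !time.Now().Add(interval).Before(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	err := pollImmediate(demoInterval, demoTimeout, func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	fmt.Println(err, calls)

	// A condition that never becomes true ends with a timeout instead
	// of blocking the test forever.
	err = pollImmediate(demoInterval, demoTimeout, func() (bool, error) {
		return false, nil
	})
	fmt.Println(errors.Is(err, errTimeout))
}
```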
2020-11-04 16:24:17 -05:00
Jason DeTiberus 06e5f6a0ed
Update group identifier to use for Cluster API annotations
- Also add backwards compatibility for the previously used deprecated annotations
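
Backwards compatibility of this kind usually means reading the new key first and falling back to the old one. The sketch below shows that lookup; the two annotation keys are written from memory as an assumption and should be checked against the provider README, and `lookupAnnotation` is a hypothetical helper.

```go
package main

import "fmt"

// Assumed annotation keys: the new group first, the deprecated group second.
const (
	minSizeKey           = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"
	deprecatedMinSizeKey = "cluster.k8s.io/cluster-api-autoscaler-node-group-min-size"
)

// lookupAnnotation returns the value under key, or under deprecatedKey
// when key is absent.
func lookupAnnotation(annotations map[string]string, key, deprecatedKey string) (string, bool) {
	if v, ok := annotations[key]; ok {
		return v, true
	}
	v, ok := annotations[deprecatedKey]
	return v, ok
}

func main() {
	legacy := map[string]string{deprecatedMinSizeKey: "1"}
	fmt.Println(lookupAnnotation(legacy, minSizeKey, deprecatedMinSizeKey))

	both := map[string]string{minSizeKey: "2", deprecatedMinSizeKey: "1"}
	fmt.Println(lookupAnnotation(both, minSizeKey, deprecatedMinSizeKey))
}
```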
2020-09-21 10:42:46 -04:00
Jason DeTiberus 75b850718f
Add node autodiscovery to cluster-autoscaler clusterapi provider 2020-08-20 16:08:49 -04:00
Jason DeTiberus 63f9e40d82
Improve Cluster API tests to work better with constrained resources 2020-08-19 13:31:32 -04:00
Jason DeTiberus 18d44fc532
Convert clusterapi provider to use unstructured
Remove internal types for Cluster API and replace with unstructured access
2020-07-21 15:49:03 -04:00
Joel Speed 5e0126ada5
Do not normalize Node IDs outside of CAPI provider 2020-04-16 10:32:27 +01:00
Joel Speed d23d3a1dd5
Add testing for fake provider IDs 2020-04-02 15:24:57 +01:00
Joel Speed 8283e80da7
Provide fake provider IDs for failed machines 2020-04-02 15:24:15 +01:00
Enxebre 1a16bbf4a9 Let the controller move on if machineDeployments are not available
There might be ad hoc environments where machineDeployments are not necessarily available. This lets the controller remain functional in such scenarios.
2020-03-20 15:44:54 +01:00
Michael McCune 7082cfee81 Add the ability to override CAPI group via env variable and discover API version.
This change adds detection for an environment variable to specify the group for the clusterapi resources. If the environment variable `CAPI_GROUP` is specified, it will be used instead of the default.
This also decouples the API group from the version and lets the latter be discovered dynamically.
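
A small sketch of the decoupling, under stated assumptions: the group comes from `CAPI_GROUP` or a default, and the version is then looked up in whatever API discovery reported for that group. The default group value, the `apiGroup` struct, and both helper names are illustrative and not taken from the provider.

```go
package main

import (
	"fmt"
	"os"
)

const (
	groupEnvVar  = "CAPI_GROUP"
	defaultGroup = "cluster.x-k8s.io" // assumed default for the example
)

// apiGroup is a hypothetical, flattened view of a discovery result.
type apiGroup struct {
	name             string
	preferredVersion string
}

// resolveGroup returns the override when set, else the default.
func resolveGroup(getenv func(string) string) string {
	if g := getenv(groupEnvVar); g != "" {
		return g
	}
	return defaultGroup
}

// discoverVersion finds the preferred version the server advertises
// for the given group.
func discoverVersion(groups []apiGroup, group string) (string, bool) {
	for _, g := range groups {
		if g.name == group {
			return g.preferredVersion, true
		}
	}
	return "", false
}

func main() {
	discovered := []apiGroup{
		{name: "apps", preferredVersion: "v1"},
		{name: "cluster.x-k8s.io", preferredVersion: "v1alpha3"},
	}
	group := resolveGroup(os.Getenv)
	version, ok := discoverVersion(discovered, group)
	fmt.Println(group, version, ok)
}
```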
2020-03-16 14:58:54 +01:00
Andrew McDermott d9e3197daa Normalize providerID values
We index on providerID but it turns out that those values on node and
machine are not always consistent. Some encode region, some do not,
for example.

This commit normalizes all values through the normalizedProviderString().

To ensure that we catch all places I've introduced a new type and made
the find() functions take this new type in lieu of a string. Unit
tests have also been adjusted to introduce a 'test:///' prefix on the
providerID value to further validate the change.
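
The idea of a dedicated type can be sketched as below. The names `normalizedProviderID` and `normalizedProviderString` follow the commit message, but the normalization rule shown (keep the last path segment) and the `machineIndex` type are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizedProviderID marks a provider ID that has been normalized.
type normalizedProviderID string

// normalizedProviderString keeps only the last path segment, so IDs
// that differ in scheme or region compare equal. The rule is an
// assumption for this sketch.
func normalizedProviderString(s string) normalizedProviderID {
	parts := strings.Split(s, "/")
	return normalizedProviderID(parts[len(parts)-1])
}

// machineIndex maps normalized provider IDs to machine names.
type machineIndex map[normalizedProviderID]string

// find takes the dedicated type, so passing a plain string variable
// does not compile and every caller is forced to normalize first.
func (idx machineIndex) find(id normalizedProviderID) (string, bool) {
	name, ok := idx[id]
	return name, ok
}

func main() {
	idx := machineIndex{
		normalizedProviderString("aws:///us-east-1a/i-0abc"): "machine-0",
	}
	// The node reports the same instance without the region segment.
	nodeID := "test:///i-0abc"
	fmt.Println(idx.find(normalizedProviderString(nodeID)))
}
```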

This change allows CAPI to work out-of-the-box, assuming v1alpha2.

It's also reasonable to assert that this consistency should be
enforced elsewhere and to make this behaviour easily revertable I'm
leaving this as a separate commit in this patch series.
2020-03-10 10:59:05 +00:00
Andrew McDermott 46bb9b4f29 cloudprovider/clusterapi: new provider
This adds a new cloudprovider based on the cluster-api project:

  https://github.com/kubernetes-sigs/cluster-api
2020-03-10 10:59:04 +00:00