83 Commits

Author SHA1 Message Date
michael mccune 1e1615ad63 add an extra note to clusterapi readme about gpus
this change adds a little more detail to ensure that users understand
how to use the GPU label feature.
2023-01-18 17:16:09 -05:00
michael mccune 6b80a7134a add a note to clusterapi readme about ignored labels
this change adds a section to the readme that provides advice for
clusterapi users about which labels they might want to ignore when using
the balance similar node groups flag on various cloud providers.
2023-01-12 09:59:25 -05:00
michael mccune 8ca3afc35b update clusterapi readme with table of contents
this change will make navigating the readme easier for users.
2023-01-12 09:59:25 -05:00
Kubernetes Prow Robot ba3b244720
Merge pull request #5054 from fookenc/fix-autoscaler-node-deletion
Identifying cloud provider deleted nodes
2022-12-16 05:54:17 -08:00
Kubernetes Prow Robot af23e6187e
Merge pull request #5276 from pacoxu/master
Stop applying the beta.kubernetes.io/os and arch
2022-12-16 03:10:17 -08:00
Nick Jones 684184c94a
Add note regarding GPU label for the CAPI provider
cluster-autoscaler takes into consideration the time that a node takes
to initialise a GPU resource on a node, as long as a particular label is
in place.  This label differs from provider to provider, and is
documented in some cases but not for CAPI.

This commit adds a note with the specific label that should be applied
when a node is instantiated.
2022-11-25 12:02:29 +00:00
Clint Fooken 08dfc7e20f Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance. 2022-11-04 17:54:05 -07:00
Paco Xu 8dec2025f8 Stop applying the beta.kubernetes.io/os and arch 2022-10-27 12:20:04 +08:00
Clint Fooken ea7059f4c6 Adjusting initial implementation of NodeExists to be consistent among cloud providers to return true and ErrNotImplemented. 2022-10-17 18:39:19 -07:00
Clint cf67a3004e
Implementing new cloud provider method for node deletion detection (#1)
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.
2022-10-17 14:58:38 -07:00
Michael McCune bb015b26a1 remove unsupported functionality from cluster-api provider
this change removes the code for the `Labels` and `Taints` interface
functions of the clusterapi provider when scaling from zero. The body
of these functions was added erroneously and the Cluster API community
is still deciding on how these values will be exposed to the autoscaler.

also updates the tests and readme to be more clear about the usage of
labels and taints when scaling from zero.
2022-10-14 14:06:57 -04:00
Michael McCune 5c9cc27f75 cleanup unused constants in clusterapi provider
this change removes some unused values and adjusts the names in the unit
tests to better reflect usage.
2022-09-29 14:22:05 -04:00
Kubernetes Prow Robot 500652b6e1
Merge pull request #5123 from elmiko/update-capi-docs
update clusterapi readme
2022-08-26 06:48:25 -07:00
Michael McCune e089d14692 update clusterapi readme
to be more accurate about scale from zero support.
2022-08-24 12:52:57 -04:00
Eng Zer Jun 66805969de
test: use `T.Setenv` to set env vars in tests
This commit replaces `os.Setenv` with `t.Setenv` in tests. The
environment variable is automatically restored to its original value
when the test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.Setenv
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-08-18 21:28:18 +08:00
Kubernetes Prow Robot e478ee2959
Merge pull request #4840 from elmiko/capi-scale-from-zero
clusterapi scale from zero support
2022-08-18 02:38:36 -07:00
Michael McCune f02c9972eb add more caching to clusterapi provider
this change adds logic to create informers for the infrastructure
machine templates that are discovered during the scale from zero checks.
it also adds tests and a slight change to the controller structure to
account for the dynamic informer creation.
2022-08-17 16:25:16 -04:00
killianmuldoon b24075c9bb Add ClusterClass usage instructions to ClusterClass docs
Signed-off-by: killianmuldoon <kmuldoon@vmware.com>
2022-07-27 15:34:37 +01:00
Michael McCune 1a65fde540 cleanup clusterapi scale from zero implementation
This commit is a combination of several commits. Significant details are
preserved below.

* update functions for resource annotations
  This change converts some of the functions that look at annotations for
  resource usage to indicate their usage in the function name. This helps
  to make room for allowing the infrastructure reference as an alternate
  source for the capacity information.

* migrate capacity logic into a single function
  This change moves the logic to collect the instance capacity from the
  TemplateNodeInfo function into a method of the
  unstructuredScalableResource named InstanceCapacity. This new function
  is created to house the logic that will decide between annotations and
  the infrastructure reference when calculating the capacity for the node.

* add ability to lookup infrastructure references
  This change supplements the annotation lookups by adding the logic to
  read the infrastructure reference if it exists. This is done to
  determine if the machine template exposes a capacity field in its
  status. For more information on how this mechanism works, please see the
  cluster-api enhancement[0].

* add documentation for capi scaling from zero

* improve tests for clusterapi scale from zero
  this change adds functionality to test the dynamic client behavior of
  getting the infrastructure machine templates.

* update README with information about rbac changes
  this adds more information about the rbac changes necessary for the
  scale from zero support to work.

* remove extra check for scaling from zero
  since the CanScaleFromZero function checks to see if both CPU and
  memory are present, there is no need to check a second time. This also
  adds some documentation to the CanScaleFromZero function to make it
  clearer what is happening.

* update unit test for capi scale from zero
  adding a few more cases and details to the scale from zero unit tests,
  including ensuring that the int based annotations do not accept other
  unit types.

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
2022-07-22 20:21:32 -04:00
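The check described in the "remove extra check for scaling from zero" item above can be sketched roughly as follows. This is a minimal illustration, not the provider's code: the lowercase function name, the plain string map, and the `cpu` and `memory` keys are assumptions made for the example.

```go
package main

import "fmt"

// canScaleFromZero reports whether a capacity map carries both resources
// needed to describe a node that does not exist yet.
// The map type and the "cpu"/"memory" key names are assumptions.
func canScaleFromZero(capacity map[string]string) bool {
	_, hasCPU := capacity["cpu"]
	_, hasMemory := capacity["memory"]
	return hasCPU && hasMemory
}

func main() {
	full := map[string]string{"cpu": "2", "memory": "8192"}
	cpuOnly := map[string]string{"cpu": "2"}

	// Because the function already requires both values, callers do not
	// need to check for CPU and memory a second time.
	fmt.Println(canScaleFromZero(full))
	fmt.Println(canScaleFromZero(cpuOnly))
}
```

Keeping the presence test in one place is what makes a second check at the call site redundant.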
Andrew McDermott de90a462c7 Implement scale from zero for clusterapi
This allows a Machine{Set,Deployment} to scale up/down from 0,
providing the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.

For autoscaling from zero, the autoscaler should convert the memory value
received in the appropriate annotation to bytes using powers of two,
consistent with other providers, and fail if the format received is not
expected. This gives robust behaviour consistent with cloud provider APIs
and provider implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by:  Enxebre <alberto.garcial@hotmail.com>
Co-authored-by:  Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by:  Michael McCune <elmiko@redhat.com>
2022-07-18 13:50:25 -04:00
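The memory conversion described in the commit above might look something like this sketch. It assumes the annotation value is a plain integer count of mebibytes, which is a guess at the expected format; the function name `memoryMbToBytes` is hypothetical. Under that assumption a suffixed value is treated as an unexpected format and rejected.

```go
package main

import (
	"fmt"
	"strconv"
)

// bytesPerMiB uses a power of two (2^20), not 10^6.
const bytesPerMiB = 1024 * 1024

// memoryMbToBytes converts an annotation value to bytes.
// Assumption: the value is a whole number of mebibytes with no unit suffix.
func memoryMbToBytes(value string) (int64, error) {
	mib, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("memory value %q is not a whole number of MiB: %w", value, err)
	}
	if mib < 0 {
		return 0, fmt.Errorf("memory value %q must not be negative", value)
	}
	return mib * bytesPerMiB, nil
}

func main() {
	for _, v := range []string{"8192", "8G"} {
		b, err := memoryMbToBytes(v)
		if err != nil {
			// Unexpected formats fail instead of being guessed at.
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(v, "MiB =", b, "bytes")
	}
}
```

Failing on an unexpected format, rather than silently falling back to a default, matches the "fail if the format received is not expected" requirement stated in the message.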
enxebre b2f1823c91 Get capi targetsize from cache
This ensured that access to replicas during scale down operations were never stale by accessing the API server https://github.com/kubernetes/autoscaler/issues/3104.
This honoured that behaviour while moving to unstructured client https://github.com/kubernetes/autoscaler/pull/3312.
This regressed that behaviour while trying to reduce the API server load https://github.com/kubernetes/autoscaler/pull/4443.
This put back the never stale replicas behaviour at the cost of loading back the API server https://github.com/kubernetes/autoscaler/pull/4634.

Currently, on e.g. a 48-minute cluster, it makes 1.4k get requests to the scale subresource.
This PR tries to both keep replicas from being stale during scale down and prevent the API server from being overloaded. To achieve that, it lets targetSize, which is called on every autoscaling cluster state loop, come from the cache.

Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
2022-07-13 20:26:44 +02:00
enxebre f2f95102cf Drop deprecated CAPI annotations 2022-05-31 10:31:43 +02:00
ivan sumak 59a153c0f5
Typo fix - test.k8s.io
Fixing typo in Specifying a Custom Resource Group section in annotation examples.
2022-05-05 11:45:03 +02:00
Michael McCune 1d5e0f155a add user configurable cluster api version
This change introduces an environment variable, `CAPI_VERSION`, through
which a user can set the API version for the group they are using. This
change is being added to address situations where a user might have
multiple API versions for the cluster api group and wishes to be
explicit about which version is selected.

Also adds unit tests and documentation for the new behavior. This change
does not break the existing behavior.
2022-02-25 09:46:34 -05:00
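A rough sketch of the override behavior described in the commit above: an explicit `CAPI_VERSION` wins, and an unset variable leaves the existing behavior alone. Only the variable name comes from the commit message; the function name `resolveAPIVersion` and the `v1beta1` placeholder for the discovered version are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// resolveAPIVersion returns the user's override when one is set and
// otherwise keeps the version that was discovered automatically.
func resolveAPIVersion(override, discovered string) string {
	if override != "" {
		return override
	}
	return discovered
}

func main() {
	// "v1beta1" stands in for whatever API discovery would report;
	// it is a placeholder, not a statement about any real cluster.
	version := resolveAPIVersion(os.Getenv("CAPI_VERSION"), "v1beta1")
	fmt.Println("using cluster api version:", version)
}
```

Passing the environment value in as an argument, instead of reading it inside the function, keeps the selection logic easy to unit test.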
Joel Speed 9f670d4ea8
Ensure ClusterAPI DeleteNodes accounts for out of band scale changes
Because the autoscaler assumes it can delete nodes in parallel, it 
fetches nodegroups for each node in separate go routines and then 
instructs each nodegroup to delete a single node.
Because we don't share the nodegroup across go routines, the cached 
replica count in the scalableresource can become stale and as such, if 
the autoscaler attempts to scale down multiple nodes at a time, the 
cluster api provider only actually removes a single node.

To prevent this, we must ensure we have a fresh replica count for every 
scale down attempt.
2022-01-21 16:08:00 +00:00
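The stale-count problem described in the commit above can be reproduced with a small simulation. Everything here is hypothetical (the `scaleAPI` and `nodeGroup` types and their methods are invented for illustration), and the two deletions run one after the other so the result is deterministic; the real calls happen in separate goroutines.

```go
package main

import "fmt"

// scaleAPI stands in for the replica count stored on the API server.
type scaleAPI struct{ replicas int }

// nodeGroup is a private copy of a node group with its own cached count.
type nodeGroup struct {
	api    *scaleAPI
	cached int
}

// deleteNodeStale trusts the cached count, which may be out of date.
func (g *nodeGroup) deleteNodeStale() {
	g.api.replicas = g.cached - 1
}

// deleteNodeFresh re-reads the count before every scale down attempt.
func (g *nodeGroup) deleteNodeFresh() {
	g.cached = g.api.replicas
	g.api.replicas = g.cached - 1
}

// simulate removes two nodes through two separate copies of the same
// node group and returns the replica count left on the API.
func simulate(fresh bool) int {
	api := &scaleAPI{replicas: 3}
	first := &nodeGroup{api: api, cached: 3}
	second := &nodeGroup{api: api, cached: 3}
	for _, g := range []*nodeGroup{first, second} {
		if fresh {
			g.deleteNodeFresh()
		} else {
			g.deleteNodeStale()
		}
	}
	return api.replicas
}

func main() {
	fmt.Println("stale cache leaves replicas at:", simulate(false))
	fmt.Println("fresh reads leave replicas at:", simulate(true))
}
```

With the stale cache both copies write the same value, so only one node is removed; re-reading the count before each attempt removes both.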
Naadir Jeewa ee761bdc24
Cluster API OWNERS: Remove randomvariable
Signed-off-by: Naadir Jeewa <jeewan@vmware.com>
2022-01-05 15:11:21 +00:00
Kubernetes Prow Robot 12efcce4c7
Merge pull request #4443 from codablock/fix-rate-limitting
[clusterapi] Rely on replica count found in unstructuredScalableResource
2021-12-14 10:45:30 -08:00
Kubernetes Prow Robot 732cb659cf
Merge pull request #4474 from elmiko/update-capi-readme
add configuration diagrams to clusterapi readme
2021-11-23 00:32:17 -08:00
Michael McCune 540a794d32 add configuration diagrams to clusterapi readme
This change adds ascii diagrams to help illustrate the differences
between the various authentication configurations for the clusterapi
provider. Due to the distributed nature of Cluster API and its ability
to have several Kubernetes clusters managed from a central location, the
kubeconfig authentication options for it are slightly more complex than
other providers.
2021-11-22 10:12:53 -05:00
GuyTempleton b7b5df50ca
CA - Update gofmt of CAPI_nodegroup.go 2021-11-14 19:41:31 +00:00
Clinton Yeboah ecfaa6d700 removes deprecated CAPI annotations 2021-11-11 18:56:53 -05:00
Michael McCune 755cb1b7b6 expand CAPI_GROUP usage to cover other capi group variables
This change updates the logic for the clusterapi autoscaler provider so
that the `CAPI_GROUP` environment variable will also affect the
annotations keys for minimum and maximum node group size, the machine
annotation, machine deletion, and the cluster name label. It also adds
unit tests and an update to the readme.
2021-11-09 16:22:36 -05:00
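As an illustration of the commit above, the annotation keys for minimum and maximum node group size could be derived from the group like this. The key suffixes match the annotation names shown elsewhere in this log; the helper names, the default group value, and the simple prefix-plus-suffix construction are assumptions, not the provider's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// defaultCAPIGroup is assumed to be the upstream Cluster API group name.
const defaultCAPIGroup = "cluster.x-k8s.io"

// capiGroup returns the configured group, falling back to the default
// when the CAPI_GROUP value is empty.
func capiGroup(envValue string) string {
	if envValue != "" {
		return envValue
	}
	return defaultCAPIGroup
}

// sizeAnnotationKeys builds the min and max size annotation keys for a group.
func sizeAnnotationKeys(group string) (minKey, maxKey string) {
	return group + "/cluster-api-autoscaler-node-group-min-size",
		group + "/cluster-api-autoscaler-node-group-max-size"
}

func main() {
	group := capiGroup(os.Getenv("CAPI_GROUP"))
	minKey, maxKey := sizeAnnotationKeys(group)
	fmt.Println(minKey)
	fmt.Println(maxKey)
}
```

Deriving every key from one group value means a single environment variable changes all of them together.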
Alexander Block 897c208ed1 Fix tests 2021-11-04 14:40:10 +01:00
Alexander Block 8b21473fc7 [clusterapi] Rely on replica count found in unstructuredScalableResource
Instead of retrieving it each time from k8s, which easily causes client-side
throttling, which in turn causes each autoscaler run to take multiple
seconds even if only a small number of NodeGroups is involved and there is
nothing to do.
2021-11-04 11:09:27 +01:00
Kubernetes Prow Robot 924b723646
Merge pull request #4273 from dkoshkin/patch-1
fix: add missing RBAC permissions to example spec
2021-09-06 03:54:29 -07:00
GuyTempleton 17e028bd9e
CA - Cloud Provider Examples - add ability to list/watch/get namespaces
As of the 1.22 release of k8s, the scheduler now requires the ability to list namespaces
2021-08-23 15:39:38 +01:00
Dimitri Koshkin 7105eb2189
fix: add missing RBAC permissions to example spec
Similar change was done in https://github.com/kubernetes/autoscaler/pull/4154
2021-08-17 10:40:13 -07:00
Michael McCune 0499b886d4 update cluster-autoscaler CAPI provider owners
This change is adding github users arunmk, mrajashree, jackfrancis,
shysank, and randomvariable to the reviewers for the cluster-api
provider. It also removes frobware and ncdc from the approvers and
reviewers.
2021-07-15 14:36:19 -04:00
shysank 8b20473e82 fix capi example and update readme 2021-04-16 21:21:59 +05:30
Jack Francis d9531d3e81 cloudprovider: ClusterAPIProviderName spelling 2021-04-14 15:21:00 -07:00
shysank 7ac44990f5 update readme and example to limit capi rbac to a single namespace 2021-04-14 02:34:54 +05:30
shysank 68ce0643bd management cluster informer should watch only the namespace configured in auto discovery 2021-04-14 02:27:20 +05:30
jichenjc 411eff43d9 bump clusterapi sample suggested version 2021-01-29 04:24:40 +00:00
Maciek Pytel 08d18a7bd0 Define interfaces for per NodeGroup config.
This is the first step of implementing
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
New method was added to cloudprovider interface. All existing providers
were updated with a no-op stub implementation that will result in no
behavior change.
The config values specified per NodeGroup are not yet applied.
2021-01-25 11:00:16 +01:00
jichenjc 5b798ae92d Add services into role of example file 2021-01-22 09:29:03 +00:00
jichenjc eea0287a05 Switch from v1beta1 to v1 for rbac 2021-01-15 08:18:25 +00:00
Kubernetes Prow Robot 214833a9ca
Merge pull request #3801 from jichenjc/capi-define
Define clusterapi in cloudprovider layer
2021-01-14 07:43:04 -08:00
jichenjc 4a5f740552 Define clusterapi in cloudprovider layer 2021-01-14 13:08:13 +00:00
Hidekazu Nakamura a5fee21a68 Fix cluster-autoscaler clusterapi sample manifest
This commit fixes sample manifest of cluster-autoscaler clusterapi
provider.
2021-01-12 07:37:51 +00:00
Bartłomiej Wróblewski 4550bfe300 Register resources for fake dynamic client in tests 2020-11-30 10:50:27 +00:00