autoscaler

Commit Graph

Author	SHA1	Message	Date
michael mccune	1e1615ad63	add an extra note to clusterapi readme about gpus this change adds a little more detail to ensure that users understand how to use the GPU label feature.	2023-01-18 17:16:09 -05:00
michael mccune	6b80a7134a	add a note to clusterapi readme about ignored labels this change adds a section to the readme that provides advice for clusterapi users about which labels they might want to ignore when using the balance similar node groups flag on various cloud providers.	2023-01-12 09:59:25 -05:00
michael mccune	8ca3afc35b	update clusterapi readme with table of contents this change will make navigating the readme easier for users.	2023-01-12 09:59:25 -05:00
Kubernetes Prow Robot	ba3b244720	Merge pull request #5054 from fookenc/fix-autoscaler-node-deletion Identifying cloud provider deleted nodes	2022-12-16 05:54:17 -08:00
Kubernetes Prow Robot	af23e6187e	Merge pull request #5276 from pacoxu/master Stop applying the beta.kubernetes.io/os and arch	2022-12-16 03:10:17 -08:00
Nick Jones	684184c94a	Add note regarding GPU label for the CAPI provider cluster-autoscaler takes into consideration the time that a node takes to initialise a GPU resource on a node, as long as a particular label is in place. This label differs from provider to provider, and is documented in some cases but not for CAPI. This commit adds a note with the specific label that should be applied when a node is instantiated.	2022-11-25 12:02:29 +00:00
Clint Fooken	08dfc7e20f	Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.	2022-11-04 17:54:05 -07:00
Paco Xu	8dec2025f8	Stop applying the beta.kubernetes.io/os and arch	2022-10-27 12:20:04 +08:00
Clint Fooken	ea7059f4c6	Adjusting initial implementation of NodeExists to be consistent among cloud providers to return true and ErrNotImplemented.	2022-10-17 18:39:19 -07:00
Clint	cf67a3004e	Implementing new cloud provider method for node deletion detection (#1) * Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.	2022-10-17 14:58:38 -07:00
Michael McCune	bb015b26a1	remove unsupported functionality from cluster-api provider this change removes the code for the `Labels` and `Taints` interface functions of the clusterapi provider when scaling from zero. The body of these functions was added erroneously and the Cluster API community is still deciding on how these values will be exposed to the autoscaler. also updates the tests and readme to be more clear about the usage of labels and taints when scaling from zero.	2022-10-14 14:06:57 -04:00
Michael McCune	5c9cc27f75	cleanup unused constants in clusterapi provider this change removes some unused values and adjusts the names in the unit tests to better reflect usage.	2022-09-29 14:22:05 -04:00
Kubernetes Prow Robot	500652b6e1	Merge pull request #5123 from elmiko/update-capi-docs update clusterapi readme	2022-08-26 06:48:25 -07:00
Michael McCune	e089d14692	update clusterapi readme to be more accurate about scale from zero support.	2022-08-24 12:52:57 -04:00
Eng Zer Jun	66805969de	test: use `T.Setenv` to set env vars in tests This commit replaces `os.Setenv` with `t.Setenv` in tests. The environment variable is automatically restored to its original value when the test and all its subtests complete. Reference: https://pkg.go.dev/testing#T.Setenv Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2022-08-18 21:28:18 +08:00
Kubernetes Prow Robot	e478ee2959	Merge pull request #4840 from elmiko/capi-scale-from-zero clusterapi scale from zero support	2022-08-18 02:38:36 -07:00
Michael McCune	f02c9972eb	add more caching to clusterapi provider this change adds logic to create informers for the infrastructure machine templates that are discovered during the scale from zero checks. it also adds tests and a slight change to the controller structure to account for the dynamic informer creation.	2022-08-17 16:25:16 -04:00
killianmuldoon	b24075c9bb	Add ClusterClass usage instructions to ClusterClass docs Signed-off-by: killianmuldoon <kmuldoon@vmware.com>	2022-07-27 15:34:37 +01:00
Michael McCune	1a65fde540	cleanup clusterapi scale from zero implementation This commit is a combination of several commits. Significant details are preserved below. * update functions for resource annotations This change converts some of the functions that look at annotation for resource usage to indicate their usage in the function name. This helps to make room for allowing the infrastructure reference as an alternate source for the capacity information. * migrate capacity logic into a single function This change moves the logic to collect the instance capacity from the TemplateNodeInfo function into a method of the unstructuredScalableResource named InstanceCapacity. This new function is created to house the logic that will decide between annotations and the infrastructure reference when calculating the capacity for the node. * add ability to lookup infrastructure references This change supplements the annotation lookups by adding the logic to read the infrastructure reference if it exists. This is done to determine if the machine template exposes a capacity field in its status. For more information on how this mechanism works, please see the cluster-api enhancement[0]. * add documentation for capi scaling from zero * improve tests for clusterapi scale from zero this change adds functionality to test the dynamic client behavior of getting the infrastructure machine templates. * update README with information about rbac changes this adds more information about the rbac changes necessary for the scale from zero support to work. * remove extra check for scaling from zero since the CanScaleFromZero function checks to see if both CPU and memory are present, there is no need to check a second time. This also adds some documentation to the CanScaleFromZero function to make it clearer what is happening. * update unit test for capi scale from zero adding a few more cases and details to the scale from zero unit tests, including ensuring that the int based annotations do not accept other unit types. [0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md	2022-07-22 20:21:32 -04:00
Andrew McDermott	de90a462c7	Implement scale from zero for clusterapi This allows a Machine{Set,Deployment} to scale up/down from 0, providing the following annotations are set: ```yaml apiVersion: v1 items: - apiVersion: machine.openshift.io/v1beta1 kind: MachineSet metadata: annotations: machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0" machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6" machine.openshift.io/vCPU: "2" machine.openshift.io/memoryMb: 8G machine.openshift.io/GPU: "1" machine.openshift.io/maxPods: "100" ``` Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods` are optional. For autoscaling from zero, the autoscaler should convert the mem value received in the appropriate annotation to bytes using powers of two consistently with other providers and fail if the format received is not expected. This gives robust behaviour consistent with cloud providers APIs and providers implementations. https://cloud.google.com/compute/all-pricing https://www.iec.ch/si/binary.htm https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366 Co-authored-by: Enxebre <alberto.garcial@hotmail.com> Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk> Co-authored-by: Michael McCune <elmiko@redhat.com>	2022-07-18 13:50:25 -04:00
enxebre	b2f1823c91	Get capi targetsize from cache This ensured that access to replicas during scale down operations were never stale by accessing the API server https://github.com/kubernetes/autoscaler/issues/3104. This honoured that behaviour while moving to unstructured client https://github.com/kubernetes/autoscaler/pull/3312. This regressed that behaviour while trying to reduce the API server load https://github.com/kubernetes/autoscaler/pull/4443. This put back the never stale replicas behaviour at the cost of loading back the API server https://github.com/kubernetes/autoscaler/pull/4634. Currently, on e.g. a 48-minute cluster, it makes 1.4k get requests to the scale subresource. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that, it lets targetSize, which is called on every autoscaling cluster state loop, come from cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.	2022-07-13 20:26:44 +02:00
enxebre	f2f95102cf	Drop deprecated CAPI annotations	2022-05-31 10:31:43 +02:00
ivan sumak	59a153c0f5	Typo fix - test.k8s.io Fixing typo in Specifying a Custom Resource Group section in annotation examples.	2022-05-05 11:45:03 +02:00
Michael McCune	1d5e0f155a	add user configurable cluster api version This change introduces an environment variable, `CAPI_VERSION`, through which a user can set the API version for the group they are using. This change is being added to address situations where a user might have multiple API versions for the cluster api group and wishes to be explicit about which version is selected. Also adds unit tests and documentation for the new behavior. This change does not break the existing behavior.	2022-02-25 09:46:34 -05:00
Joel Speed	9f670d4ea8	Ensure ClusterAPI DeleteNodes accounts for out of band changes scale Because the autoscaler assumes it can delete nodes in parallel, it fetches nodegroups for each node in separate go routines and then instructs each nodegroup to delete a single node. Because we don't share the nodegroup across go routines, the cached replica count in the scalableresource can become stale and as such, if the autoscaler attempts to scale down multiple nodes at a time, the cluster api provider only actually removes a single node. To prevent this, we must ensure we have a fresh replica count for every scale down attempt.	2022-01-21 16:08:00 +00:00
Naadir Jeewa	ee761bdc24	Cluster API OWNERS: Remove randomvariable Signed-off-by: Naadir Jeewa <jeewan@vmware.com>	2022-01-05 15:11:21 +00:00
Kubernetes Prow Robot	12efcce4c7	Merge pull request #4443 from codablock/fix-rate-limitting [clusterapi] Rely on replica count found in unstructuredScalableResource	2021-12-14 10:45:30 -08:00
Kubernetes Prow Robot	732cb659cf	Merge pull request #4474 from elmiko/update-capi-readme add configuration diagrams to clusterapi readme	2021-11-23 00:32:17 -08:00
Michael McCune	540a794d32	add configuration diagrams to clusterapi readme This change adds ascii diagrams to help illustrate the differences between the various authentication configurations for the clusterapi provider. Due to the distributed nature of Cluster API and its ability to have several Kubernetes clusters managed from a central location, the kubeconfig authentication options for it are slightly more complex than other providers.	2021-11-22 10:12:53 -05:00
GuyTempleton	b7b5df50ca	CA - Update gofmt of CAPI_nodegroup.go	2021-11-14 19:41:31 +00:00
Clinton Yeboah	ecfaa6d700	removes deprecated CAPI annotations	2021-11-11 18:56:53 -05:00
Michael McCune	755cb1b7b6	expand CAPI_GROUP usage to cover other capi group variables This change updates the logic for the clusterapi autoscaler provider so that the `CAPI_GROUP` environment variable will also affect the annotation keys for minimum and maximum node group size, the machine annotation, machine deletion, and the cluster name label. It also adds unit tests and an update to the readme.	2021-11-09 16:22:36 -05:00
Alexander Block	897c208ed1	Fix tests	2021-11-04 14:40:10 +01:00
Alexander Block	8b21473fc7	[clusterapi] Rely on replica count found in unstructuredScalableResource Instead of retrieving it each time from k8s, which easily causes client-side throttling, which in turn causes each autoscaler run to take multiple seconds even if only a small number of NodeGroups is involved and there is nothing to do.	2021-11-04 11:09:27 +01:00
Kubernetes Prow Robot	924b723646	Merge pull request #4273 from dkoshkin/patch-1 fix: add missing RBAC permissions to example spec	2021-09-06 03:54:29 -07:00
GuyTempleton	17e028bd9e	CA - Cloud Provider Examples - add ability to list/watch/get namespaces As of the 1.22 release of k8s, the scheduler now requires the ability to list namespaces	2021-08-23 15:39:38 +01:00
Dimitri Koshkin	7105eb2189	fix: add missing RBAC permissions to example spec Similar change was done in https://github.com/kubernetes/autoscaler/pull/4154	2021-08-17 10:40:13 -07:00
Michael McCune	0499b886d4	update cluster-autoscaler CAPI provider owners This change is adding github users arunmk, mrajashree, jackfrancis, shysank, and randomvariable to the reviewers for the cluster-api provider. It also removes frobware and ncdc from the approvers and reviewers.	2021-07-15 14:36:19 -04:00
shysank	8b20473e82	fix capi example and update readme	2021-04-16 21:21:59 +05:30
Jack Francis	d9531d3e81	cloudprovider: ClusterAPIProviderName spelling	2021-04-14 15:21:00 -07:00
shysank	7ac44990f5	update readme and example to limit capi rbac to a single namespace	2021-04-14 02:34:54 +05:30
shysank	68ce0643bd	management cluster informer should watch only the namespace configured in auto discovery	2021-04-14 02:27:20 +05:30
jichenjc	411eff43d9	bump clusterapi sample suggested version	2021-01-29 04:24:40 +00:00
Maciek Pytel	08d18a7bd0	Define interfaces for per NodeGroup config. This is the first step of implementing https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343. New method was added to cloudprovider interface. All existing providers were updated with a no-op stub implementation that will result in no behavior change. The config values specified per NodeGroup are not yet applied.	2021-01-25 11:00:16 +01:00
jichenjc	5b798ae92d	Add services into role of example file	2021-01-22 09:29:03 +00:00
jichenjc	eea0287a05	Switch from v1beta1 to v1 for rbac	2021-01-15 08:18:25 +00:00
Kubernetes Prow Robot	214833a9ca	Merge pull request #3801 from jichenjc/capi-define Define clusterapi in cloudprovider layer	2021-01-14 07:43:04 -08:00
jichenjc	4a5f740552	Define clusterapi in cloudprovider layer	2021-01-14 13:08:13 +00:00
Hidekazu Nakamura	a5fee21a68	Fix cluster-autoscaler clusterapi sample manifest This commit fixes sample manifest of cluster-autoscaler clusterapi provider.	2021-01-12 07:37:51 +00:00
Bartłomiej Wróblewski	4550bfe300	Register resources for fake dynamic client in tests	2020-11-30 10:50:27 +00:00

83 Commits