Because the MaxNodeProvisionTimeProvider depends on the context, the
provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
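A minimal sketch of the resulting wiring, with illustrative names rather than the actual cluster-autoscaler types:

```go
package clusterstate

import "time"

// maxNodeProvisionTimeProvider depends on the autoscaling context, so
// it lives in its own package and is constructed after the context.
// Name and signature are illustrative.
type maxNodeProvisionTimeProvider interface {
	// GetMaxNodeProvisionTime returns how long a node group may take to
	// provision a node before it is considered timed out.
	GetMaxNodeProvisionTime(nodeGroup string) (time.Duration, error)
}

type clusterStateRegistry struct {
	provider maxNodeProvisionTimeProvider
}

// RegisterProviders injects the provider after the registry (and the
// context the provider needs) have already been created.
func (csr *clusterStateRegistry) RegisterProviders(p maxNodeProvisionTimeProvider) {
	csr.provider = p
}
```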
Without this, with aggressive settings, scale-down could remove
registered upcoming nodes before they had a chance to become ready
(how long that takes should be unrelated to the scale-down settings).
This does make us call len() in a number of places within the
ClusterStateRegistry, but it allows for greater flexibility: it is now
possible to act on the sets of nodes tracked by Readiness, as sketched below.
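A sketch of what this looks like, assuming Readiness now stores node names rather than counts (field names illustrative):

```go
package clusterstate

// Readiness keeps the node names themselves instead of plain counters,
// so callers can act on the sets and derive counts via len().
type Readiness struct {
	Ready      []string // names of ready nodes
	Unready    []string // names of unready nodes
	NotStarted []string // names of nodes that have not started yet
}

// ReadyCount is what code that previously read an int counter now uses.
func (r Readiness) ReadyCount() int {
	return len(r.Ready)
}
```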
* Added an isNodeDeleted method to the CloudProvider interface, supporting detection of whether nodes are fully deleted or are not autoscaled. Updated the cloud providers with an initial implementation of the new method that returns ErrNotImplemented, preserving the existing taint-based deletion calculation in clusterstate (sketched below).
Added a check to the backfill loop to confirm that the cloud provider node no longer exists before flagging the node as deleted. Made some comments more accurate. Restored a line that had been deleted by mistake.
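A rough sketch of the interface addition and the initial provider pattern; the method is capitalized here because Go interfaces implemented across packages require exported methods, and the exact signatures may differ from the real cluster-autoscaler code:

```go
package cloudprovider

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
)

// ErrNotImplemented signals that a provider has no real implementation
// yet, so clusterstate falls back to taint-based deletion detection.
var ErrNotImplemented = errors.New("not implemented")

// CloudProvider gains a method for detecting whether a node is fully
// deleted (or not autoscaled at all); other methods are elided.
type CloudProvider interface {
	IsNodeDeleted(node *apiv1.Node) (bool, error)
}

// unimplementedProvider shows the initial implementation pattern used
// by the existing providers: always defer to the taint-based path.
type unimplementedProvider struct{}

func (p unimplementedProvider) IsNodeDeleted(node *apiv1.Node) (bool, error) {
	return false, ErrNotImplemented
}
```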
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as the Kubernetes API is concerned, but CA will still
report some of them as unready if they are missing the GPU resource.
Explicitly calling them out in the status ConfigMap points in the right
direction.
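A sketch of the kind of check involved, assuming the usual nvidia.com/gpu extended resource; the helper name is hypothetical:

```go
package gpu

import (
	apiv1 "k8s.io/api/core/v1"
)

// resourceNvidiaGPU is the extended resource CA waits for; the constant
// here is illustrative.
const resourceNvidiaGPU = apiv1.ResourceName("nvidia.com/gpu")

// nodeHasGpuAllocatable reports whether the node already advertises an
// allocatable GPU; a Ready node without it is still treated as unready.
func nodeHasGpuAllocatable(node *apiv1.Node) bool {
	gpu, found := node.Status.Allocatable[resourceNvidiaGPU]
	return found && !gpu.IsZero()
}
```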
Deprecated LongNotStarted.
In cases where node n1 would:
1) be created at t=0min,
2) have its Ready condition become true at t=2.5min,
3) have its not-ready taint removed at t=3min,
the ready node was counted as unready.
Tested cases after the fix:
1) the case described above,
2) nodes that still have not started after 15 minutes are treated as unready,
3) nodes created long ago that suddenly become unready are counted as unready.
Conditions and their corresponding taints can sometimes be out of sync,
which can cause unnecessary scale-up. CA thinks nodes are ready because
it looks only at the conditions, but scheduler predicates fail because
they consider the taints as well, so CA adds nodes even though the
existing nodes are still starting. This commit brings CA behavior in
line with scheduler predicate behavior, eliminating the unnecessary
scale-up.
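A sketch of the taint-aware readiness check this implies, using the standard node.kubernetes.io/not-ready taint key (helper name hypothetical):

```go
package readiness

import (
	apiv1 "k8s.io/api/core/v1"
)

// notReadyTaintKey is the taint the node lifecycle controller applies
// while a node's Ready condition is false or unknown.
const notReadyTaintKey = "node.kubernetes.io/not-ready"

// isNodeReady requires both a true Ready condition and the absence of
// the not-ready taint, so CA sees the same picture as the scheduler.
func isNodeReady(node *apiv1.Node) bool {
	ready := false
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady && cond.Status == apiv1.ConditionTrue {
			ready = true
			break
		}
	}
	if !ready {
		return false
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == notReadyTaintKey {
			return false
		}
	}
	return true
}
```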
The following things changed in the scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
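For illustration, adapting code to the new NodeInfo.Pods shape looks roughly like this (the schedulerframework import path varies by Kubernetes release; the helper is hypothetical):

```go
package schedulerutil

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// podsOnNode adapts to the new NodeInfo shape: Pods is now a slice of
// *schedulerframework.PodInfo wrapping the API pod, and the field is
// accessed directly rather than through a getter.
func podsOnNode(nodeInfo *schedulerframework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods))
	for _, podInfo := range nodeInfo.Pods {
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```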
Currently, GetClusterSize reports the target size for autoscaled node
groups only, but the actual size for _all_ node groups, even those
that are not autoscaled. This commit fixes that behavior so that both
the target and actual sizes reported come from autoscaled node groups
only.
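A minimal sketch of the corrected accounting, with illustrative types:

```go
package clusterstate

// nodeGroupSize is an illustrative stand-in for per-group size data.
type nodeGroupSize struct {
	autoscaled bool
	target     int
	current    int
}

// getClusterSize sums sizes over autoscaled node groups only; before
// the fix, non-autoscaled groups leaked into the actual-size total.
func getClusterSize(groups []nodeGroupSize) (target, current int) {
	for _, g := range groups {
		if !g.autoscaled {
			continue // previously these still counted toward current
		}
		target += g.target
		current += g.current
	}
	return target, current
}
```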