autoscaler

Commit Graph

Author	SHA1	Message	Date
Clint Fooken	1198fbcd90	Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests.	2022-12-05 12:44:39 -08:00
Clint Fooken	08dfc7e20f	Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.	2022-11-04 17:54:05 -07:00
Clint Fooken	7fc1f6be01	Fixing errors due to merge on branches.	2022-10-17 15:45:55 -07:00
Clint	cf67a3004e	Implementing new cloud provider method for node deletion detection (#1 ) * Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.	2022-10-17 14:58:38 -07:00
Clint Fooken	776d7311a1	Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates. Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.	2022-10-17 14:40:01 -07:00
Aleksandra Gacek	ab2cc2fb8a	Bump k/k dependencies to v1.25.0 together with go.mod go version.	2022-08-26 13:38:07 +02:00
Daniel Kłobuszewski	66bfe55077	Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes"	2022-07-13 10:08:03 +02:00
Clint Fooken	a278255519	Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts.	2022-05-17 12:37:42 -07:00
weidongcai	03a0475502	Expose backoff time parameters	2022-05-12 15:34:28 +08:00
Daniel Kłobuszewski	26769e4c1b	Expose nodes with unready GPU in CA status This change simplifies debugging GPU issues: without it, all nodes can be Ready as far as Kubernetes API is concerned, but CA will still report some of them as unready if are missing GPU resource. Explicitly calling them out in the status ConfigMap will point into the right direction.	2022-03-03 14:59:31 +01:00
Marwan Ahmed	8039af647e	move annotations to cloudprovider package	2021-06-08 10:56:35 -07:00
Marwan Ahmed	36460df246	annotate fakeNodes so that cloudprovider implementations can identify them if needed	2021-06-06 13:54:05 -07:00
Dharma Bellamkonda	e80f7c502b	Log names of longUnregistered Nodes	2021-05-12 14:09:01 -06:00
Vivek Bagade	8c592f0c04	Fix bug where a node that becomes ready after 2 mins can be treated as unready. Deprecated LongNotStarted In cases where node n1 would: 1) Be created at t=0min 2) Ready condition is true at t=2.5min 3) Not ready taint is removed at t=3min the ready node is counted as unready Tested cases after fix: 1) Case described above 2) Nodes not starting even after 15mins still treated as unready 3) Nodes created long ago that suddenly become unready are counted as unready.	2021-03-11 18:32:51 +01:00
Bartłomiej Wróblewski	0fb897b839	Update imports after scheduler scheduler/framework/v1alpha1 removal	2020-11-30 10:48:52 +00:00
Jakub Tużnik	6a528b45de	Include taints by condition when determining if a node is unready/still starting Conditions and their corresponding taints can sometimes skew, which can cause unnecessary scale-up. CA thinks nodes are ready because it looks only at the conditions, but scheduler predicates fail because they consider the taints as well. CA adds nodes, even though the existing nodes are still starting. This commit brings CA behavior in line with scheduler predicates behavior, eliminating the unnecessary scale-up.	2020-11-02 11:15:42 +01:00
Marwan Ahmed	a3bada3708	correctly classify error for failed scale ups	2020-09-13 21:14:27 -07:00
Maciek Pytel	655b4081f4	Migrate to klog v2	2020-06-05 17:22:26 +02:00
Jakub Tużnik	73a5cdf928	Address recent breaking changes in scheduler The following things changed in scheduler and needed to be fixed: * NodeInfo was moved to schedulerframework * Some fields on NodeInfo are now exposed directly instead of via getters * NodeInfo.Pods is now a list of schedulerframework.PodInfo, not apiv1.Pod * SharedLister and NodeInfoLister were moved to schedulerframework * PodLister was removed	2020-04-24 17:54:47 +02:00
Kubernetes Prow Robot	bf3a9fb52e	Merge pull request #2436 from Jeffwan/skip_first_acceptable_range_check Skip acceptable range check before it has data	2019-12-10 01:49:29 -08:00
Jakub Tużnik	f64b6cd4de	CSR: fix a bug in GetClusterSize Currently, GetClusterSize reports the target number for all autoscaled node groups, but the actual number for _all_ node groups, even those that are not autoscaled. This commit fixes that behavior so that both target and actual size reported are from autoscaled node groups only.	2019-11-20 13:49:49 +01:00
Łukasz Osipiuk	aa53261098	More verbose logging of GCE instance create errors	2019-10-15 15:36:38 +02:00
Łukasz Osipiuk	288d4107b2	Rename GetCreatedNodesWithOutOfResourcesErrors to GetCreatedNodesWithErrors	2019-10-14 10:56:56 +02:00
Jiaxin Shan	0d278a2554	Skip acceptable range check before it has data	2019-10-09 17:59:43 -07:00
Thomas Hartland	7c17d52ec8	Invalidate node instances cache after deleting failed nodes	2019-09-30 13:56:33 +02:00
Kubernetes Prow Robot	6434df247d	Merge pull request #2304 from krzysztof-jastrzebski/fix_bug Stop disabling Cluster Autoscaler when there is no ready nodes.	2019-09-06 07:06:57 -07:00
Krzysztof Jastrzebski	839cdaaa09	Stop disabling Cluster Autoscaler when there is no ready nodes.	2019-09-06 14:45:34 +02:00
Łukasz Osipiuk	79b4614328	Use NodeDiskPressure condition instead of NodeOutOfDisk	2019-09-05 23:23:43 +02:00
devinyan	3a633de55a	nodeGroup judy IsNil to avoid crashed	2019-06-30 17:33:32 +08:00
Kubernetes Prow Robot	dd89fb1385	Merge pull request #2096 from frobware/fix-segv-in-updateReadinessStats Fix potential SEGV in updateReadinessStats	2019-06-11 09:00:24 -07:00
Andrew McDermott	91016a605a	Fix SEGV in updateReadinessStats Calling cloudprovider.NodeGroupForNode(unregistered.Node) can result in a nil result for the nodegroup - handle that case.	2019-06-11 10:42:27 +01:00
Jakub Tużnik	bb382f47f9	Retain information about scale-up failures in CSR This will provide the AutoscalingStatusProcessor with information about failed scale-ups.	2019-06-05 16:53:30 +02:00
Łukasz Osipiuk	950a8a9f76	Quickly fail scaleup on all instance creation errors Change-Id: Ib918251f3e3229d882d5182a98f129b77d7731a3	2019-06-03 13:32:41 +02:00
Łukasz Osipiuk	c88f014470	Add debug log in handleOutOfResourcesErrorsForNodeGroup	2019-05-31 15:26:41 +02:00
Krzysztof Jastrzebski	4831d76288	Cache cloud provider node instances in cluster state.	2019-05-31 10:11:51 +02:00
Pengfei Ni	b721438315	Revert "Use cloudProvider.GetInstanceID() to get unregistered nodes" This reverts commit `f4ef957ecd`.	2019-03-08 10:47:26 +08:00
Pengfei Ni	f4ef957ecd	Use cloudProvider.GetInstanceID() to get unregistered nodes	2019-02-27 22:58:34 +08:00
Pengfei Ni	128729bae9	Move schedulercache to package nodeinfo	2019-02-21 12:41:08 +08:00
Łukasz Osipiuk	b5f9a9505c	Extend backoff interface with NodeInfo and error information	2019-01-09 11:25:34 +01:00
Łukasz Osipiuk	85a83b62bd	Pass nodeGroup->NodeInfo map to ClusterStateRegistry Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801	2019-01-08 15:52:00 +01:00
Łukasz Osipiuk	5cddbda693	Rename nodeGroupBackoffInfo to backoff in ClusterStateRegistry	2018-12-31 17:59:58 +01:00
Łukasz Osipiuk	2fbae197f4	Handle possible stockout/quota scale-up errors	2018-12-28 17:17:07 +01:00
Łukasz Osipiuk	9689b30ee4	Do not use time.Now() in RegisterFailedScaleUp	2018-12-28 17:17:07 +01:00
Łukasz Osipiuk	da5bef307b	Allow updating Increase for ScaleUpRequest in ClusterStateRegistry	2018-12-28 17:17:07 +01:00
lsytj0413	8ca0e71d1e	refactor(*): fix some golint warning	2018-12-24 11:07:15 +08:00
Łukasz Osipiuk	016bf7fc2c	Use k8s.io/klog instead github.com/golang/glog	2018-11-26 17:30:31 +01:00
Łukasz Osipiuk	5962354c81	Inject Backoff instance to ClusterStateRegistry on creation	2018-11-13 14:25:16 +01:00
k8s-ci-robot	7008fb50be	Merge pull request #1380 from losipiuk/lo/backoff Make Backoff interface	2018-11-07 05:13:43 -08:00
Łukasz Osipiuk	0e2c3739b7	Use NodeGroup as key in Backoff	2018-10-30 18:17:26 +01:00
Łukasz Osipiuk	55fc1e2f00	Store NodeGroup in ScaleUpRequest and ScaleDownRequest	2018-10-30 18:03:04 +01:00

99 Commits