autoscaler

Commit Graph

Author	SHA1	Message	Date
Daniel Kłobuszewski	26769e4c1b	Expose nodes with unready GPU in CA status. This change simplifies debugging GPU issues: without it, all nodes can be Ready as far as the Kubernetes API is concerned, but CA will still report some of them as unready if they are missing the GPU resource. Explicitly calling them out in the status ConfigMap will point in the right direction.	2022-03-03 14:59:31 +01:00
Marwan Ahmed	8039af647e	move annotations to cloudprovider package	2021-06-08 10:56:35 -07:00
Marwan Ahmed	36460df246	annotate fakeNodes so that cloudprovider implementations can identify them if needed	2021-06-06 13:54:05 -07:00
Dharma Bellamkonda	e80f7c502b	Log names of longUnregistered Nodes	2021-05-12 14:09:01 -06:00
Vivek Bagade	8c592f0c04	Fix bug where a node that becomes ready after 2 mins can be treated as unready. Deprecated LongNotStarted In cases where node n1 would: 1) Be created at t=0min 2) Ready condition is true at t=2.5min 3) Not ready taint is removed at t=3min the ready node is counted as unready Tested cases after fix: 1) Case described above 2) Nodes not starting even after 15mins still treated as unready 3) Nodes created long ago that suddenly become unready are counted as unready.	2021-03-11 18:32:51 +01:00
Bartłomiej Wróblewski	0fb897b839	Update imports after scheduler scheduler/framework/v1alpha1 removal	2020-11-30 10:48:52 +00:00
Jakub Tużnik	6a528b45de	Include taints by condition when determining if a node is unready/still starting Conditions and their corresponding taints can sometimes skew, which can cause unnecessary scale-up. CA thinks nodes are ready because it looks only at the conditions, but scheduler predicates fail because they consider the taints as well. CA adds nodes, even though the existing nodes are still starting. This commit brings CA behavior in line with scheduler predicates behavior, eliminating the unnecessary scale-up.	2020-11-02 11:15:42 +01:00
Marwan Ahmed	a3bada3708	correctly classify error for failed scale ups	2020-09-13 21:14:27 -07:00
Maciek Pytel	655b4081f4	Migrate to klog v2	2020-06-05 17:22:26 +02:00
Jakub Tużnik	73a5cdf928	Address recent breaking changes in scheduler The following things changed in scheduler and needed to be fixed: * NodeInfo was moved to schedulerframework * Some fields on NodeInfo are now exposed directly instead of via getters * NodeInfo.Pods is now a list of schedulerframework.PodInfo, not apiv1.Pod * SharedLister and NodeInfoLister were moved to schedulerframework * PodLister was removed	2020-04-24 17:54:47 +02:00
Kubernetes Prow Robot	bf3a9fb52e	Merge pull request #2436 from Jeffwan/skip_first_acceptable_range_check Skip acceptable range check before it has data	2019-12-10 01:49:29 -08:00
Jakub Tużnik	f64b6cd4de	CSR: fix a bug in GetClusterSize Currently, GetClusterSize reports the target number for all autoscaled node groups, but the actual number for _all_ node groups, even those that are not autoscaled. This commit fixes that behavior so that both target and actual size reported are from autoscaled node groups only.	2019-11-20 13:49:49 +01:00
Łukasz Osipiuk	aa53261098	More verbose logging of GCE instance create errors	2019-10-15 15:36:38 +02:00
Łukasz Osipiuk	288d4107b2	Rename GetCreatedNodesWithOutOfResourcesErrors to GetCreatedNodesWithErrors	2019-10-14 10:56:56 +02:00
Jiaxin Shan	0d278a2554	Skip acceptable range check before it has data	2019-10-09 17:59:43 -07:00
Thomas Hartland	7c17d52ec8	Invalidate node instances cache after deleting failed nodes	2019-09-30 13:56:33 +02:00
Kubernetes Prow Robot	6434df247d	Merge pull request #2304 from krzysztof-jastrzebski/fix_bug Stop disabling Cluster Autoscaler when there are no ready nodes.	2019-09-06 07:06:57 -07:00
Krzysztof Jastrzebski	839cdaaa09	Stop disabling Cluster Autoscaler when there are no ready nodes.	2019-09-06 14:45:34 +02:00
Łukasz Osipiuk	79b4614328	Use NodeDiskPressure condition instead of NodeOutOfDisk	2019-09-05 23:23:43 +02:00
devinyan	3a633de55a	Check whether nodeGroup IsNil to avoid a crash	2019-06-30 17:33:32 +08:00
Kubernetes Prow Robot	dd89fb1385	Merge pull request #2096 from frobware/fix-segv-in-updateReadinessStats Fix potential SEGV in updateReadinessStats	2019-06-11 09:00:24 -07:00
Andrew McDermott	91016a605a	Fix SEGV in updateReadinessStats Calling cloudprovider.NodeGroupForNode(unregistered.Node) can result in a nil result for the nodegroup - handle that case.	2019-06-11 10:42:27 +01:00
Jakub Tużnik	bb382f47f9	Retain information about scale-up failures in CSR This will provide the AutoscalingStatusProcessor with information about failed scale-ups.	2019-06-05 16:53:30 +02:00
Łukasz Osipiuk	950a8a9f76	Quickly fail scaleup on all instance creation errors Change-Id: Ib918251f3e3229d882d5182a98f129b77d7731a3	2019-06-03 13:32:41 +02:00
Łukasz Osipiuk	c88f014470	Add debug log in handleOutOfResourcesErrorsForNodeGroup	2019-05-31 15:26:41 +02:00
Krzysztof Jastrzebski	4831d76288	Cache cloud provider node instances in cluster state.	2019-05-31 10:11:51 +02:00
Pengfei Ni	b721438315	Revert "Use cloudProvider.GetInstanceID() to get unregistered nodes" This reverts commit `f4ef957ecd`.	2019-03-08 10:47:26 +08:00
Pengfei Ni	f4ef957ecd	Use cloudProvider.GetInstanceID() to get unregistered nodes	2019-02-27 22:58:34 +08:00
Pengfei Ni	128729bae9	Move schedulercache to package nodeinfo	2019-02-21 12:41:08 +08:00
Łukasz Osipiuk	b5f9a9505c	Extend backoff interface with NodeInfo and error information	2019-01-09 11:25:34 +01:00
Łukasz Osipiuk	85a83b62bd	Pass nodeGroup->NodeInfo map to ClusterStateRegistry Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801	2019-01-08 15:52:00 +01:00
Łukasz Osipiuk	5cddbda693	Rename nodeGroupBackoffInfo to backoff in ClusterStateRegistry	2018-12-31 17:59:58 +01:00
Łukasz Osipiuk	2fbae197f4	Handle possible stockout/quota scale-up errors	2018-12-28 17:17:07 +01:00
Łukasz Osipiuk	9689b30ee4	Do not use time.Now() in RegisterFailedScaleUp	2018-12-28 17:17:07 +01:00
Łukasz Osipiuk	da5bef307b	Allow updating Increase for ScaleUpRequest in ClusterStateRegistry	2018-12-28 17:17:07 +01:00
lsytj0413	8ca0e71d1e	refactor(*): fix some golint warning	2018-12-24 11:07:15 +08:00
Łukasz Osipiuk	016bf7fc2c	Use k8s.io/klog instead of github.com/golang/glog	2018-11-26 17:30:31 +01:00
Łukasz Osipiuk	5962354c81	Inject Backoff instance to ClusterStateRegistry on creation	2018-11-13 14:25:16 +01:00
k8s-ci-robot	7008fb50be	Merge pull request #1380 from losipiuk/lo/backoff Make Backoff interface	2018-11-07 05:13:43 -08:00
Łukasz Osipiuk	0e2c3739b7	Use NodeGroup as key in Backoff	2018-10-30 18:17:26 +01:00
Łukasz Osipiuk	55fc1e2f00	Store NodeGroup in ScaleUpRequest and ScaleDownRequest	2018-10-30 18:03:04 +01:00
Łukasz Osipiuk	e462d4420c	Extract Backoff interface	2018-10-29 23:02:13 +01:00
Łukasz Osipiuk	41b02870f8	NodeGroup.Nodes() returns Instance struct instead of instance name. This is preparatory work for handling resource-related (stockout/quota-exceeded) error conditions in CA.	2018-10-26 14:41:18 +02:00
Łukasz Osipiuk	29c22c0a3d	Store single ScaleUpRequest per node group	2018-10-18 18:27:31 +02:00
Jakub Tużnik	b105f28ebd	Add a method to determine if a node group is at its target size to CSR	2018-09-07 20:24:38 +02:00
Aleksandra Malinowska	364e2da764	Check for ready condition not true	2018-08-30 13:43:24 +02:00
Jakub Tużnik	51334f283e	Fix GetClusterSize to return actual size in line with the rest of CSR It returned the number of registered nodes, but should return the number of provisioned nodes instead.	2018-08-27 14:58:07 +02:00
Jakub Tużnik	054f0b3b90	Add AutoscalingStatusProcessor	2018-08-07 14:47:06 +02:00
Krzysztof Jastrzebski	dd1db7a0ac	Move backoff mechanism to utils.	2018-06-13 15:32:25 +02:00
Aleksandra Malinowska	820f688d2a	Update max unready nodes to 45%	2018-05-17 12:51:45 +02:00


90 Commits