110 Commits

Author SHA1 Message Date
Bartłomiej Wróblewski 14655d219f Remove the MaxNodeProvisioningTimeProvider interface 2023-08-05 11:26:40 +00:00
Karol Wychowaniec 8e621b23c4 Don't pass nil nodes to GetGpuInfoForMetrics 2023-08-04 09:28:34 +00:00
Karol Wychowaniec 80053f6eca Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes 2023-08-03 08:44:46 +00:00
Karol Wychowaniec 2eba540d27 Add metrics for improved observability:
* pending_node_deletions
* failed_gpu_scale_ups_total
2023-07-25 13:01:36 +00:00
Artur Żyliński e5bc070c8c Fix: Do not inject fakeNode for instance which has errors on create 2023-07-17 11:54:30 +02:00
Daniel Gutowski 5fed449792 Add ClusterStateRegistry to the AutoscalingContext.
Because the MaxNodeProvisionTimeProvider depends on the context,
the provider was extracted to a dedicated package and injected into the
ClusterStateRegistry after context creation.
2023-07-04 05:00:09 -07:00
Bartłomiej Wróblewski 67d3e7ebc4 Include short unregistered nodes in calculation of incorrect node group sizes
2023-06-29 10:28:48 +00:00
Maria Oparka ca088d26c2 Move MaxNodeProvisionTime to NodeGroupAutoscalingOptions 2023-04-19 11:08:20 +02:00
Bartłomiej Wróblewski b5ead036a8 Merge taint utils into one package, make taint modifying methods public 2023-02-13 11:29:45 +00:00
Kuba Tużnik 7e6762535b CA: stop passing registered upcoming nodes as scale-down candidates
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
2023-02-10 14:46:19 +01:00
Kuba Tużnik 6978ff8829 CA: Make CSR's Readiness keep lists of node names instead of just their count
This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
2023-02-06 21:13:54 +01:00
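The idea behind 6978ff8829 (keeping node names per readiness state instead of counts) can be illustrated with a minimal sketch. The `Readiness`, `Node`, and `summarize` names below are hypothetical simplifications for illustration, not the actual ClusterStateRegistry types.

```go
package main

import "fmt"

// Readiness is a simplified, hypothetical stand-in for the CSR readiness summary.
// It stores node names per state; counts are derived with len() where needed.
type Readiness struct {
	Ready        []string
	Unready      []string
	NotStarted   []string
	Unregistered []string
}

// Node is a hypothetical minimal node description used only in this sketch.
type Node struct {
	Name       string
	Registered bool
	Started    bool
	Ready      bool
}

// summarize places every node into exactly one readiness bucket.
func summarize(nodes []Node) Readiness {
	var r Readiness
	for _, n := range nodes {
		switch {
		case !n.Registered:
			r.Unregistered = append(r.Unregistered, n.Name)
		case !n.Started:
			r.NotStarted = append(r.NotStarted, n.Name)
		case !n.Ready:
			r.Unready = append(r.Unready, n.Name)
		default:
			r.Ready = append(r.Ready, n.Name)
		}
	}
	return r
}

func main() {
	r := summarize([]Node{
		{Name: "n1", Registered: true, Started: true, Ready: true},
		{Name: "n2", Registered: true, Started: true},
		{Name: "n3", Registered: true},
	})
	// Counts come from len(); the names stay available for further processing.
	fmt.Println(len(r.Ready), len(r.Unready), r.NotStarted)
}
```

Keeping the names costs a few len() calls but lets callers act on the exact set of nodes in each state, which is the flexibility the commit message mentions.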
Clint Fooken 1198fbcd90 Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests. 2022-12-05 12:44:39 -08:00
Clint Fooken 08dfc7e20f Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance. 2022-11-04 17:54:05 -07:00
Clint Fooken 7fc1f6be01 Fixing errors due to merge on branches. 2022-10-17 15:45:55 -07:00
Clint cf67a3004e Implementing new cloud provider method for node deletion detection (#1)
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.
2022-10-17 14:58:38 -07:00
Clint Fooken 776d7311a1 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.
Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
2022-10-17 14:40:01 -07:00
Aleksandra Gacek ab2cc2fb8a Bump k/k dependencies to v1.25.0 together with go.mod go version. 2022-08-26 13:38:07 +02:00
Daniel Kłobuszewski 66bfe55077 Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes" 2022-07-13 10:08:03 +02:00
Clint Fooken a278255519 Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts. 2022-05-17 12:37:42 -07:00
weidongcai 03a0475502 Expose backoff time parameters 2022-05-12 15:34:28 +08:00
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing the GPU resource. Explicitly calling
them out in the status ConfigMap will point in the right direction.
2022-03-03 14:59:31 +01:00
Marwan Ahmed 8039af647e move annotations to cloudprovider package 2021-06-08 10:56:35 -07:00
Marwan Ahmed 36460df246 annotate fakeNodes so that cloudprovider implementations can identify them if needed 2021-06-06 13:54:05 -07:00
Dharma Bellamkonda e80f7c502b Log names of longUnregistered Nodes 2021-05-12 14:09:01 -06:00
Vivek Bagade 8c592f0c04 Fix bug where a node that becomes ready after 2 mins can be treated as unready. Deprecated LongNotStarted

In cases where node n1 would:
1) Be created at t=0min
2) Ready condition is true at t=2.5min
3) Not ready taint is removed at t=3min
the ready node is counted as unready

Tested cases after fix:
1) Case described above
2) Nodes not starting even after 15mins still treated as unready
3) Nodes created long ago that suddenly become unready are counted as unready.
2021-03-11 18:32:51 +01:00
Bartłomiej Wróblewski 0fb897b839 Update imports after scheduler/framework/v1alpha1 removal 2020-11-30 10:48:52 +00:00
Jakub Tużnik 6a528b45de Include taints by condition when determining if a node is unready/still starting
Conditions and their corresponding taints can sometimes skew, which
can cause unnecessary scale-up. CA thinks nodes are ready because it
looks only at the conditions, but scheduler predicates fail because they
consider the taints as well. CA adds nodes, even though the existing
nodes are still starting. This commit brings CA behavior in line
with scheduler predicates behavior, eliminating the unnecessary
scale-up.
2020-11-02 11:15:42 +01:00
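The readiness rule described in 6a528b45de can be sketched roughly as follows. The `Node` and `Taint` types and the `isReady` helper are simplified, hypothetical stand-ins rather than the real Kubernetes API types or the autoscaler's code; the two taint keys are the well-known ones Kubernetes applies for the not-ready and unreachable conditions.

```go
package main

import "fmt"

// Taint and Node are simplified stand-ins for the Kubernetes API types.
type Taint struct{ Key string }

type Node struct {
	Name           string
	ReadyCondition bool // true when the Ready condition has status True
	Taints         []Taint
}

// conditionTaints lists taint keys that mirror node conditions.
var conditionTaints = map[string]bool{
	"node.kubernetes.io/not-ready":   true,
	"node.kubernetes.io/unreachable": true,
}

// isReady treats a node as ready only if its Ready condition is true AND it
// carries none of the condition taints, so a lagging taint keeps the node in
// the "still starting" bucket, as scheduler predicates would.
func isReady(n Node) bool {
	if !n.ReadyCondition {
		return false
	}
	for _, t := range n.Taints {
		if conditionTaints[t.Key] {
			return false
		}
	}
	return true
}

func main() {
	skewed := Node{
		Name:           "n1",
		ReadyCondition: true,
		Taints:         []Taint{{Key: "node.kubernetes.io/not-ready"}},
	}
	settled := Node{Name: "n2", ReadyCondition: true}
	fmt.Println(isReady(skewed), isReady(settled))
}
```

A node whose condition has flipped but whose taint is still present is therefore not counted as ready, which avoids the unnecessary scale-up described in the commit message.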
Marwan Ahmed a3bada3708 correctly classify error for failed scale ups 2020-09-13 21:14:27 -07:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Jakub Tużnik 73a5cdf928 Address recent breaking changes in scheduler
The following things changed in scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
2020-04-24 17:54:47 +02:00
Kubernetes Prow Robot bf3a9fb52e Merge pull request #2436 from Jeffwan/skip_first_acceptable_range_check
Skip acceptable range check before it has data
2019-12-10 01:49:29 -08:00
Jakub Tużnik f64b6cd4de CSR: fix a bug in GetClusterSize
Currently, GetClusterSize reports the target number for all autoscaled
node groups, but the actual number for _all_ node groups, even those
that are not autoscaled. This commit fixes that behavior so that both
target and actual size reported are from autoscaled node groups only.
2019-11-20 13:49:49 +01:00
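The behavior described in f64b6cd4de can be illustrated with a small sketch in which both the actual and the target size are computed over autoscaled node groups only. The `NodeGroup`, `Node`, and `clusterSize` names are hypothetical and much simpler than the real ClusterStateRegistry code.

```go
package main

import "fmt"

// NodeGroup and Node are hypothetical minimal types for this sketch.
type NodeGroup struct {
	Name       string
	Autoscaled bool
	TargetSize int
}

type Node struct {
	Name  string
	Group string // name of the node group the node belongs to, if any
}

// clusterSize returns the current and target node counts, both restricted
// to autoscaled node groups so the two numbers are comparable.
func clusterSize(groups []NodeGroup, nodes []Node) (current, target int) {
	autoscaled := make(map[string]bool)
	for _, g := range groups {
		if g.Autoscaled {
			autoscaled[g.Name] = true
			target += g.TargetSize
		}
	}
	for _, n := range nodes {
		if autoscaled[n.Group] {
			current++
		}
	}
	return current, target
}

func main() {
	groups := []NodeGroup{
		{Name: "pool-a", Autoscaled: true, TargetSize: 3},
		{Name: "pool-static", Autoscaled: false, TargetSize: 5},
	}
	nodes := []Node{
		{Name: "a1", Group: "pool-a"},
		{Name: "a2", Group: "pool-a"},
		{Name: "s1", Group: "pool-static"},
	}
	current, target := clusterSize(groups, nodes)
	fmt.Println(current, target)
}
```

Counting nodes from non-autoscaled groups in only one of the two numbers would make the actual size look larger relative to the target than it really is, which is the mismatch the commit addresses.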
Łukasz Osipiuk aa53261098 More verbose logging of GCE instance create errors 2019-10-15 15:36:38 +02:00
Łukasz Osipiuk 288d4107b2 Rename GetCreatedNodesWithOutOfResourcesErrors to GetCreatedNodesWithErrors 2019-10-14 10:56:56 +02:00
Jiaxin Shan 0d278a2554 Skip acceptable range check before it has data 2019-10-09 17:59:43 -07:00
Thomas Hartland 7c17d52ec8 Invalidate node instances cache after deleting failed nodes 2019-09-30 13:56:33 +02:00
Kubernetes Prow Robot 6434df247d Merge pull request #2304 from krzysztof-jastrzebski/fix_bug
Stop disabling Cluster Autoscaler when there are no ready nodes.
2019-09-06 07:06:57 -07:00
Krzysztof Jastrzebski 839cdaaa09 Stop disabling Cluster Autoscaler when there are no ready nodes. 2019-09-06 14:45:34 +02:00
Łukasz Osipiuk 79b4614328 Use NodeDiskPressure condition instead of NodeOutOfDisk 2019-09-05 23:23:43 +02:00
devinyan 3a633de55a Check nodeGroup with IsNil to avoid crashes 2019-06-30 17:33:32 +08:00
Kubernetes Prow Robot dd89fb1385 Merge pull request #2096 from frobware/fix-segv-in-updateReadinessStats
Fix potential SEGV in updateReadinessStats
2019-06-11 09:00:24 -07:00
Andrew McDermott 91016a605a Fix SEGV in updateReadinessStats
Calling cloudprovider.NodeGroupForNode(unregistered.Node) can result
in a nil result for the nodegroup - handle that case.
2019-06-11 10:42:27 +01:00
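The nil handling described in 91016a605a can be sketched as below. `nodeGroupForNode` and `countUnregisteredPerGroup` are hypothetical helpers written for this illustration; they only mimic the shape of a lookup that may return no node group and no error.

```go
package main

import "fmt"

// NodeGroup is a hypothetical minimal node group type for this sketch.
type NodeGroup struct{ ID string }

// nodeGroupForNode is a hypothetical lookup. For a node that belongs to no
// group it returns (nil, nil): no error, but also no node group.
func nodeGroupForNode(groups map[string]*NodeGroup, nodeName string) (*NodeGroup, error) {
	return groups[nodeName], nil
}

// countUnregisteredPerGroup counts unregistered nodes per node group and
// skips nodes without a group instead of dereferencing a nil pointer.
func countUnregisteredPerGroup(groups map[string]*NodeGroup, unregistered []string) map[string]int {
	counts := make(map[string]int)
	for _, name := range unregistered {
		ng, err := nodeGroupForNode(groups, name)
		if err != nil {
			continue // lookup failed; nothing to attribute this node to
		}
		if ng == nil {
			continue // node is not part of any node group
		}
		counts[ng.ID]++
	}
	return counts
}

func main() {
	groups := map[string]*NodeGroup{
		"n1": {ID: "pool-a"},
		"n2": {ID: "pool-a"},
	}
	fmt.Println(countUnregisteredPerGroup(groups, []string{"n1", "n2", "orphan"}))
}
```

This sketch uses a concrete pointer type, so `ng == nil` is sufficient. When the lookup returns an interface value, a plain `== nil` comparison can miss a nil pointer wrapped in a non-nil interface, which needs a stricter check.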
Jakub Tużnik bb382f47f9 Retain information about scale-up failures in CSR
This will provide the AutoscalingStatusProcessor with information
about failed scale-ups.
2019-06-05 16:53:30 +02:00
Łukasz Osipiuk 950a8a9f76 Quickly fail scaleup on all instance creation errors
Change-Id: Ib918251f3e3229d882d5182a98f129b77d7731a3
2019-06-03 13:32:41 +02:00
Łukasz Osipiuk c88f014470 Add debug log in handleOutOfResourcesErrorsForNodeGroup 2019-05-31 15:26:41 +02:00
Krzysztof Jastrzebski 4831d76288 Cache cloud provider node instances in cluster state. 2019-05-31 10:11:51 +02:00
Pengfei Ni b721438315 Revert "Use cloudProvider.GetInstanceID() to get unregistered nodes"
This reverts commit f4ef957ecd.
2019-03-08 10:47:26 +08:00
Pengfei Ni f4ef957ecd Use cloudProvider.GetInstanceID() to get unregistered nodes 2019-02-27 22:58:34 +08:00
Pengfei Ni 128729bae9 Move schedulercache to package nodeinfo 2019-02-21 12:41:08 +08:00
Łukasz Osipiuk b5f9a9505c Extend backoff interface with NodeInfo and error information 2019-01-09 11:25:34 +01:00