Clint Fooken
1198fbcd90
Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests.
2022-12-05 12:44:39 -08:00
Clint Fooken
08dfc7e20f
Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.
2022-11-04 17:54:05 -07:00
Clint Fooken
7fc1f6be01
Fixing errors due to merge on branches.
2022-10-17 15:45:55 -07:00
Clint
cf67a3004e
Implementing new cloud provider method for node deletion detection (#1)
...
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.
2022-10-17 14:58:38 -07:00
Clint Fooken
776d7311a1
Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates.
...
Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
2022-10-17 14:40:01 -07:00
Aleksandra Gacek
ab2cc2fb8a
Bump k/k dependencies to v1.25.0 together with go.mod go version.
2022-08-26 13:38:07 +02:00
Daniel Kłobuszewski
66bfe55077
Revert "Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes"
2022-07-13 10:08:03 +02:00
Clint Fooken
a278255519
Adding support for identifying nodes that have been deleted from cloud provider that are still registered within Kubernetes. Including code changes first introduced in PR#4211, which will remove taints from all nodes on restarts.
2022-05-17 12:37:42 -07:00
weidongcai
03a0475502
Expose backoff time parameters
2022-05-12 15:34:28 +08:00
Daniel Kłobuszewski
26769e4c1b
Expose nodes with unready GPU in CA status
...
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing GPU resources. Explicitly calling
them out in the status ConfigMap will point in the right direction.
2022-03-03 14:59:31 +01:00
Marwan Ahmed
8039af647e
move annotations to cloudprovider package
2021-06-08 10:56:35 -07:00
Marwan Ahmed
36460df246
annotate fakeNodes so that cloudprovider implementations can identify them if needed
2021-06-06 13:54:05 -07:00
Dharma Bellamkonda
e80f7c502b
Log names of longUnregistered Nodes
2021-05-12 14:09:01 -06:00
Vivek Bagade
8c592f0c04
Fix bug where a node that becomes ready after 2 mins can be treated as unready. Deprecated LongNotStarted
...
In cases where node n1 would:
1) Be created at t=0min
2) Ready condition is true at t=2.5min
3) Not ready taint is removed at t=3min
the ready node is counted as unready
Tested cases after fix:
1) Case described above
2) Nodes not starting even after 15mins are still
treated as unready
3) Nodes created long ago that suddenly become unready are
counted as unready.
2021-03-11 18:32:51 +01:00
Bartłomiej Wróblewski
0fb897b839
Update imports after scheduler/framework/v1alpha1 removal
2020-11-30 10:48:52 +00:00
Jakub Tużnik
6a528b45de
Include taints by condition when determining if a node is unready/still starting
...
Conditions and their corresponding taints can sometimes skew, which
can cause unnecessary scale-up. CA thinks nodes are ready because it
looks only at the conditions, but scheduler predicates fail because they
consider the taints as well. CA adds nodes, even though the existing
nodes are still starting. This commit brings CA behavior in line
with scheduler predicates behavior, eliminating the unnecessary
scale-up.
2020-11-02 11:15:42 +01:00
Marwan Ahmed
a3bada3708
correctly classify error for failed scale ups
2020-09-13 21:14:27 -07:00
Maciek Pytel
655b4081f4
Migrate to klog v2
2020-06-05 17:22:26 +02:00
Jakub Tużnik
73a5cdf928
Address recent breaking changes in scheduler
...
The following things changed in scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
2020-04-24 17:54:47 +02:00
Kubernetes Prow Robot
bf3a9fb52e
Merge pull request #2436 from Jeffwan/skip_first_acceptable_range_check
...
Skip acceptable range check before it has data
2019-12-10 01:49:29 -08:00
Jakub Tużnik
f64b6cd4de
CSR: fix a bug in GetClusterSize
...
Currently, GetClusterSize reports the target number for all autoscaled
node groups, but the actual number for _all_ node groups, even those
that are not autoscaled. This commit fixes that behavior so that both
target and actual size reported are from autoscaled node groups only.
2019-11-20 13:49:49 +01:00
Łukasz Osipiuk
aa53261098
More verbose logging of GCE instance create errors
2019-10-15 15:36:38 +02:00
Łukasz Osipiuk
288d4107b2
Rename GetCreatedNodesWithOutOfResourcesErrors to GetCreatedNodesWithErrors
2019-10-14 10:56:56 +02:00
Jiaxin Shan
0d278a2554
Skip acceptable range check before it has data
2019-10-09 17:59:43 -07:00
Thomas Hartland
7c17d52ec8
Invalidate node instances cache after deleting failed nodes
2019-09-30 13:56:33 +02:00
Kubernetes Prow Robot
6434df247d
Merge pull request #2304 from krzysztof-jastrzebski/fix_bug
...
Stop disabling Cluster Autoscaler when there are no ready nodes.
2019-09-06 07:06:57 -07:00
Krzysztof Jastrzebski
839cdaaa09
Stop disabling Cluster Autoscaler when there are no ready nodes.
2019-09-06 14:45:34 +02:00
Łukasz Osipiuk
79b4614328
Use NodeDiskPressure condition instead of NodeOutOfDisk
2019-09-05 23:23:43 +02:00
devinyan
3a633de55a
Check whether nodeGroup is nil to avoid a crash
2019-06-30 17:33:32 +08:00
Kubernetes Prow Robot
dd89fb1385
Merge pull request #2096 from frobware/fix-segv-in-updateReadinessStats
...
Fix potential SEGV in updateReadinessStats
2019-06-11 09:00:24 -07:00
Andrew McDermott
91016a605a
Fix SEGV in updateReadinessStats
...
Calling cloudprovider.NodeGroupForNode(unregistered.Node) can result
in a nil result for the nodegroup - handle that case.
2019-06-11 10:42:27 +01:00
Jakub Tużnik
bb382f47f9
Retain information about scale-up failures in CSR
...
This will provide the AutoscalingStatusProcessor with information
about failed scale-ups.
2019-06-05 16:53:30 +02:00
Łukasz Osipiuk
950a8a9f76
Quickly fail scaleup on all instance creation errors
...
Change-Id: Ib918251f3e3229d882d5182a98f129b77d7731a3
2019-06-03 13:32:41 +02:00
Łukasz Osipiuk
c88f014470
Add debug log in handleOutOfResourcesErrorsForNodeGroup
2019-05-31 15:26:41 +02:00
Krzysztof Jastrzebski
4831d76288
Cache cloud provider node instances in cluster state.
2019-05-31 10:11:51 +02:00
Pengfei Ni
b721438315
Revert "Use cloudProvider.GetInstanceID() to get unregistered nodes"
...
This reverts commit f4ef957ecd.
2019-03-08 10:47:26 +08:00
Pengfei Ni
f4ef957ecd
Use cloudProvider.GetInstanceID() to get unregistered nodes
2019-02-27 22:58:34 +08:00
Pengfei Ni
128729bae9
Move schedulercache to package nodeinfo
2019-02-21 12:41:08 +08:00
Łukasz Osipiuk
b5f9a9505c
Extend backoff interface with NodeInfo and error information
2019-01-09 11:25:34 +01:00
Łukasz Osipiuk
85a83b62bd
Pass nodeGroup->NodeInfo map to ClusterStateRegistry
...
Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801
2019-01-08 15:52:00 +01:00
Łukasz Osipiuk
5cddbda693
Rename nodeGroupBackoffInfo to backoff in ClusterStateRegistry
2018-12-31 17:59:58 +01:00
Łukasz Osipiuk
2fbae197f4
Handle possible stockout/quota scale-up errors
2018-12-28 17:17:07 +01:00
Łukasz Osipiuk
9689b30ee4
Do not use time.Now() in RegisterFailedScaleUp
2018-12-28 17:17:07 +01:00
Łukasz Osipiuk
da5bef307b
Allow updating Increase for ScaleUpRequest in ClusterStateRegistry
2018-12-28 17:17:07 +01:00
lsytj0413
8ca0e71d1e
refactor(*): fix some golint warnings
2018-12-24 11:07:15 +08:00
Łukasz Osipiuk
016bf7fc2c
Use k8s.io/klog instead of github.com/golang/glog
2018-11-26 17:30:31 +01:00
Łukasz Osipiuk
5962354c81
Inject Backoff instance to ClusterStateRegistry on creation
2018-11-13 14:25:16 +01:00
k8s-ci-robot
7008fb50be
Merge pull request #1380 from losipiuk/lo/backoff
...
Make Backoff interface
2018-11-07 05:13:43 -08:00
Łukasz Osipiuk
0e2c3739b7
Use NodeGroup as key in Backoff
2018-10-30 18:17:26 +01:00
Łukasz Osipiuk
55fc1e2f00
Store NodeGroup in ScaleUpRequest and ScaleDownRequest
2018-10-30 18:03:04 +01:00