Automatic merge from submit-queue
Cluster-Autoscaler: reset unneededNodes list on cluster failure
If a node is marked as unneeded and the cluster goes into an unhealthy state shortly afterwards, the node will likely be deleted immediately on cluster recovery. This is because there is already an entry for it in the unneededNodes data structure, and the cluster downtime is counted towards the node's unneeded time.
It's not 100% obvious to me what should happen in this case, but I think it's better to play it safe and just wait the full 10 minutes after cluster recovery before we start deleting nodes. After a quick glance at the code I haven't spotted anything else that needs to be cleaned up in case of cluster failure, but maybe you have some other ideas @mwielgus?
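A minimal sketch of the idea (type and field names here are illustrative, not the actual CA types): clearing the unneeded-node bookkeeping when the cluster goes unhealthy resets the timer, so nodes have to stay unneeded for the full period again after recovery.

```go
package sketch

import "time"

// scaleDown is a stand-in for the autoscaler's scale-down state; the real
// type keeps more bookkeeping than shown here.
type scaleDown struct {
	// unneededSince maps node name -> time the node was first seen as unneeded.
	unneededSince map[string]time.Time
}

// CleanUp drops the accumulated "unneeded since" entries. Calling it when the
// cluster becomes unhealthy means the unneeded timer restarts from zero after
// recovery, so nodes are not deleted immediately.
func (sd *scaleDown) CleanUp() {
	sd.unneededSince = make(map[string]time.Time)
}
```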
Automatic merge from submit-queue
Cluster-Autoscaler: update status configmap on errors
Previously it would only update after successfully completing the main
loop, meaning the status wouldn't get updated unless the cluster was healthy.
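Roughly the shape of the fix, with hypothetical names: the status write is deferred so it also runs on the error path.

```go
package sketch

// clusterStateRegistry stands in for the CA ClusterStateRegistry.
type clusterStateRegistry struct{ statusWritten bool }

// updateStatusConfigMap would write the status configmap in the real code;
// here it only records that an update happened.
func (csr *clusterStateRegistry) updateStatusConfigMap() {
	csr.statusWritten = true
}

// runOnce defers the status write so it also runs when the iteration bails
// out early with an error.
func runOnce(csr *clusterStateRegistry, updateClusterState func() error) error {
	defer csr.updateStatusConfigMap()

	if err := updateClusterState(); err != nil {
		return err // the deferred call still updates the status configmap
	}
	// ... rest of the main autoscaling loop ...
	return nil
}
```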
Automatic merge from submit-queue
Cluster-Autoscaler: consider node with unknown readiness unready
A node with a non-responsive kubelet seems to be marked with NodeReady: Unknown, which CA currently treats as ready.
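A sketch of the intended check, written against current k8s.io/api types rather than whatever is vendored here: only an explicitly True Ready condition counts as ready.

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// isNodeReady treats a Ready condition with status Unknown (e.g. from a
// non-responsive kubelet) the same as NotReady: only an explicit True counts.
func isNodeReady(node *apiv1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady {
			return cond.Status == apiv1.ConditionTrue
		}
	}
	// No Ready condition reported at all: consider the node unready too.
	return false
}
```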
Automatic merge from submit-queue
Cluster-autoscaler: fix NotTriggerScaleUp event
This should fix a failing e2e test.
Also updated some scale_up unit tests to check created events and fixed a typo in a variable name.
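For context, roughly how such an event gets recorded on a pending pod; the recorder wiring and the exact message text are assumptions, not necessarily what CA emits verbatim.

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// emitNoScaleUpEvent records a NotTriggerScaleUp event on a pending pod,
// which is what the e2e test looks for.
func emitNoScaleUpEvent(recorder record.EventRecorder, pod *apiv1.Pod) {
	recorder.Event(pod, apiv1.EventTypeNormal, "NotTriggerScaleUp",
		"pod didn't trigger scale-up (it wouldn't fit if a new node is added)")
}
```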
Automatic merge from submit-queue
Cluster-Autoscaler: fix delete taint failing
It was using an old version of the node object (which in general will always be outdated, as we've likely modified it by adding the delete taint).
@mwielgus
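Roughly what the fix amounts to, sketched with current client-go signatures (the vendored client here may differ): re-read the node just before removing the taint instead of updating a stale copy.

```go
package sketch

import (
	"context"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanDeleteTaint re-reads the node before removing the taint, instead of
// updating a copy that was fetched before the taint was added (and is
// therefore outdated).
func cleanDeleteTaint(ctx context.Context, client kubernetes.Interface, nodeName, taintKey string) error {
	fresh, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := make([]apiv1.Taint, 0, len(fresh.Spec.Taints))
	for _, t := range fresh.Spec.Taints {
		if t.Key != taintKey {
			kept = append(kept, t)
		}
	}
	fresh.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, fresh, metav1.UpdateOptions{})
	return err
}
```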
Automatic merge from submit-queue
Cluster-Autoscaler: fix delete taint value format
Fix a bug where a non-compliant value format prevented the CA delete taint from being created (which in turn caused CA node drain to fail).
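A sketch of a compliant taint, assuming the usual scale-down taint key: the value has to pass the API server's taint-value validation (alphanumerics plus '-', '_', '.'), so a plain Unix timestamp works where a formatted time string with spaces and colons does not.

```go
package sketch

import (
	"fmt"
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// deleteTaint builds the scale-down taint with a value that passes taint-value
// validation. A formatted time such as "2017-01-02 15:04:05 +0000 UTC" would
// be rejected, which is the bug described above.
func deleteTaint() apiv1.Taint {
	return apiv1.Taint{
		Key:    "ToBeDeletedByClusterAutoscaler",
		Value:  fmt.Sprint(time.Now().Unix()), // e.g. "1490000000"
		Effect: apiv1.TaintEffectNoSchedule,
	}
}
```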
Automatic merge from submit-queue
Cluster-Autoscaler: Update timestamps in status configmap
Update the LastProbeTime and LastTransitionTime fields in ClusterStateRegistry (previously they weren't populated and always showed as the epoch in the status). Update the scale-down part of the status whenever the list of unneeded nodes in CA changes.
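The gist of the timestamp handling, with simplified stand-in types: stamp LastProbeTime on every pass, bump LastTransitionTime only when the status actually changes.

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// condition mirrors the shape of a status-configmap condition; the real
// types live in the cluster-autoscaler clusterstate package.
type condition struct {
	Status             string
	LastProbeTime      metav1.Time
	LastTransitionTime metav1.Time
}

// refresh stamps LastProbeTime on every pass and only bumps
// LastTransitionTime when the status changes, so neither field is left
// showing the epoch.
func refresh(c *condition, newStatus string, now time.Time) {
	c.LastProbeTime = metav1.NewTime(now)
	if c.Status != newStatus {
		c.Status = newStatus
		c.LastTransitionTime = metav1.NewTime(now)
	}
}
```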
Automatic merge from submit-queue
Cluster-Autoscaler: skip nodes currently under deletion in scale down
Currently we may try to delete the same node multiple times.
cc: @MaciekPytel @jszczepkowski @fgrzadkowski
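A minimal sketch of the filtering, assuming the scale-down taint key is how "under deletion" is detected:

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// hasDeletionTaint reports whether the node is already being drained by the
// autoscaler, using the scale-down taint as the marker.
func hasDeletionTaint(node *apiv1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == "ToBeDeletedByClusterAutoscaler" {
			return true
		}
	}
	return false
}

// scaleDownCandidates drops nodes that are already under deletion, so the
// same node is not picked for removal twice.
func scaleDownCandidates(nodes []*apiv1.Node) []*apiv1.Node {
	candidates := make([]*apiv1.Node, 0, len(nodes))
	for _, n := range nodes {
		if !hasDeletionTaint(n) {
			candidates = append(candidates, n)
		}
	}
	return candidates
}
```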
Automatic merge from submit-queue
Cluster-autoscaler: include PodDisruptionBudget in drain - part 1/2
In part 1 of 2 we skip nodes that have a pod with 0 pod disruptions allowed. Part 2/2 will delete pods using eviction.
cc: @jszczepkowski @MaciekPytel @davidopp @fgrzadkowski
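A rough sketch of the part-1 check, written against the current policy/v1 types (the original change predates them):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// blockedByPDB returns true when some pod on the node is covered by a
// PodDisruptionBudget that currently allows zero disruptions; such a node
// is skipped by scale-down in this first part of the change.
func blockedByPDB(podsOnNode []*apiv1.Pod, pdbs []*policyv1.PodDisruptionBudget) (bool, error) {
	for _, pdb := range pdbs {
		selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			return false, err
		}
		for _, pod := range podsOnNode {
			if pod.Namespace == pdb.Namespace &&
				selector.Matches(labels.Set(pod.Labels)) &&
				pdb.Status.DisruptionsAllowed == 0 {
				return true, nil
			}
		}
	}
	return false, nil
}
```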
Automatic merge from submit-queue
Cluster-Autoscaler events on status configmap
Write events on the status configmap on scale up / scale down. Moved writing to the status configmap into a separate file (clusterstate/utils/status.go) to clean things up and also allow reasonably isolated unit tests.
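For illustration, attaching such an event to the configmap object is just a recorder call; the reason and message here are made up.

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordScaleEvent attaches a scale-up/scale-down event to the status
// configmap object so it shows up when describing the configmap.
func recordScaleEvent(recorder record.EventRecorder, statusConfigMap *apiv1.ConfigMap, reason, message string) {
	recorder.Event(statusConfigMap, apiv1.EventTypeNormal, reason, message)
}
```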
Automatic merge from submit-queue
Cluster-autoscaler: precheck that the api server link is ok
The logs from leader election are super vague. An explicit check is needed to let the user know that the connection could not be established.
cc: @jszczepkowski @MaciekPytel
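One possible shape of such a precheck, sketched with client-go (the actual call CA makes may differ): issue one explicit request before leader election starts so a bad master URL or kubeconfig fails loudly.

```go
package sketch

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
)

// checkAPIServer performs a single explicit request before leader election,
// so connection problems surface as a clear error instead of vague
// leader-election log messages.
func checkAPIServer(client kubernetes.Interface) error {
	if _, err := client.Discovery().ServerVersion(); err != nil {
		return fmt.Errorf("failed to contact the API server: %v", err)
	}
	return nil
}
```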
Automatic merge from submit-queue
Add mwilegus to /hack/OWNERS
Cluster Autoscaler is an actively developed project and from time to time we have to add a couple of flags to it. It would be good to have approval rights for verify-flags.