This change adds GitHub users arunmk, mrajashree, jackfrancis,
shysank, and randomvariable as reviewers for the cluster-api
provider. It also removes frobware and ncdc from the approvers and
reviewers.

Each test works in isolation, but the suite as a whole (e.g. make
test-in-docker) crashes, because the underlying metrics library
panics when the same metric is registered twice.
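
For illustration, a minimal sketch of a guard against the double
registration, assuming the metrics are Prometheus collectors
(registerOrReuse is a hypothetical helper, not code from this
repository):

    import "github.com/prometheus/client_golang/prometheus"

    // registerOrReuse registers c once; if an earlier test already
    // registered the same metric, it reuses the existing collector
    // instead of letting the registration panic.
    func registerOrReuse(reg prometheus.Registerer, c prometheus.Collector) prometheus.Collector {
        if err := reg.Register(c); err != nil {
            if are, ok := err.(prometheus.AlreadyRegisteredError); ok {
                return are.ExistingCollector
            }
            panic(err)
        }
        return c
    }

Register, unlike MustRegister, reports the duplicate as an
AlreadyRegisteredError instead of panicking, so the caller can
decide how to handle it.
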
We received an issue from one of our customers whose autoscaler pod
regularly crashes with a nil pointer dereference panic. Analyzing
the code, we found that the autoscaler polls the server status to
determine whether a server is running.
The Hetzner Cloud Go client is implemented in such a way that it
does not return an error if a resource could not be found. Instead
it returns nil for both the error and the resource. Usually this is
not an issue. However, if server creation fails, the server is
deleted from Hetzner Cloud. This in turn leads to nil being returned
and the panic described above.
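
As an illustration, a minimal sketch of the missing nil guard,
assuming the hcloud-go client (serverIsRunning is a hypothetical
helper):

    import (
        "context"
        "fmt"

        "github.com/hetznercloud/hcloud-go/hcloud"
    )

    func serverIsRunning(ctx context.Context, client *hcloud.Client, id int) (bool, error) {
        srv, _, err := client.Server.GetByID(ctx, id)
        if err != nil {
            return false, err
        }
        if srv == nil {
            // hcloud-go reports "not found" as a nil server with a
            // nil error, so this case must be handled before srv is
            // dereferenced.
            return false, fmt.Errorf("server %d no longer exists", id)
        }
        return srv.Status == hcloud.ServerStatusRunning, nil
    }
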
The Hetzner Cloud API implements a concept called Actions. Whenever
a long-running process is triggered, the API returns an Action
object which can be used to track the progress of the task. The
Action object reliably indicates whether a server has been created
and provides access to any error that may have occurred. This commit
replaces polling the server status with using the Action object.
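
A rough sketch of the action-based approach, again assuming
hcloud-go (createServer is a hypothetical wrapper; error handling is
illustrative):

    func createServer(ctx context.Context, client *hcloud.Client, opts hcloud.ServerCreateOpts) (*hcloud.Server, error) {
        result, _, err := client.Server.Create(ctx, opts)
        if err != nil {
            return nil, err
        }
        // The Action tracks the creation itself; waiting on it
        // surfaces a provisioning failure directly instead of a
        // vanished server and a nil pointer later.
        _, errCh := client.Action.WatchProgress(ctx, result.Action)
        if err := <-errCh; err != nil {
            return nil, fmt.Errorf("server creation failed: %w", err)
        }
        return result.Server, nil
    }
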
This makes it possible to securely store the access token in a file and
load it into the cloud provider from there.
Document DigitalOcean's cloud config format while we are here.
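
For reference, the cloud config is a small JSON file passed to the
autoscaler via the --cloud-config flag; a sketch, assuming the
provider's cluster_id and token fields (check the README for the
authoritative format):

    {
        "cluster_id": "<id of the DOKS cluster>",
        "token": "<DigitalOcean API access token>"
    }

Mounting this file from a Secret keeps the token off the command
line.
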
An initial version of the DigitalOcean cloud provider implementation
relied on tags to define its behavior, but it has since transitioned
to using the public DOKS API. Update the README accordingly.

This fixes the comment at the beginning of the addon-resizer
deployment example so that it describes the correct key names to use
for configuring the cpu and memory parameters.

If the API is temporarily unavailable, cluster-autoscaler will
crash-loop on startup during the initial call to Refresh(). This
makes for a bad user/operator experience, since it becomes harder to
tell API problems apart from cluster/workload problems.
Let the autoscaler start up and retry fetching node pool information
from the API as part of the pre-existing periodic sync. This should
be no different from experiencing transient API problems at runtime.
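
A sketch of the resulting behavior (the manager type, its fields,
and ListNodePools are hypothetical stand-ins for the provider's
actual code):

    import (
        "context"

        "k8s.io/klog/v2"
    )

    type nodePool struct{ ID string }

    type nodePoolLister interface {
        ListNodePools(ctx context.Context) ([]nodePool, error)
    }

    type manager struct {
        client    nodePoolLister
        nodePools []nodePool
    }

    // Refresh runs as part of the periodic sync. A transient API
    // failure keeps the last known state and is retried on the next
    // sync instead of being treated as fatal, so a temporarily
    // unavailable API cannot crash-loop the autoscaler at startup.
    func (m *manager) Refresh(ctx context.Context) error {
        pools, err := m.client.ListNodePools(ctx)
        if err != nil {
            klog.Warningf("failed to refresh node pools, will retry on next sync: %v", err)
            return nil
        }
        m.nodePools = pools
        return nil
    }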