This change adds a new metric, `skipped_scale_events_count`, which
records the number of times the Cluster Autoscaler (CA) has chosen to
skip a scaling event. The metric has labels for the scaling direction
(up or down) and the reason for skipping.
This patch includes usages of the new metric for CPU or memory
limits being reached during either a scale-up or a scale-down.
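As a hedged illustration (not the actual autoscaler code), such a counter could be defined with client_golang roughly as below; the label names `direction` and `reason` follow the description above, while the package name, helper function, and the `CpuResourceLimit` reason value are assumptions:

```go
// Sketch only: a CounterVec with "direction" and "reason" labels for
// skipped scaling events. Names are illustrative, not the CA's real code.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var skippedScaleEventsCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "skipped_scale_events_count",
		Help:      "Count of scaling events that the CA has chosen to skip.",
	},
	[]string{"direction", "reason"},
)

func init() {
	prometheus.MustRegister(skippedScaleEventsCount)
}

// RegisterSkippedScaleUpCPU records a scale-up skipped because the
// cluster-wide CPU limit was reached (reason value is an assumption).
func RegisterSkippedScaleUpCPU() {
	skippedScaleEventsCount.WithLabelValues("up", "CpuResourceLimit").Inc()
}
```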
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.
The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`
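A minimal sketch of how the four gauges listed above could be defined with client_golang, assuming the limits metrics carry a `direction` label with `minimum`/`maximum` values; the label name, help strings, and helper function are assumptions, not the autoscaler's actual definitions:

```go
// Sketch only: gauges for CPU/memory limits and current totals.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	cpuLimitsCores = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "cpu_limits_cores",
			Help:      "Minimum and maximum number of cores in the cluster.",
		},
		[]string{"direction"}, // assumed values: "minimum" or "maximum"
	)
	clusterCPUCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_cpu_current_cores",
		Help:      "Current number of cores in the cluster.",
	})
	memoryLimitsBytes = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "memory_limits_bytes",
			Help:      "Minimum and maximum bytes of memory in the cluster.",
		},
		[]string{"direction"},
	)
	clusterMemoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_memory_current_bytes",
		Help:      "Current bytes of memory in the cluster.",
	})
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, clusterCPUCurrentCores,
		memoryLimitsBytes, clusterMemoryCurrentBytes)
}

// UpdateCPULimits publishes the configured min/max core limits.
func UpdateCPULimits(minCores, maxCores int64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(float64(minCores))
	cpuLimitsCores.WithLabelValues("maximum").Set(float64(maxCores))
}
```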
This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was not previously documented there.
User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
This change adds a new metric that counts the number of nodes removed
by the cluster autoscaler due to being unregistered with Kubernetes.
User Story
As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with Kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
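A hedged sketch of such a counter; the metric name `old_unregistered_nodes_removed_count` and the helper function below are assumptions based on the description above, not necessarily the final names:

```go
// Sketch only: counter for unregistered nodes removed by the CA.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var oldUnregisteredNodesRemovedCount = prometheus.NewCounter(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "old_unregistered_nodes_removed_count",
		Help:      "Number of unregistered nodes removed by the cluster autoscaler.",
	},
)

func init() {
	prometheus.MustRegister(oldUnregisteredNodesRemovedCount)
}

// RegisterOldUnregisteredNodesRemoved bumps the counter by the number of
// nodes cleaned up in a single loop iteration.
func RegisterOldUnregisteredNodesRemoved(nodesCount int) {
	oldUnregisteredNodesRemovedCount.Add(float64(nodesCount))
}
```

With a counter like this, a PromQL query along the lines of `rate(cluster_autoscaler_old_unregistered_nodes_removed_count[30m]) > 0` (exact metric name depending on the final implementation) could drive an alert on the removal rate described in the user story.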
The following things changed in the scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
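As a hedged illustration of the `NodeInfo.Pods` change, code that previously ranged over `*apiv1.Pod` values now has to unwrap them from `*schedulerframework.PodInfo`. The function name below is illustrative, and the exact import path for the scheduler framework depends on the Kubernetes version:

```go
// Sketch only: adapting to NodeInfo.Pods being []*schedulerframework.PodInfo.
package utils

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"
)

// podsOnNode extracts the raw pods from a scheduler framework NodeInfo.
func podsOnNode(nodeInfo *schedulerframework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods))
	for _, podInfo := range nodeInfo.Pods {
		// Pods is now exposed directly as a field, and each element
		// wraps the original pod in PodInfo.Pod.
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```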
"n1-standard-..." should be changed to "N1-standard-..."
In this doc, mostly using "N1-standard-...", but there are several "n1-standard-...", too.
It's better to use same format.
Many functions take an order of magnitude more time
if they actually decide to take an action (like deleting
a node in scale-down), and it's OK if executing the action
is slow. That makes a summary less useful, as we expect
large outliers at some percentile, depending on churn in
the cluster. A histogram instead gives us a fuller picture
of what the distribution of function runtimes looks like.
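A minimal sketch of the summary-to-histogram switch, assuming exponential buckets and a `function` label; the bucket boundaries, metric name, and helper function are assumptions rather than the exact autoscaler definitions:

```go
// Sketch only: a HistogramVec for per-function durations. Exponential
// buckets cover both the fast no-op case and the slow "action taken" case,
// so outliers widen the top buckets instead of skewing a precomputed
// percentile the way a Summary would.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var functionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cluster_autoscaler",
		Name:      "function_duration_seconds",
		Help:      "Time taken by various parts of the CA main loop.",
		Buckets:   prometheus.ExponentialBuckets(0.01, 2, 16), // 10ms .. ~5.5min
	},
	[]string{"function"},
)

func init() {
	prometheus.MustRegister(functionDuration)
}

// UpdateDuration records how long a named function took.
func UpdateDuration(label string, duration time.Duration) {
	functionDuration.WithLabelValues(label).Observe(duration.Seconds())
}
```

With a histogram, quantiles can still be estimated at query time (e.g. via `histogram_quantile()` over the bucket series), while the full distribution remains visible.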