Existing bucketing is inconsistent. Specifically, the second-to-last
bucket is [100, 1000), which is so wide that it cannot differentiate
between something that took 2m (120s) and something that took 15m (900s).
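As an illustration only (the metric name, label, and exact bucket bounds below are assumptions, not the values actually committed), a duration histogram with finer upper bounds between 100s and 1000s could look like this:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of finer-grained duration buckets so that e.g. a 120s run and a
// 900s run land in different buckets. Names and bounds are illustrative.
var functionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cluster_autoscaler",
		Name:      "function_duration_seconds",
		Help:      "Time taken by various parts of CA main loop.",
		Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10, 20, 30,
			60, 120, 300, 600, 900, 1800},
	},
	[]string{"function"},
)

func init() {
	prometheus.MustRegister(functionDuration)
}
```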
* Added a new resource_name label to scaled_up/down_gpu_nodes_total,
representing the resource name of the GPU (see the sketch after this list).
* Changed metrics registrations to use GpuConfig
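A minimal sketch of what such a labelled counter and its GpuConfig-driven registration could look like; the label names, helper, and GpuConfig fields here are assumptions for illustration, not the exact code in cluster-autoscaler:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// GpuConfig stands in for the cloud-provider GPU configuration; the field
// names are illustrative.
type GpuConfig struct {
	ResourceName string // e.g. "nvidia.com/gpu"
	Type         string // e.g. "nvidia-tesla-t4"
}

// Counter for GPU nodes added by CA, labelled by GPU resource name and type.
var scaledUpGpuNodes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "scaled_up_gpu_nodes_total",
		Help:      "Number of GPU nodes added by CA, by GPU resource name and GPU type.",
	},
	[]string{"gpu_resource_name", "gpu_name"},
)

func init() {
	prometheus.MustRegister(scaledUpGpuNodes)
}

// RegisterScaleUpGpu records n GPU nodes added for the given GPU config.
func RegisterScaleUpGpu(gpu GpuConfig, n int) {
	scaledUpGpuNodes.WithLabelValues(gpu.ResourceName, gpu.Type).Add(float64(n))
}
```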
This change adds a new metric, skipped_scale_events_count, which records
the number of times the CA has chosen to skip a scaling event. The metric
carries labels for the scaling direction (up or down) and the reason.
This patch also records the new metric when CPU or memory limits are
reached during either a scale-up or a scale-down.
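A hedged sketch of such a counter and the limit-reached call sites; the label values, reason strings, and helper names are illustrative, not necessarily the ones used by the CA:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of scaling events that CA has chosen to skip, labelled by
// direction and reason. Exact names and values are illustrative.
var skippedScaleEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "skipped_scale_events_count",
		Help:      "Count of scaling events that the CA has chosen to skip.",
	},
	[]string{"direction", "reason"},
)

func init() {
	prometheus.MustRegister(skippedScaleEvents)
}

// Example call sites for the CPU/memory limit cases described above.
func RegisterSkippedScaleUpCPULimit() {
	skippedScaleEvents.WithLabelValues("up", "CpuResourceLimit").Inc()
}

func RegisterSkippedScaleDownMemoryLimit() {
	skippedScaleEvents.WithLabelValues("down", "MemoryResourceLimit").Inc()
}
```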
This function can take a variable amount of time depending on various
conditions (e.g. many node group changes causing forced refreshes,
cache time-to-live expiries, ...).
Monitoring that duration is useful to diagnose those variations
and to uncover external issues (e.g. throttling by the cloud provider)
affecting cluster-autoscaler.
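A minimal sketch of timing such a call and feeding the duration into a histogram; the metric name, bucket layout, and wrapper function are assumptions for illustration:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Histogram of how long the variable-duration refresh takes; the name and
// buckets are illustrative.
var refreshDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cluster_autoscaler",
	Name:      "node_group_refresh_duration_seconds",
	Help:      "Time spent refreshing node groups and their caches.",
	Buckets:   prometheus.ExponentialBuckets(0.1, 2, 12), // ~0.1s to ~200s
})

func init() {
	prometheus.MustRegister(refreshDuration)
}

// timedRefresh wraps the actual refresh call and observes its duration, so
// cloud-provider throttling or forced refreshes become visible in metrics.
func timedRefresh(refresh func() error) error {
	start := time.Now()
	err := refresh()
	refreshDuration.Observe(time.Since(start).Seconds())
	return err
}
```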
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.
The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`
This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.
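For illustration, the four gauges could be defined roughly as follows; using a `direction` label to carry the minimum/maximum bound, and the update helper at the end, are assumptions of this sketch rather than the CA's actual implementation:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Gauges for cluster-wide CPU/memory limits and current totals.
var (
	cpuLimitsCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cpu_limits_cores",
		Help:      "Minimum and maximum number of CPU cores allowed in the cluster.",
	}, []string{"direction"})

	cpuCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_cpu_current_cores",
		Help:      "Current number of CPU cores in the cluster.",
	})

	memoryLimitsBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "memory_limits_bytes",
		Help:      "Minimum and maximum number of bytes of memory allowed in the cluster.",
	}, []string{"direction"})

	memoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_memory_current_bytes",
		Help:      "Current number of bytes of memory in the cluster.",
	})
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, cpuCurrentCores,
		memoryLimitsBytes, memoryCurrentBytes)
}

// UpdateClusterResources would be called from the main loop after listing
// nodes; the function name and parameters are illustrative.
func UpdateClusterResources(minCores, maxCores, currentCores,
	minMemory, maxMemory, currentMemory int64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(float64(minCores))
	cpuLimitsCores.WithLabelValues("maximum").Set(float64(maxCores))
	cpuCurrentCores.Set(float64(currentCores))
	memoryLimitsBytes.WithLabelValues("minimum").Set(float64(minMemory))
	memoryLimitsBytes.WithLabelValues("maximum").Set(float64(maxMemory))
	memoryCurrentBytes.Set(float64(currentMemory))
}
```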
User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
This change adds a new metric that counts the number of nodes removed
by the cluster autoscaler because they failed to register with Kubernetes.
User Story
As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with Kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
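A hedged sketch of such a counter and where it would be incremented; the metric name and helper are illustrative, not necessarily what the CA uses:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of nodes removed because they never registered with Kubernetes.
// The name is illustrative.
var unregisteredNodesRemoved = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cluster_autoscaler",
	Name:      "unregistered_nodes_removed_count",
	Help:      "Number of nodes removed because they failed to register with Kubernetes.",
})

func init() {
	prometheus.MustRegister(unregisteredNodesRemoved)
}

// RegisterOldNodesRemoved would be invoked after cleaning up nodes that never
// registered; count is the number of nodes removed in this iteration.
func RegisterOldNodesRemoved(count int) {
	unregisteredNodesRemoved.Add(float64(count))
}
```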
* Add GPU-related scaled_up & scaled_down metrics
* Fix name to match SD naming convention
* Fix import after master rebase
* Change the logic to include GPU-being-installed nodes