* Added the new resource_name field to scaled_up/down_gpu_nodes_total,
representing the resource name for the gpu.
* Changed metrics registrations to use GpuConfig
This change adds a new metric, skipped_scale_events_count, which will
record the number of times that the CA has chosen to skip a scaling
event. The metric contains a label for the scaling direction (up or down)
and the reason.
This patch includes usages for the new metric based on CPU or Memory
limits being reached in eiter a scale up or down.
This function can take an variable amount of time due to various
conditions (ie. many nodegroups changes causing forced refreshes,
caches time to live expiries, ...).
Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (ie. throttling from cloud provider)
affecting cluster-autoscaler.
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.
The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`
This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.
User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
This change adds a new metric which counts the number of nodes removed
by the cluster autoscaler due to being unregistered with kubernetes.
User Story
As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
* Add GPU-related scaled_up & scaled_down metrics
* Fix name to match SD naming convention
* Fix import after master rebase
* Change the logic to include GPU-being-installed nodes
Many functions take an order of magnitude more time
if they actually decide to take an action (like deleting
node in scale-down) and it's ok if executing action is
slow. That makes summary less useful, as we expect to
have large outliers on some percentile, depending on
churn in cluster. Instead having a histogram gives
us the fuller picture of how the distribution of
function runtimes look like.
Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs.
The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime.
Example usage:
```
./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler --logtostderr
```
The configmap would look like:
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
name: cluster-autoscaler
namespace: kube-system
data:
settings: |-
{
"nodeGroups": [
{
"minSize": 1,
"maxSize": 2,
"name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"
}
]
}
```
Other notes:
* Make namespace defaults to "kube-system"
according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267
* Trigger a full-recreate on a configuration change
according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410
* Introduced `autoscaler/` and moved all the dynamic/recreatable-at-runtime parts of autoscaler into there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663)
* Extracted the core of CA(=`func Run()` in `main.go`) into `Autoscaler`
* `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change.
* Moved `scale_down*.go`, `scale_up*.go` and `utils*.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`)
* Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`