Commit Graph

45 Commits

Author SHA1 Message Date
Karol Wychowaniec 2eba540d27 Add metrics for improved observability:
* pending_node_deletions
* failed_gpu_scale_ups_total
2023-07-25 13:01:36 +00:00
qianlei.qianl fab8ec7fd2 feat(*): add more metrics 2023-05-25 22:56:36 +08:00
Hakan Bostan 2ea2fb66f6 Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics
* Added the new resource_name field to scaled_up/down_gpu_nodes_total,
  representing the resource name for the gpu.
* Changed metrics registrations to use GpuConfig
2023-02-22 10:09:45 +00:00
Michael McCune da9d307e57 add metric for skipped scaling events
This change adds a new metric, skipped_scale_events_count, which will
record the number of times that the CA has chosen to skip a scaling
event. The metric contains a label for the scaling direction (up or down)
and the reason.

This patch includes usages for the new metric based on CPU or Memory
limits being reached in either a scale up or down.
2022-07-28 10:51:49 -04:00
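
The labelled counter described in da9d307e57 can be sketched roughly as below. This is a stdlib-only illustration, not the cluster-autoscaler code: the `skipKey` and `skippedScaleEvents` types and the reason strings are hypothetical stand-ins for a counter with direction and reason labels.

```go
package main

import "fmt"

// skipKey is a hypothetical label pair: scaling direction plus skip reason.
type skipKey struct {
	Direction string // "up" or "down"
	Reason    string // illustrative values only, e.g. "CpuResourceLimit"
}

// skippedScaleEvents stands in for a counter metric with two labels.
type skippedScaleEvents map[skipKey]int

// Inc records one skipped scaling event for the given label values.
func (c skippedScaleEvents) Inc(direction, reason string) {
	c[skipKey{Direction: direction, Reason: reason}]++
}

func main() {
	counts := skippedScaleEvents{}
	counts.Inc("up", "CpuResourceLimit")
	counts.Inc("up", "CpuResourceLimit")
	counts.Inc("down", "MemoryResourceLimit")

	// Each (direction, reason) combination is tracked as its own series.
	for key, n := range counts {
		fmt.Printf("direction=%s reason=%s count=%d\n", key.Direction, key.Reason, n)
	}
}
```

Keeping direction and reason as separate labels lets a query aggregate by either one, for example all skipped scale-ups regardless of reason.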
Daniel Kłobuszewski 525145c651 Limit caching pods per owner reference 2022-03-15 10:03:04 +01:00
Kubernetes Prow Robot 9f84d391f6
Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics
[cluster-autoscaler] Publish node group min/max metrics
2021-07-05 07:38:54 -07:00
Benjamin Pineau 986fe3ae20 Metric for CloudProvider.Refresh() duration
This function can take a variable amount of time due to various
conditions (e.g. many node group changes causing forced refreshes,
cache time-to-live expiries, ...).

Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (e.g. throttling from the cloud provider)
affecting cluster-autoscaler.
2021-05-31 15:55:28 +02:00
Amr Hanafi (MAHDI)) f5c2ab7328 Emit the node group metrics behind a flag 2021-05-20 16:49:39 -07:00
Amr Hanafi (MAHDI)) 2bd7f0efa3 [cluster-autoscaler] Publish node group min/max metrics 2021-05-17 12:27:21 -07:00
Michael McCune a24ea6c66b add cluster cores and memory bytes count metrics
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
2021-04-06 10:35:21 -04:00
Brett Elliott 013fa19be3 Log failed scale up metric based on string of AutoscalerErrorType. 2021-03-23 15:37:04 +01:00
Brett Elliott 4cddaed2f2 Support for reporting authorization errors during scale up 2021-03-17 14:56:03 +01:00
Michael McCune 7ecf933e7b add a metric for unregistered nodes removed by cluster autoscaler
This change adds a new metric which counts the number of nodes removed
by the cluster autoscaler due to being unregistered with kubernetes.

User Story

As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
2021-03-04 19:23:03 -05:00
Evgenii Petrov b6f5d5567d Add unremovable_nodes_count metric 2021-02-12 15:47:34 +00:00
Marwan Ahmed a3bada3708 correctly classify error for failed scale ups 2020-09-13 21:14:27 -07:00
M. Habib Rosyad b7e02047f7 expose max-nodes-total as a metric 2020-08-19 17:43:39 +07:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Łukasz Osipiuk b4c8bbb12c Fixes around metrics/ handler 2019-11-22 14:07:10 +01:00
Julien Balestra 012c8421da cluster-autoscaler/metrics: add a summary for function duration
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-28 16:28:16 +02:00
Julien Balestra 6d707a08ac cluster-autoscaler/metrics: expose the scale down cooldown
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-27 18:12:33 +02:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead of github.com/golang/glog 2018-11-26 17:30:31 +01:00
Karol Gołąb 67b834368b Add client-go metrics (rest_client_request_*). 2018-09-06 12:35:16 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Aleksandra Malinowska 3894ecb470 Export unregistered node count metric 2018-01-16 16:56:40 +01:00
Aleksandra Malinowska 3d33b64599 Export long unregistered node count metric 2018-01-16 16:07:24 +01:00
Aleksandra Malinowska 312f989c15 Don't register metrics unless on leading master 2017-12-14 16:08:20 +01:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel e12ee88f5f Add failed scale-up reason in metric 2017-09-26 13:40:34 +02:00
Maciej Pytel 7f7243ea98 Add reason field to failed_scale_ups_total metric
For now it's just a placeholder, will add proper logic
for next release
2017-09-25 16:33:49 +02:00
Maciej Pytel 5e05c84cf0 Add metric counting failed scale-ups
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Beata Skiba edeb522274 Add measuring of FilterOutSchedulable 2017-08-22 18:36:13 +02:00
Beata Skiba 43c9b6b06b Add cleaner function labels for metrics exporting. 2017-08-22 16:09:42 +02:00
Beata Skiba 14df1b808b Drill down scale down metrics
Split scale down duration into three parts:
1. Find nodes to remove
2. Node deletion
3. Misc operations
2017-08-18 14:17:02 +02:00
Maciej Pytel 1782cbc4ed Log long function execution 2017-08-07 11:21:15 +02:00
Beata Skiba 25f6242b99 Change histogram buckets. 2017-08-04 14:04:02 +02:00
Maciej Pytel 9123400fcf Change function duration metric to histogram
Many functions take an order of magnitude more time
if they actually decide to take an action (like deleting
a node in scale-down), and it's OK if executing an action is
slow. That makes a summary less useful, as we expect to
have large outliers on some percentile, depending on
churn in the cluster. A histogram instead gives
us a fuller picture of what the distribution of
function runtimes looks like.
2017-06-23 12:06:28 +02:00
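
The reasoning in 9123400fcf can be illustrated with a minimal cumulative-bucket histogram. This is a stdlib-only sketch under assumptions: the `histogram` type and the bucket bounds (0.1s, 1s, 10s) are hypothetical and are not the buckets used by cluster-autoscaler.

```go
package main

import "fmt"

// histogram counts observations in cumulative buckets: counts[i] is the
// number of observations <= bounds[i]; total acts as the +Inf bucket.
type histogram struct {
	bounds []float64
	counts []int
	total  int
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]int, len(bounds))}
}

// Observe records one function duration, in seconds.
func (h *histogram) Observe(seconds float64) {
	h.total++
	for i, bound := range h.bounds {
		if seconds <= bound {
			h.counts[i]++
		}
	}
}

func main() {
	h := newHistogram([]float64{0.1, 1, 10})

	// Four fast runs and one slow run that actually performed an action.
	for _, d := range []float64{0.02, 0.03, 0.05, 0.04, 7.5} {
		h.Observe(d)
	}

	for i, bound := range h.bounds {
		fmt.Printf("le=%v: %d\n", bound, h.counts[i])
	}
	fmt.Printf("le=+Inf: %d\n", h.total)
}
```

The slow run appears only in the widest bucket while the fast runs stay in the smallest one, so the shape of the distribution remains visible instead of being folded into a single percentile.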
Marcin Wielgus 69c77791a2 Fix error types 2017-06-12 21:26:50 +02:00
Maciej Pytel f716a7e496 Add typed errors; add errors_total metric
To keep the commit size reasonable, only top-level files use the
new errors. Other files will be converted in later commits.
2017-05-18 14:09:15 +02:00
Maciej Pytel 7a21a68b56 Add metrics counting CA operations 2017-05-15 13:03:00 +02:00
Maciej Pytel 4cdf06ea94 Added CA metrics related to autoscaler execution 2017-05-11 14:51:04 +02:00
Maciej Pytel 83ef3d2be3 Added CA metrics related to cluster state 2017-05-11 13:54:04 +02:00
Yusuke Kuoka baee799524 cluster-autoscaler: Dynamic Reconfiguration via ConfigMaps
Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs.

The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime.

Example usage:

```
./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler
```

The configmap would look like:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: cluster-autoscaler
  namespace: kube-system
data:
  settings: |-
    {
      "nodeGroups": [
        {
          "minSize": 1,
          "maxSize": 2,
          "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"
        }
      ]
    }
```

Other notes:

* Make namespace default to "kube-system",
  according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267

* Trigger a full recreate on a configuration change,
  according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410

* Introduced `autoscaler/` and moved all the dynamic (recreatable-at-runtime) parts of the autoscaler there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663)

* Extracted the core of CA (`func Run()` in `main.go`) into `Autoscaler`

* `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change.

* Moved `scale_down*.go`, `scale_up*.go` and `utils*.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`)

* Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`
2017-02-24 20:36:47 +09:00
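
The wrapper approach described in baee799524 (recreate the inner instance when the configmap content changes) can be sketched as follows. This is a stdlib-only illustration, not the actual implementation: the `dynamicAutoscaler`, `autoscaler`, `settings` and `nodeGroupSpec` types and the `Poll` method are hypothetical names, and the JSON shape is taken from the `settings` value in the ConfigMap example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeGroupSpec mirrors one entry of "nodeGroups" in the settings JSON.
type nodeGroupSpec struct {
	Name    string `json:"name"`
	MinSize int    `json:"minSize"`
	MaxSize int    `json:"maxSize"`
}

type settings struct {
	NodeGroups []nodeGroupSpec `json:"nodeGroups"`
}

// autoscaler is a hypothetical stand-in for the core built from one config.
type autoscaler struct {
	groups []nodeGroupSpec
}

// dynamicAutoscaler wraps an autoscaler and rebuilds it on config changes.
type dynamicAutoscaler struct {
	inner   *autoscaler
	lastRaw string // raw settings that produced the current inner instance
}

// Poll is meant to run once per scan interval with the current settings
// string. It reports whether the inner autoscaler was recreated.
func (d *dynamicAutoscaler) Poll(raw string) (bool, error) {
	if d.inner != nil && raw == d.lastRaw {
		return false, nil // unchanged: keep the current instance
	}
	var s settings
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return false, err // invalid settings: keep the previous instance
	}
	d.inner = &autoscaler{groups: s.NodeGroups}
	d.lastRaw = raw
	return true, nil
}

func main() {
	d := &dynamicAutoscaler{}
	raw := `{"nodeGroups":[{"minSize":1,"maxSize":2,"name":"pool-a"}]}`

	for i := 0; i < 2; i++ {
		recreated, err := d.Poll(raw)
		if err != nil {
			fmt.Println("poll failed:", err)
			return
		}
		fmt.Printf("poll %d: recreated=%v groups=%d\n", i+1, recreated, len(d.inner.groups))
	}
}
```

Comparing the raw string is the simplest change check; a real implementation might compare parsed values or a resource version instead.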