Commit Graph

33 Commits

Author SHA1 Message Date
Marwan Ahmed a3bada3708 correctly classify error for failed scale ups 2020-09-13 21:14:27 -07:00
M. Habib Rosyad b7e02047f7 expose max-nodes-total as a metric 2020-08-19 17:43:39 +07:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Łukasz Osipiuk b4c8bbb12c Fixes around metrics/ handler 2019-11-22 14:07:10 +01:00
Julien Balestra 012c8421da cluster-autoscaler/metrics: add a summary for function duration
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-28 16:28:16 +02:00
Julien Balestra 6d707a08ac cluster-autoscaler/metrics: expose the scale down cooldown
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-27 18:12:33 +02:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead github.com/golang/glog 2018-11-26 17:30:31 +01:00
Karol Gołąb 67b834368b Add client-go metrics (rest_client_request_*). 2018-09-06 12:35:16 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Aleksandra Malinowska 3894ecb470 Export unregistered node count metric 2018-01-16 16:56:40 +01:00
Aleksandra Malinowska 3d33b64599 Export long unregistered node count metric 2018-01-16 16:07:24 +01:00
Aleksandra Malinowska 312f989c15 Don't register metrics unless on leading master 2017-12-14 16:08:20 +01:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel e12ee88f5f Add failed scale-up reason in metric 2017-09-26 13:40:34 +02:00
Maciej Pytel 7f7243ea98 Add reason field to faied_scale_ups_total metric
For now it's just a placeholder, will add proper logic
for next release
2017-09-25 16:33:49 +02:00
Maciej Pytel 5e05c84cf0 Add metric counting failed scale-ups
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Beata Skiba edeb522274 Add measuring of FilterOutSchedulable 2017-08-22 18:36:13 +02:00
Beata Skiba 43c9b6b06b Add cleaner function labels for metrics exporting. 2017-08-22 16:09:42 +02:00
Beata Skiba 14df1b808b Drill down scale down metrics
Split scale down duration into three parts:
1. Find nodes to remove
2. Node deletion
3. Misc operations
2017-08-18 14:17:02 +02:00
Maciej Pytel 1782cbc4ed Log long function execution 2017-08-07 11:21:15 +02:00
Beata Skiba 25f6242b99 Change histogram buckets. 2017-08-04 14:04:02 +02:00
Maciej Pytel 9123400fcf Change function duration metric to histogram
Many functions take an order of magnitude more time
if they actually decide to take an action (like deleting
node in scale-down) and it's ok if executing action is
slow. That makes summary less useful, as we expect to
have large outliers on some percentile, depending on
churn in cluster. Instead having a histogram gives
us the fuller picture of how the distribution of
function runtimes look like.
2017-06-23 12:06:28 +02:00
Marcin Wielgus 69c77791a2 Fix error types 2017-06-12 21:26:50 +02:00
Aleksandra Malinowska 972772440a Add failing health check if autoscaler loop consistently returns error 2017-05-29 11:31:57 +02:00
Aleksandra Malinowska 7c94367099 Add health check 2017-05-25 11:37:44 +02:00
Maciej Pytel f716a7e496 Add typed errors; add errors_total metric
To keep reasonable commit size only top-level files use
new errors. Will add them in other files in next commits.
2017-05-18 14:09:15 +02:00
Maciej Pytel 7a21a68b56 Add metrics counting CA operations 2017-05-15 13:03:00 +02:00
Maciej Pytel 4cdf06ea94 Added CA metrics related to autoscaler execution 2017-05-11 14:51:04 +02:00
Maciej Pytel 83ef3d2be3 Added CA metrics related to cluster state 2017-05-11 13:54:04 +02:00
Yusuke Kuoka baee799524 cluster-autoscaler: Dynamic Reconfiguration via ConfigMaps
Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs.

The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime.

Example usage:

```
./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler --logtostderr
```

The configmap would look like:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: cluster-autoscaler
  namespace: kube-system
data:
  settings: |-
    {
      "nodeGroups": [
        {
          "minSize": 1,
          "maxSize": 2,
          "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"
        }
      ]
    }
 ```

Other notes:

* Make namespace defaults to "kube-system"
according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267

* Trigger a full-recreate on a configuration change

according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410

* Introduced `autoscaler/` and moved  all the dynamic/recreatable-at-runtime parts of autoscaler into there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663)

* Extracted the core of CA(=`func Run()` in `main.go`) into `Autoscaler`

* `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change.

* Moved `scale_down*.go`, `scale_up*.go` and `utils*.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`)

* Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`
2017-02-24 20:36:47 +09:00