61 Commits

Author SHA1 Message Date
Kubernetes Prow Robot a2f890247b
Merge pull request #6396 from guopeng0/feature/node_group_healthy_metrics
feat:add node group health and back off metrics
2024-01-24 11:46:45 +01:00
Guo Peng 68e661f1ed feat:add node group health and back off metrics 2024-01-23 19:39:18 +08:00
Kubernetes Prow Robot d31e1cfb5a
Merge pull request #6453 from x13n/master
Use exponential buckets for function_duration_seconds
2024-01-18 06:47:39 +01:00
Daniel Kłobuszewski aa3bab1f21 Use exponential buckets for function_duration_seconds
Existing bucketing is inconsistent. Specifically, the second-to-last
bucket is [100, 1000), which is huge and makes it impossible to distinguish
between something that took 2m (120s) and something that took 15m (900s).
2024-01-17 14:26:54 +01:00
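
The bucketing problem described in the commit above can be illustrated with a small stdlib-only sketch. The helper `exponentialBuckets` mirrors the documented behaviour of `prometheus.ExponentialBuckets(start, factor, count)` but is reimplemented here so the example has no dependencies. The start, factor, and count values, and the coarse boundary list, are assumptions chosen for illustration, not the values used in the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// exponentialBuckets returns count upper bounds: start, start*factor,
// start*factor^2, and so on. Stdlib-only stand-in for the Prometheus helper.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

// bucketIndex returns the index of the first bucket whose upper bound is
// >= v, or len(bounds) if v only fits the implicit +Inf bucket.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	// Hypothetical coarse boundaries containing the [100, 1000) gap.
	coarse := []float64{1, 10, 100, 1000}
	// Hypothetical exponential parameters, for illustration only.
	fine := exponentialBuckets(0.01, 2, 20)

	for _, seconds := range []float64{120, 900} {
		fmt.Printf("%4.0fs: coarse bucket %d, exponential bucket %d\n",
			seconds, bucketIndex(coarse, seconds), bucketIndex(fine, seconds))
	}
}
```

With the coarse boundaries both durations fall into the same bucket, while the exponential boundaries separate them, which is the distinction the commit message asks for.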
Guo Peng ae0ab53060 feat:add node group health and back off metrics 2024-01-13 18:58:28 +08:00
guopeng 849e9e7332
Merge branch 'master' into feature/node_group_healthy_metrics 2024-01-02 12:02:37 +08:00
Guo Peng 89241e40c4 feat:add node group health and back off metrics 2024-01-02 11:50:56 +08:00
Joachim Bartosik 43b46b3875 Fix typo 2023-12-29 16:30:36 +00:00
Guo Peng 1255c95f27 feat:add node group health and back off metrics 2023-12-29 11:45:56 +08:00
Kubernetes Prow Robot fc48d5c052
Merge pull request #6139 from damikag/priority-evictor
Implement priority based evictor
2023-12-21 18:18:53 +01:00
damikag 9ffbea4408 implement priority based evictor and refactor drain logic 2023-12-21 16:57:05 +00:00
Guo Peng 044c03d09f feat:add node group health and back off metrics 2023-12-21 17:59:14 +08:00
Guo Peng eb5ef4bc83 feat: add metrics to show target size of every node group 2023-12-08 23:40:24 +08:00
Mahmoud Atwa 86ab017967 Fix multiple comments and update flags 2023-11-22 11:17:48 +00:00
Mahmoud Atwa a1ab7b9e20 Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat not-yet-processed pods as unschedulable
2023-11-22 11:16:33 +00:00
Piotr Wrótniak fe6eae5041 Reports node taints. 2023-10-20 13:47:51 +00:00
Karol Wychowaniec 2eba540d27 Add metrics for improved observability:
* pending_node_deletions
* failed_gpu_scale_ups_total
2023-07-25 13:01:36 +00:00
qianlei.qianl fab8ec7fd2 feat(*): add more metrics 2023-05-25 22:56:36 +08:00
Hakan Bostan 2ea2fb66f6 Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics
* Added the new resource_name field to scaled_up/down_gpu_nodes_total,
  representing the resource name for the gpu.
* Changed metrics registrations to use GpuConfig
2023-02-22 10:09:45 +00:00
Michael McCune da9d307e57 add metric for skipped scaling events
This change adds a new metric, skipped_scale_events_count, which will
record the number of times that the CA has chosen to skip a scaling
event. The metric contains a label for the scaling direction (up or down)
and the reason.

This patch includes usages for the new metric based on CPU or Memory
limits being reached in either a scale up or down.
2022-07-28 10:51:49 -04:00
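
A counter labelled by direction and reason, as the commit above describes, might behave roughly like the following stdlib-only sketch. The type `skippedScaleEvents`, its methods, and the reason strings are hypothetical names invented for this example; the real metric is registered through the Prometheus client library, which is not used here.

```go
package main

import (
	"fmt"
	"sync"
)

// labelSet identifies one time series by its two label values.
type labelSet struct {
	direction string // "up" or "down"
	reason    string // why the scaling event was skipped
}

// skippedScaleEvents is a hypothetical stand-in for a labelled counter.
type skippedScaleEvents struct {
	mu     sync.Mutex
	counts map[labelSet]int
}

func newSkippedScaleEvents() *skippedScaleEvents {
	return &skippedScaleEvents{counts: make(map[labelSet]int)}
}

// Inc records one skipped scaling event for the given direction and reason.
func (s *skippedScaleEvents) Inc(direction, reason string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[labelSet{direction, reason}]++
}

// Value returns the current count for one label combination.
func (s *skippedScaleEvents) Value(direction, reason string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[labelSet{direction, reason}]
}

func main() {
	m := newSkippedScaleEvents()
	// Reason strings below are illustrative assumptions.
	m.Inc("up", "cpu_limit_reached")
	m.Inc("up", "cpu_limit_reached")
	m.Inc("down", "memory_limit_reached")
	fmt.Println(m.Value("up", "cpu_limit_reached"),
		m.Value("down", "memory_limit_reached"))
}
```

Keeping direction and reason as separate labels lets each combination be counted and queried independently.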
Daniel Kłobuszewski 525145c651 Limit caching pods per owner reference 2022-03-15 10:03:04 +01:00
Kubernetes Prow Robot 9f84d391f6
Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics
[cluster-autoscaler] Publish node group min/max metrics
2021-07-05 07:38:54 -07:00
Benjamin Pineau 986fe3ae20 Metric for CloudProvider.Refresh() duration
This function can take a variable amount of time due to various
conditions (e.g. many nodegroup changes causing forced refreshes,
cache time-to-live expiries, ...).

Monitoring that duration is useful to diagnose those variations,
and to uncover external issues (e.g. throttling from the cloud provider)
affecting cluster-autoscaler.
2021-05-31 15:55:28 +02:00
Amr Hanafi (MAHDI)) f5c2ab7328 Emit the node group metrics behind a flag 2021-05-20 16:49:39 -07:00
Amr Hanafi (MAHDI)) 2bd7f0efa3 [cluster-autoscaler] Publish node group min/max metrics 2021-05-17 12:27:21 -07:00
Michael McCune a24ea6c66b add cluster cores and memory bytes count metrics
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
2021-04-06 10:35:21 -04:00
Brett Elliott 013fa19be3 Log failed scale up metric based on string of AutoscalerErrorType. 2021-03-23 15:37:04 +01:00
Brett Elliott 4cddaed2f2 Support for reporting authorization errors during scale up 2021-03-17 14:56:03 +01:00
Michael McCune 7ecf933e7b add a metric for unregistered nodes removed by cluster autoscaler
This change adds a new metric which counts the number of nodes removed
by the cluster autoscaler due to being unregistered with kubernetes.

User Story

As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
2021-03-04 19:23:03 -05:00
Evgenii Petrov b6f5d5567d Add unremovable_nodes_count metric 2021-02-12 15:47:34 +00:00
Marwan Ahmed a3bada3708 correctly classify error for failed scale ups 2020-09-13 21:14:27 -07:00
M. Habib Rosyad b7e02047f7 expose max-nodes-total as a metric 2020-08-19 17:43:39 +07:00
Maciek Pytel 655b4081f4 Migrate to klog v2 2020-06-05 17:22:26 +02:00
Łukasz Osipiuk b4c8bbb12c Fixes around metrics/ handler 2019-11-22 14:07:10 +01:00
Julien Balestra 012c8421da cluster-autoscaler/metrics: add a summary for function duration
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-28 16:28:16 +02:00
Julien Balestra 6d707a08ac cluster-autoscaler/metrics: expose the scale down cooldown
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-27 18:12:33 +02:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead of github.com/golang/glog 2018-11-26 17:30:31 +01:00
Karol Gołąb 67b834368b Add client-go metrics (rest_client_request_*). 2018-09-06 12:35:16 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Aleksandra Malinowska 3894ecb470 Export unregistered node count metric 2018-01-16 16:56:40 +01:00
Aleksandra Malinowska 3d33b64599 Export long unregistered node count metric 2018-01-16 16:07:24 +01:00
Aleksandra Malinowska 312f989c15 Don't register metrics unless on leading master 2017-12-14 16:08:20 +01:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel e12ee88f5f Add failed scale-up reason in metric 2017-09-26 13:40:34 +02:00
Maciej Pytel 7f7243ea98 Add reason field to failed_scale_ups_total metric
For now it's just a placeholder, will add proper logic
for next release
2017-09-25 16:33:49 +02:00
Maciej Pytel 5e05c84cf0 Add metric counting failed scale-ups
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Beata Skiba edeb522274 Add measuring of FilterOutSchedulable 2017-08-22 18:36:13 +02:00