Existing bucketing is inconsistent. Specifically, the second-to-last
bucket is [100, 1000), which is so wide that it cannot differentiate
between something that took 2m (120s) and something that took 15m (900s).
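As an illustration only (the metric name, label, and exact bucket bounds below are assumptions, not the values actually committed), a duration histogram with finer upper bounds between 100s and 1000s could look like this:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of finer-grained duration buckets so that e.g. a 120s run and a
// 900s run land in different buckets. Names and bounds are illustrative.
var functionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cluster_autoscaler",
		Name:      "function_duration_seconds",
		Help:      "Time taken by various parts of CA main loop.",
		Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10, 20, 30,
			60, 120, 300, 600, 900, 1800},
	},
	[]string{"function"},
)

func init() {
	prometheus.MustRegister(functionDuration)
}
```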
* Added a new resource_name label to scaled_up/down_gpu_nodes_total,
representing the resource name of the GPU (see the sketch after this list).
* Changed metrics registrations to use GpuConfig
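A minimal sketch of what such a labelled counter and its GpuConfig-driven registration could look like; the label names, helper, and GpuConfig fields here are assumptions for illustration, not the exact code in cluster-autoscaler:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// GpuConfig stands in for the cloud-provider GPU configuration; the field
// names are illustrative.
type GpuConfig struct {
	ResourceName string // e.g. "nvidia.com/gpu"
	Type         string // e.g. "nvidia-tesla-t4"
}

// Counter for GPU nodes added by CA, labelled by GPU resource name and type.
var scaledUpGpuNodes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "scaled_up_gpu_nodes_total",
		Help:      "Number of GPU nodes added by CA, by GPU resource name and GPU type.",
	},
	[]string{"gpu_resource_name", "gpu_name"},
)

func init() {
	prometheus.MustRegister(scaledUpGpuNodes)
}

// RegisterScaleUpGpu records n GPU nodes added for the given GPU config.
func RegisterScaleUpGpu(gpu GpuConfig, n int) {
	scaledUpGpuNodes.WithLabelValues(gpu.ResourceName, gpu.Type).Add(float64(n))
}
```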
This change adds a new metric, skipped_scale_events_count, which records
the number of times the CA has chosen to skip a scaling event. The metric
carries labels for the scaling direction (up or down) and the reason.
This patch also records the new metric when CPU or memory limits are
reached during either a scale-up or a scale-down.
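A hedged sketch of such a counter and the limit-reached call sites; the label values, reason strings, and helper names are illustrative, not necessarily the ones used by the CA:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of scaling events that CA has chosen to skip, labelled by
// direction and reason. Exact names and values are illustrative.
var skippedScaleEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "skipped_scale_events_count",
		Help:      "Count of scaling events that the CA has chosen to skip.",
	},
	[]string{"direction", "reason"},
)

func init() {
	prometheus.MustRegister(skippedScaleEvents)
}

// Example call sites for the CPU/memory limit cases described above.
func RegisterSkippedScaleUpCPULimit() {
	skippedScaleEvents.WithLabelValues("up", "CpuResourceLimit").Inc()
}

func RegisterSkippedScaleDownMemoryLimit() {
	skippedScaleEvents.WithLabelValues("down", "MemoryResourceLimit").Inc()
}
```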
This function can take a variable amount of time depending on various
conditions (e.g. many node group changes causing forced refreshes,
cache time-to-live expiries, ...).
Monitoring that duration is useful to diagnose those variations
and to uncover external issues (e.g. throttling by the cloud provider)
affecting cluster-autoscaler.
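A minimal sketch of timing such a call and feeding the duration into a histogram; the metric name, bucket layout, and wrapper function are assumptions for illustration:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Histogram of how long the variable-duration refresh takes; the name and
// buckets are illustrative.
var refreshDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cluster_autoscaler",
	Name:      "node_group_refresh_duration_seconds",
	Help:      "Time spent refreshing node groups and their caches.",
	Buckets:   prometheus.ExponentialBuckets(0.1, 2, 12), // ~0.1s to ~200s
})

func init() {
	prometheus.MustRegister(refreshDuration)
}

// timedRefresh wraps the actual refresh call and observes its duration, so
// cloud-provider throttling or forced refreshes become visible in metrics.
func timedRefresh(refresh func() error) error {
	start := time.Now()
	err := refresh()
	refreshDuration.Observe(time.Since(start).Seconds())
	return err
}
```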
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.
The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`
This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.
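For illustration, the four gauges could be defined roughly as follows; using a `direction` label to carry the minimum/maximum bound, and the update helper at the end, are assumptions of this sketch rather than the CA's actual implementation:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Gauges for cluster-wide CPU/memory limits and current totals.
var (
	cpuLimitsCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cpu_limits_cores",
		Help:      "Minimum and maximum number of CPU cores allowed in the cluster.",
	}, []string{"direction"})

	cpuCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_cpu_current_cores",
		Help:      "Current number of CPU cores in the cluster.",
	})

	memoryLimitsBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "memory_limits_bytes",
		Help:      "Minimum and maximum number of bytes of memory allowed in the cluster.",
	}, []string{"direction"})

	memoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "cluster_memory_current_bytes",
		Help:      "Current number of bytes of memory in the cluster.",
	})
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, cpuCurrentCores,
		memoryLimitsBytes, memoryCurrentBytes)
}

// UpdateClusterResources would be called from the main loop after listing
// nodes; the function name and parameters are illustrative.
func UpdateClusterResources(minCores, maxCores, currentCores,
	minMemory, maxMemory, currentMemory int64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(float64(minCores))
	cpuLimitsCores.WithLabelValues("maximum").Set(float64(maxCores))
	cpuCurrentCores.Set(float64(currentCores))
	memoryLimitsBytes.WithLabelValues("minimum").Set(float64(minMemory))
	memoryLimitsBytes.WithLabelValues("maximum").Set(float64(maxMemory))
	memoryCurrentBytes.Set(float64(currentMemory))
}
```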
User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
This change adds a new metric that counts the number of nodes removed
by the cluster autoscaler because they failed to register with Kubernetes.
User Story
As a cluster-autoscaler user, I would like to know when the autoscaler
is cleaning up nodes that have failed to register with Kubernetes. I
would like to monitor the rate at which failed nodes are being removed
so that I can better alert on infrastructure issues which may go
unnoticed elsewhere.
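A hedged sketch of such a counter and where it would be incremented; the metric name and helper are illustrative, not necessarily what the CA uses:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of nodes removed because they never registered with Kubernetes.
// The name is illustrative.
var unregisteredNodesRemoved = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cluster_autoscaler",
	Name:      "unregistered_nodes_removed_count",
	Help:      "Number of nodes removed because they failed to register with Kubernetes.",
})

func init() {
	prometheus.MustRegister(unregisteredNodesRemoved)
}

// RegisterOldNodesRemoved would be invoked after cleaning up nodes that never
// registered; count is the number of nodes removed in this iteration.
func RegisterOldNodesRemoved(count int) {
	unregisteredNodesRemoved.Add(float64(count))
}
```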
* Add GPU-related scaled_up & scaled_down metrics
* Fix name to match SD naming convention
* Fix import after master rebase
* Change the logic to include GPU-being-installed nodes