autoscaler

Commit Graph

Author	SHA1	Message	Date
Karol Wychowaniec	2eba540d27	Add metrics for improved observability: * pending_node_deletions * failed_gpu_scale_ups_total	2023-07-25 13:01:36 +00:00
qianlei.qianl	fab8ec7fd2	feat(*): add more metrics	2023-05-25 22:56:36 +08:00
Hakan Bostan	2ea2fb66f6	Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics * Added the new resource_name field to scaled_up/down_gpu_nodes_total, representing the resource name for the gpu. * Changed metrics registrations to use GpuConfig	2023-02-22 10:09:45 +00:00
Michael McCune	da9d307e57	add metric for skipped scaling events This change adds a new metric, skipped_scale_events_count, which will record the number of times that the CA has chosen to skip a scaling event. The metric contains a label for the scaling direction (up or down) and the reason. This patch includes usages for the new metric based on CPU or Memory limits being reached in either a scale up or down.	2022-07-28 10:51:49 -04:00
Daniel Kłobuszewski	525145c651	Limit caching pods per owner reference	2022-03-15 10:03:04 +01:00
Kubernetes Prow Robot	9f84d391f6	Merge pull request #4022 from amrmahdi/amrh/nodegroupminmaxmetrics [cluster-autoscaler] Publish node group min/max metrics	2021-07-05 07:38:54 -07:00
Benjamin Pineau	986fe3ae20	Metric for CloudProvider.Refresh() duration This function can take a variable amount of time due to various conditions (i.e. many nodegroups changes causing forced refreshes, caches time to live expiries, ...). Monitoring that duration is useful to diagnose those variations, and to uncover external issues (i.e. throttling from cloud provider) affecting cluster-autoscaler.	2021-05-31 15:55:28 +02:00
Amr Hanafi (MAHDI))	f5c2ab7328	Emit the node group metrics behind a flag	2021-05-20 16:49:39 -07:00
Amr Hanafi (MAHDI))	2bd7f0efa3	[cluster-autoscaler] Publish node group min/max metrics	2021-05-17 12:27:21 -07:00
Michael McCune	a24ea6c66b	add cluster cores and memory bytes count metrics This change adds 4 metrics that can be used to monitor the minimum and maximum limits for CPU and memory, as well as the current counts in cores and bytes, respectively. The four metrics added are: * `cluster_autoscaler_cpu_limits_cores` * `cluster_autoscaler_cluster_cpu_current_cores` * `cluster_autoscaler_memory_limits_bytes` * `cluster_autoscaler_cluster_memory_current_bytes` This change also adds the `max_cores_total` metric to the metrics proposal doc, as it was previously not recorded there. User story: As a cluster autoscaler user, I would like to monitor my cluster through metrics to determine when the cluster is nearing its limits for cores and memory usage.	2021-04-06 10:35:21 -04:00
Brett Elliott	013fa19be3	Log failed scale up metric based on string of AutoscalerErrorType.	2021-03-23 15:37:04 +01:00
Brett Elliott	4cddaed2f2	Support for reporting authorization errors during scale up	2021-03-17 14:56:03 +01:00
Michael McCune	7ecf933e7b	add a metric for unregistered nodes removed by cluster autoscaler This change adds a new metric which counts the number of nodes removed by the cluster autoscaler due to being unregistered with kubernetes. User Story As a cluster-autoscaler user, I would like to know when the autoscaler is cleaning up nodes that have failed to register with kubernetes. I would like to monitor the rate at which failed nodes are being removed so that I can better alert on infrastructure issues which may go unnoticed elsewhere.	2021-03-04 19:23:03 -05:00
Evgenii Petrov	b6f5d5567d	Add unremovable_nodes_count metric	2021-02-12 15:47:34 +00:00
Marwan Ahmed	a3bada3708	correctly classify error for failed scale ups	2020-09-13 21:14:27 -07:00
M. Habib Rosyad	b7e02047f7	expose max-nodes-total as a metric	2020-08-19 17:43:39 +07:00
Maciek Pytel	655b4081f4	Migrate to klog v2	2020-06-05 17:22:26 +02:00
Łukasz Osipiuk	b4c8bbb12c	Fixes around metrics/ handler	2019-11-22 14:07:10 +01:00
Julien Balestra	012c8421da	cluster-autoscaler/metrics: add a summary for function duration Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>	2019-08-28 16:28:16 +02:00
Julien Balestra	6d707a08ac	cluster-autoscaler/metrics: expose the scale down cooldown Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>	2019-08-27 18:12:33 +02:00
Jacek Kaniuk	0c64e0932a	Tainting unneeded nodes as PreferNoSchedule	2019-01-21 13:06:50 +01:00
Łukasz Osipiuk	016bf7fc2c	Use k8s.io/klog instead github.com/golang/glog	2018-11-26 17:30:31 +01:00
Karol Gołąb	67b834368b	Add client-go metrics (rest_client_request_*).	2018-09-06 12:35:16 +02:00
Karol Gołąb	aae4d1270a	Make GetGpuTypeForMetrics more robust	2018-06-26 21:35:16 +02:00
Karol Gołąb	5eb7021f82	Add GPU-related scaled_up & scaled_down metrics (#974 ) * Add GPU-related scaled_up & scaled_down metrics * Fix name to match SD naming convention * Fix import after master rebase * Change the logic to include GPU-being-installed nodes	2018-06-22 21:00:52 +02:00
Aleksandra Malinowska	3894ecb470	Export unregistered node count metric	2018-01-16 16:56:40 +01:00
Aleksandra Malinowska	3d33b64599	Export long unregistered node count metric	2018-01-16 16:07:24 +01:00
Aleksandra Malinowska	312f989c15	Don't register metrics unless on leading master	2017-12-14 16:08:20 +01:00
Maciej Pytel	c376ef3c87	Add metrics for autoprovisioning	2017-10-31 17:42:58 +01:00
Maciej Pytel	e12ee88f5f	Add failed scale-up reason in metric	2017-09-26 13:40:34 +02:00
Maciej Pytel	7f7243ea98	Add reason field to failed_scale_ups_total metric For now it's just a placeholder, will add proper logic for next release	2017-09-25 16:33:49 +02:00
Maciej Pytel	5e05c84cf0	Add metric counting failed scale-ups A minor refactor was required to avoid cyclic imports	2017-09-22 18:12:50 +02:00
Marcin Wielgus	2d8f59e23d	Set verbosity for each of the glog.Info logs	2017-09-01 12:34:29 +02:00
Beata Skiba	edeb522274	Add measuring of FilterOutSchedulable	2017-08-22 18:36:13 +02:00
Beata Skiba	43c9b6b06b	Add cleaner function labels for metrics exporting.	2017-08-22 16:09:42 +02:00
Beata Skiba	14df1b808b	Drill down scale down metrics Split scale down duration into three parts: 1. Find nodes to remove 2. Node deletion 3. Misc operations	2017-08-18 14:17:02 +02:00
Maciej Pytel	1782cbc4ed	Log long function execution	2017-08-07 11:21:15 +02:00
Beata Skiba	25f6242b99	Change histogram buckets.	2017-08-04 14:04:02 +02:00
Maciej Pytel	9123400fcf	Change function duration metric to histogram Many functions take an order of magnitude more time if they actually decide to take an action (like deleting node in scale-down) and it's ok if executing action is slow. That makes summary less useful, as we expect to have large outliers on some percentile, depending on churn in cluster. Instead having a histogram gives us the fuller picture of what the distribution of function runtimes looks like.	2017-06-23 12:06:28 +02:00
Marcin Wielgus	69c77791a2	Fix error types	2017-06-12 21:26:50 +02:00
Maciej Pytel	f716a7e496	Add typed errors; add errors_total metric To keep reasonable commit size only top-level files use new errors. Will add them in other files in next commits.	2017-05-18 14:09:15 +02:00
Maciej Pytel	7a21a68b56	Add metrics counting CA operations	2017-05-15 13:03:00 +02:00
Maciej Pytel	4cdf06ea94	Added CA metrics related to autoscaler execution	2017-05-11 14:51:04 +02:00
Maciej Pytel	83ef3d2be3	Added CA metrics related to cluster state	2017-05-11 13:54:04 +02:00
Yusuke Kuoka	baee799524	cluster-autoscaler: Dynamic Reconfiguration via ConfigMaps Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs. The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime. Example usage: ``` ./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler --logtostderr ``` The configmap would look like: ```yaml kind: ConfigMap apiVersion: v1 metadata: name: cluster-autoscaler namespace: kube-system data: settings: |- { "nodeGroups": [ { "minSize": 1, "maxSize": 2, "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5" } ] } ``` Other notes: * Make namespace defaults to "kube-system" according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267 * Trigger a full-recreate on a configuration change according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410 * Introduced `autoscaler/` and moved all the dynamic/recreatable-at-runtime parts of autoscaler into there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663) * Extracted the core of CA(=`func Run()` in `main.go`) into `Autoscaler` * `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change. * Moved `scale_down.go`, `scale_up.go` and `utils.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`) Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`	2017-02-24 20:36:47 +09:00

45 Commits