From bbe4de82e1c0144a7c1fb740eb45af80b77db2e3 Mon Sep 17 00:00:00 2001
From: Renaud Gaubert
Date: Tue, 21 Jul 2020 21:11:52 +0000
Subject: [PATCH] Add DisableAcceleratorUsageMetrics feature flag

Signed-off-by: Renaud Gaubert
---
 .../concepts/cluster-administration/system-metrics.md | 8 ++++++++
 .../command-line-tools-reference/feature-gates.md | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md
index 577e5b8baf..e727aa17ce 100644
--- a/content/en/docs/concepts/cluster-administration/system-metrics.md
+++ b/content/en/docs/concepts/cluster-administration/system-metrics.md
@@ -98,6 +98,14 @@ Take metric `A` as an example, here assumed that `A` is deprecated in 1.n. Accor
 
 If you're upgrading from release `1.12` to `1.13`, but still depend on a metric `A` deprecated in `1.12`, you should set hidden metrics via command line: `--show-hidden-metrics=1.12` and remember to remove this metric dependency before upgrading to `1.14`
 
+## Disable accelerator metrics
+
+The kubelet collects accelerator metrics through cAdvisor. To collect these metrics, for accelerators like NVIDIA GPUs, kubelet held an open handle on the driver. This meant that in order to perform infrastructure changes (for example, updating the driver), a cluster administrator needed to stop the kubelet agent.
+
+The responsibility for collecting accelerator metrics now belongs to the vendor rather than the kubelet. Vendors must provide a container that collects metrics and exposes them to the metrics service (for example, Prometheus).
+
+The [`DisableAcceleratorUsageMetrics` feature gate](/docs/references/command-line-tools-reference/feature-gate.md#feature-gates-for-alpha-or-beta-features:~:text= DisableAcceleratorUsageMetrics,-false) disables metrics collected by the kubelet, with a [timeline for enabling this feature by default](https://github.com/kubernetes/enhancements/tree/411e51027db842355bd489691af897afc1a41a5e/keps/sig-node/1867-disable-accelerator-usage-metrics#graduation-criteria).
+
 ## Component metrics
 
 ### kube-controller-manager metrics
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates.md b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
index 32339be605..4df2ddeb05 100644
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
@@ -88,6 +88,7 @@ different Kubernetes components.
 | `DefaultPodTopologySpread` | `false` | Alpha | 1.19 | |
 | `DevicePlugins` | `false` | Alpha | 1.8 | 1.9 |
 | `DevicePlugins` | `true` | Beta | 1.10 | |
+| `DisableAcceleratorUsageMetrics` | `false` | Alpha | 1.19 | 1.20 |
 | `DryRun` | `false` | Alpha | 1.12 | 1.12 |
 | `DryRun` | `true` | Beta | 1.13 | |
 | `DynamicKubeletConfig` | `false` | Alpha | 1.4 | 1.10 |
@@ -420,6 +421,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
 - `CustomResourceWebhookConversion`: Enable webhook-based conversion
   on resources created from [CustomResourceDefinition](/docs/concepts/api-extension/custom-resources/).
   troubleshoot a running Pod.
+- `DisableAcceleratorUsageMetrics`: [Disable accelerator metrics collected by the kubelet](/docs/concepts/cluster-administration/monitoring.md).
 - `DevicePlugins`: Enable the [device-plugins](/docs/concepts/cluster-administration/device-plugins/)
   based resource provisioning on nodes.
 - `DefaultPodTopologySpread`: Enables the use of `PodTopologySpread` scheduling plugin to do