[serving] OTel Metrics Documentation (#6352)

* Updating collecting metrics steps

* remove collector installation steps and separate out shared metrics into a snippet

* combine notes

* update nav to remove stutter

* include webhook metrics

* update common metrics with attributes and include serving metrics

* update notice banners for eventing and serving

* fix typos

* Update docs/eventing/observability/metrics/collecting-metrics.md

Co-authored-by: Calum Murray <cmurray@redhat.com>

---------

Co-authored-by: Calum Murray <cmurray@redhat.com>
Dave Protasowski 2025-09-03 14:48:44 -04:00 committed by GitHub
parent 31cca91976
commit b936c72e46
11 changed files with 362 additions and 532 deletions


@ -188,7 +188,7 @@ nav:
- Configuring logging: serving/observability/logging/config-logging.md
- Configuring Request logging: serving/observability/logging/request-logging.md
- Collecting metrics: serving/observability/metrics/collecting-metrics.md
- Knative Serving metrics: serving/observability/metrics/serving-metrics.md
- Metrics Reference: serving/observability/metrics/serving-metrics.md
# Serving - troubleshooting
- Troubleshooting:
- Debugging application issues: serving/troubleshooting/debugging-application-issues.md
@ -307,7 +307,7 @@ nav:
- Collecting logs: eventing/observability/logging/collecting-logs.md
- Configuring logging: eventing/observability/logging/config-logging.md
- Collecting metrics: eventing/observability/metrics/collecting-metrics.md
- Knative Eventing metrics: eventing/observability/metrics/eventing-metrics.md
- Metrics Reference: eventing/observability/metrics/eventing-metrics.md
- Features:
- About Eventing features: eventing/features/README.md
- DeliverySpec.Timeout field: eventing/features/delivery-timeout.md


@ -6,3 +6,40 @@ function: how-to
---
--8<-- "collecting-metrics.md"
### Enabling Metric Collection
1. To enable Prometheus metrics collection, update the `config-observability` ConfigMap and set `metrics-protocol` to `prometheus`, as shown below; an apply-and-verify sketch follows the ConfigMap.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-eventing
data:
  # metrics-protocol specifies the protocol used when exporting metrics.
  # Supported values are 'none' (the default), 'prometheus', 'http/protobuf' (OTLP over HTTP), and 'grpc' (OTLP over gRPC).
  metrics-protocol: prometheus
  tracing-protocol: http/protobuf
  tracing-endpoint: http://jaeger-collector.observability.svc:4318/v1/traces
  tracing-sampling-rate: "1"
```
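A minimal apply-and-verify sketch (the file name `config-observability.yaml` is only an example for saving the ConfigMap above):
```bash
# Apply the ConfigMap shown above (saved locally as config-observability.yaml in this example)
kubectl apply -f config-observability.yaml

# Confirm the exporter protocol took effect
kubectl get configmap config-observability -n knative-eventing \
  -o jsonpath='{.data.metrics-protocol}'
```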
### Apply the Eventing Service/Pod Monitors
1. Apply the ServiceMonitors/PodMonitors to collect metrics from the Knative Eventing control plane; a quick check that the monitors were created follows the command below.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/config/eventing-monitors.yaml
```
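To sanity-check that the monitors were created (an illustrative check; the namespaces they land in depend on the manifest above):
```bash
# List the Prometheus Operator monitor resources and filter for the Eventing ones
kubectl get servicemonitors,podmonitors --all-namespaces | grep -i eventing
```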
### Import Grafana dashboards
1. Grafana dashboards can be imported from the [`monitoring` repository](https://github.com/knative-extensions/monitoring).
1. If you are using the Grafana Helm chart with the dashboard sidecar enabled (the default), you can load the dashboards by applying the following ConfigMaps; a sketch for checking the sidecar label follows the command.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/config/configmap-eventing-dashboard.yaml
```
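If the dashboards do not show up, a rough check is to confirm the ConfigMaps carry the label the sidecar watches for (`grafana_dashboard` is the kube-prometheus-stack default and may differ in your installation):
```bash
# ConfigMaps labeled for the Grafana dashboard sidecar (label name assumed from kube-prometheus-stack defaults)
kubectl get configmaps --all-namespaces -l grafana_dashboard
```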


@ -1,96 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: metrics
data:
collector.yaml: |
receivers:
opencensus:
endpoint: "0.0.0.0:55678"
exporters:
logging:
prometheus:
endpoint: "0.0.0.0:8889"
extensions:
health_check:
pprof:
zpages:
service:
extensions: [health_check, pprof, zpages]
pipelines:
metrics:
receivers: [opencensus]
processors: []
exporters: [prometheus]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: metrics
labels:
app: otel-collector
spec:
selector:
matchLabels:
app: otel-collector
replicas: 1 # This can be increased for a larger system.
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
args:
- --config=/conf/collector.yaml
image: otel/opentelemetry-collector:latest
resources:
requests: # Note: these are suitable for a small instance, but may need to be increased for a large instance.
memory: 100Mi
cpu: 50m
ports:
- name: otel
containerPort: 55678
- name: prom-export
containerPort: 8889
- name: zpages # A /debug page
containerPort: 55679
volumeMounts:
- mountPath: /conf
name: config
volumes:
- name: config
configMap:
name: otel-collector-config
items:
- key: collector.yaml
path: collector.yaml
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: metrics
spec:
selector:
app: "otel-collector"
ports:
- port: 55678
name: otel
---
apiVersion: v1
kind: Service
metadata:
name: otel-export
namespace: metrics
labels:
app: otel-export
spec:
selector:
app: otel-collector
ports:
- port: 8889
name: prom-export


@ -5,7 +5,15 @@ components:
function: reference
---
# Knative Eventing metrics
# Knative Eventing Metrics
!!! warning
The metrics below have not yet been updated to reflect our migration from OpenCensus to OpenTelemetry and may change as that migration is completed.
Administrators can view metrics for Knative Eventing components.
@ -42,11 +50,6 @@ By aggregating the metrics over the http code, events can be separated into two
| event_count | Number of events dispatched by the in-memory channel | Counter | container_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Dimensionless | Stable
| event_dispatch_latencies | The time spent dispatching an event from an in-memory Channel | Histogram | container_name<br>event_type<br>namespace_name<br>response_code<br>response_code_class<br>unique_name | Milliseconds | Stable
!!! note
A number of metrics eg. controller, Go runtime and others are omitted here as they are common
across most components. For more about these metrics check the
[Serving metrics API section](../../../serving/observability/metrics/serving-metrics.md).
## Eventing sources
Eventing sources are created by users who own the related system, so they can trigger applications with events.
@ -57,3 +60,5 @@ to verify that events have been delivered from the source side, thus verifying t
|:-|:-|:-|:-|:-|:-|
| event_count | Number of events sent by the source | Counter | event_source<br>event_type<br>name<br>namespace_name<br>resource_group<br>response_code<br>response_code_class<br>response_error<br>response_timeout | Dimensionless | Stable |
| retry_event_count | Number of events sent by the source in retries | Counter | event_source<br>event_type<br>name<br>namespace_name<br>resource_group<br>response_code<br>response_code_class<br>response_error<br>response_timeout | Dimensionless | Stable
--8<-- "observability-shared-metrics.md"

[Image diff suppressed: removed SVG diagram, 239 KiB]


@ -6,3 +6,46 @@ function: how-to
---
--8<-- "collecting-metrics.md"
### Enabling Metric Collection
1. To enable Prometheus metrics collection, update the `config-observability` ConfigMap and set `metrics-protocol` to `prometheus`. For request metrics, we recommend pushing metrics to Prometheus, which requires enabling the Prometheus OTLP receiver; this is already configured in our monitoring example. A `kubectl patch` sketch of this change is shown after this list.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  # metrics-protocol specifies the protocol used when exporting metrics.
  # Supported values are 'none' (the default), 'prometheus', 'http/protobuf' (OTLP over HTTP), and 'grpc' (OTLP over gRPC).
  metrics-protocol: prometheus
  # request-metrics-protocol and request-metrics-endpoint configure how request metrics are exported.
  request-metrics-protocol: http/protobuf
  request-metrics-endpoint: http://knative-kube-prometheus-st-prometheus.observability.svc:9090/api/v1/otlp/v1/metrics
  tracing-protocol: http/protobuf
  tracing-endpoint: http://jaeger-collector.observability.svc:4318/v1/traces
  tracing-sampling-rate: "1"
```
1. Apply the ServiceMonitors/PodMonitors to collect metrics from the Knative Serving control plane.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/config/serving-monitors.yaml
```
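For reference, a `kubectl patch` sketch that is equivalent to the ConfigMap change in the first step above (the request-metrics endpoint is the example value from that ConfigMap and depends on your Prometheus installation):
```bash
# Patch config-observability in place; adjust the endpoint to match your Prometheus OTLP receiver
kubectl patch configmap config-observability -n knative-serving --type merge -p '{
  "data": {
    "metrics-protocol": "prometheus",
    "request-metrics-protocol": "http/protobuf",
    "request-metrics-endpoint": "http://knative-kube-prometheus-st-prometheus.observability.svc:9090/api/v1/otlp/v1/metrics"
  }
}'
```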
### Import Grafana dashboards
1. Grafana dashboards can be imported from the [`monitoring` repository](https://github.com/knative-extensions/monitoring).
1. If you are using the Grafana Helm chart with the dashboard sidecar enabled (the default), you can load the dashboards by applying the following ConfigMaps.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/config/configmap-serving-dashboard.yaml
```


@ -1,96 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: metrics
data:
collector.yaml: |
receivers:
opencensus:
endpoint: "0.0.0.0:55678"
exporters:
logging:
prometheus:
endpoint: "0.0.0.0:8889"
extensions:
health_check:
pprof:
zpages:
service:
extensions: [health_check, pprof, zpages]
pipelines:
metrics:
receivers: [opencensus]
processors: []
exporters: [prometheus]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: metrics
labels:
app: otel-collector
spec:
selector:
matchLabels:
app: otel-collector
replicas: 1 # This can be increased for a larger system.
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
args:
- --config=/conf/collector.yaml
image: otel/opentelemetry-collector:latest
resources:
requests: # Note: these are suitable for a small instance, but may need to be increased for a large instance.
memory: 100Mi
cpu: 50m
ports:
- name: otel
containerPort: 55678
- name: prom-export
containerPort: 8889
- name: zpages # A /debug page
containerPort: 55679
volumeMounts:
- mountPath: /conf
name: config
volumes:
- name: config
configMap:
name: otel-collector-config
items:
- key: collector.yaml
path: collector.yaml
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: metrics
spec:
selector:
app: "otel-collector"
ports:
- port: 55678
name: otel
---
apiVersion: v1
kind: Service
metadata:
name: otel-export
namespace: metrics
labels:
app: otel-export
spec:
selector:
app: otel-collector
ports:
- port: 8889
name: prom-export


@ -5,105 +5,164 @@ components:
function: reference
---
# Knative Serving metrics
# Knative Serving Metrics
Administrators can monitor the Serving control plane based on the metrics exposed by each Serving component. The available metrics are listed below.
!!! note
These metrics may change as we complete our migration from OpenCensus to OpenTelemetry.
## Queue Proxy
The queue proxy is the per-pod sidecar that enforces container concurrency and provides metrics to the autoscaler. The following metrics give you insight into queued requests and user-container behavior.
### `kn.queueproxy.depth`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {item}
**Description:** Number of current items in the queue proxy queue
### `kn.queueproxy.app.duration`
**Instrument Type:** Float64Histogram
**Unit (UCUM):** s
**Description:** The duration of the task execution
## Activator
The following metrics can help you understand how an application responds when traffic passes through the activator. For example, when scaling from zero, high request latency might mean that requests are taking too long to be fulfilled.
| Metric Name | Description | Type | Tags | Unit | Status |
|:-|:-|:-|:-|:-|:-|
| ```request_concurrency``` | Concurrent requests that are routed to Activator<br>These are requests reported by the concurrency reporter which may not be done yet.<br> This is the average concurrency over a reporting period | Gauge | ```configuration_name```<br>```container_name```<br>```namespace_name```<br>```pod_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```request_count``` | The number of requests that are routed to Activator.<br>These are requests that have been fulfilled from the activator handler. | Counter | ```configuration_name```<br>```container_name```<br>```namespace_name```<br>```pod_name```<br>```response_code```<br>```response_code_class```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```request_latencies``` | The response time in millisecond for the fulfilled routed requests | Histogram | ```configuration_name```<br>```container_name```<br>```namespace_name```<br>```pod_name```<br>```response_code```<br>```response_code_class```<br>```revision_name```<br>```service_name``` | Milliseconds | Stable |
### `kn.revision.request.concurrency`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {request}
**Description:** Concurrent requests that are routed to the Activator
The following attributes are included with the metrics below
Name | Type | Description
-|-|-
`k8s.namespace.name` | string | Namespace of the Revision
`kn.service.name` | string | Knative Service name associated with this Revision
`kn.configuration.name` | string | Knative Configuration name associated with this Revision
`kn.revision.name` | string | The name of the Revision
## Autoscaler
The Autoscaler component exposes a number of metrics related to its decisions per revision. For example, at any given time, you can monitor the desired pods the Autoscaler wants to allocate for a Service, the average number of requests per second during the stable window, or whether the autoscaler is in panic mode (KPA).
| Metric Name | Description | Type | Tags | Unit | Status |
|:-|:-|:-|:-|:-|:-|
| ```desired_pods``` | Number of pods autoscaler wants to allocate | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```excess_burst_capacity``` | Excess burst capacity overserved over the stable window | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```stable_request_concurrency``` | Average of requests count per observed pod over the stable window | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```panic_request_concurrency``` | Average of requests count per observed pod over the panic window | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```target_concurrency_per_pod``` | The desired number of concurrent requests for each pod | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```stable_requests_per_second``` | Average requests-per-second per observed pod over the stable window | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```panic_requests_per_second``` | Average requests-per-second per observed pod over the panic window | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```target_requests_per_second``` | The desired requests-per-second for each pod | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```panic_mode``` | 1 if autoscaler is in panic mode, 0 otherwise | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```requested_pods``` | Number of pods autoscaler requested from Kubernetes | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```actual_pods``` | Number of pods that are allocated currently in ready state | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```not_ready_pods``` | Number of pods that are not ready currently | Gauge | ```configuration_name=```<br>```namespace_name=```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```pending_pods``` | Number of pods that are pending currently | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name``` | Dimensionless | Stable |
| ```terminating_pods``` | Number of pods that are terminating currently | Gauge | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name<br>``` | Dimensionless | Stable |
| ```scrape_time``` | Time autoscaler takes to scrape metrics from the service pods in milliseconds | Histogram | ```configuration_name```<br>```namespace_name```<br>```revision_name```<br>```service_name```<br> | Milliseconds | Stable |
### `kn.autoscaler.scrape.duration`
## Controller
**Instrument Type:** Float64Histogram
The following metrics are emitted by any component that implements a controller logic.
The metrics show details about the reconciliation operations and the workqueue behavior on which
reconciliation requests are enqueued.
**Unit (UCUM):** s
| Metric Name | Description | Type | Tags | Unit | Status |
|:-|:-|:-|:-|:-|:-|
| ```work_queue_depth``` | Depth of the work queue | Gauge | ```reconciler``` | Dimensionless | Stable |
| ```reconcile_count``` | Number of reconcile operations | Counter | ```reconciler```<br>```success```<br> | Dimensionless | Stable |
| ```reconcile_latency``` | Latency of reconcile operations | Histogram | ```reconciler```<br>```success```<br> | Milliseconds | Stable |
| ```workqueue_adds_total``` | Total number of adds handled by workqueue | Counter | ```name``` | Dimensionless | Stable |
| ```workqueue_depth``` | Current depth of workqueue | Gauge | ```reconciler``` | Dimensionless | Stable |
| ```workqueue_queue_latency_seconds``` | How long in seconds an item stays in workqueue before being requested | Histogram | ```name``` | Seconds | Stable |
| ```workqueue_retries_total``` | Total number of retries handled by workqueue | Counter | ```name``` | Dimensionless | Stable |
| ```workqueue_work_duration_seconds``` | How long in seconds processing an item from a workqueue takes. | Histogram | ```name``` | Seconds| Stable |
| ```workqueue_unfinished_work_seconds``` | How long in seconds the outstanding workqueue items have been in flight (total). | Histogram | ```name``` | Seconds | Stable |
| ```workqueue_longest_running_processor_seconds``` | How long in seconds the longest outstanding workqueue item has been in flight | Histogram | ```name``` | Seconds | Stable |
**Description:** The duration of scraping the revision
## Webhook
### `kn.revision.pods.desired`
Webhook metrics report useful info about operations. For example, if a large number of operations fail, this could indicate an issue with a user-created resource.
**Instrument Type:** Int64Gauge
| Metric Name | Description | Type | Tags | Unit | Status |
|:-|:-|:-|:-|:-|:-|
| ```request_count``` | The number of requests that are routed to webhook | Counter | ```admission_allowed```<br>```kind_group```<br>```kind_kind```<br>```kind_version```<br>```request_operation```<br>```resource_group```<br>```resource_namespace```<br>```resource_resource```<br>```resource_version``` | Dimensionless | Stable |
| ```request_latencies``` | The response time in milliseconds | Histogram | ```admission_allowed```<br>```kind_group```<br>```kind_kind```<br>```kind_version```<br>```request_operation```<br>```resource_group```<br>```resource_namespace```<br>```resource_resource```<br>```resource_version``` | Milliseconds | Stable |
**Unit (UCUM):** {item}
## Go Runtime - memstats
**Description:** Number of pods the autoscaler wants to allocate
Each Knative Serving control plane process emits a number of Go runtime [memory statistics](https://golang.org/pkg/runtime/#MemStats) (shown next).
As a baseline for monitoring purposes, user could start with a subset of the metrics: current allocations (go_alloc), total allocations (go_total_alloc), system memory (go_sys), mallocs (go_mallocs), frees (go_frees) and garbage collection total pause time (total_gc_pause_ns), next gc target heap size (go_next_gc) and number of garbage collection cycles (num_gc).
### `kn.revision.capacity.excess`
| Metric Name | Description | Type | Tags | Unit | Status |
|:-|:-|:-|:-|:-|:-|
| ```go_alloc``` | The number of bytes of allocated heap objects (same as heap_alloc) | Gauge | ```name``` | Dimensionless | Stable |
| ```go_total_alloc``` | The cumulative bytes allocated for heap objects | Gauge | ```name``` | Dimensionless | Stable |
| ```go_sys``` | The total bytes of memory obtained from the OS | Gauge | ```name``` | Dimensionless | Stable |
| ```go_lookups``` | The number of pointer lookups performed by the runtime | Gauge | ```name``` | Dimensionless | Stable |
| ```go_mallocs``` | The cumulative count of heap objects allocated | Gauge | ```name``` | Dimensionless | Stable |
| ```go_frees``` | The cumulative count of heap objects freed | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_alloc``` | The number of bytes of allocated heap objects | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_sys``` | The number of bytes of heap memory obtained from the OS | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_idle``` | The number of bytes in idle (unused) spans | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_in_use``` | The number of bytes in in-use spans | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_released``` | The number of bytes of physical memory returned to the OS | Gauge | ```name``` | Dimensionless | Stable |
| ```go_heap_objects``` | The number of allocated heap objects | Gauge | ```name``` | Dimensionless | Stable |
| ```go_stack_in_use``` | The number of bytes in stack spans | Gauge | ```name``` | Dimensionless | Stable |
| ```go_stack_sys``` | The number of bytes of stack memory obtained from the OS | Gauge | ```name``` | Dimensionless | Stable |
| ```go_mspan_in_use``` | The number of bytes of allocated mspan structures | Gauge | ```name``` | Dimensionless | Stable |
| ```go_mspan_sys``` | The number of bytes of memory obtained from the OS for mspan structures | Gauge | ```name``` | Dimensionless | Stable |
| ```go_mcache_in_use``` | The number of bytes of allocated mcache structures | Gauge | ```name``` | Dimensionless | Stable |
| ```go_mcache_sys``` | The number of bytes of memory obtained from the OS for mcache structures | Gauge | ```name``` | Dimensionless | Stable |
| ```go_bucket_hash_sys``` | The number of bytes of memory in profiling bucket hash tables. | Gauge | ```name``` | Dimensionless | Stable |
| ```go_gc_sys``` | The number of bytes of memory in garbage collection metadata | Gauge | ```name``` | Dimensionless | Stable |
| ```go_other_sys``` | The number of bytes of memory in miscellaneous off-heap runtime allocations | Gauge | ```name``` | Dimensionless | Stable |
| ```go_next_gc``` | The target heap size of the next GC cycle | Gauge | ```name``` | Dimensionless | Stable |
| ```go_last_gc``` | The time the last garbage collection finished, as nanoseconds since 1970 (the UNIX epoch) | Gauge | ```name``` | Nanoseconds | Stable |
| ```go_total_gc_pause_ns``` | The cumulative nanoseconds in GC stop-the-world pauses since the program started | Gauge | ```name``` | Nanoseconds | Stable |
| ```go_num_gc``` | The number of completed GC cycles. | Gauge | ```name``` | Dimensionless | Stable |
| ```go_num_forced_gc``` | The number of GC cycles that were forced by the application calling the GC function. | Gauge | ```name``` | Dimensionless | Stable |
| ```go_gc_cpu_fraction``` | The fraction of this program's available CPU time used by the GC since the program started | Gauge | ```name``` | Dimensionless | Stable |
**Instrument Type:** Float64Gauge
!!! note
The name tag is empty.
**Unit (UCUM):** {concurrency}
**Description:** Excess burst capacity observed over the stable window
### `kn.revision.concurrency.stable`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {concurrency}
**Description:** Average of request count per observed pod over the stable window
### `kn.revision.concurrency.panic`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {concurrency}
**Description:** Average of request count per observed pod over the panic window
### `kn.revision.concurrency.target`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {concurrency}
**Description:** The desired concurrent requests for each pod
### `kn.revision.rps.stable`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {request}/s
**Description:** Average of requests-per-second per observed pod over the stable window
### `kn.revision.rps.panic`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** {request}/s
**Description:** Average of requests-per-second per observed pod over the panic window
### `kn.revision.pods.requested`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {pod}
**Description:** Number of pods autoscaler requested from Kubernetes
### `kn.revision.pods.count`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {pod}
**Description:** Number of pods that are allocated currently
### `kn.revision.pods.not_ready.count`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {pod}
**Description:** Number of pods that are not ready currently
### `kn.revision.pods.pending.count`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {pod}
**Description:** Number of pods that are pending currently
### `kn.revision.pods.terminating.count`
**Instrument Type:** Int64Gauge
**Unit (UCUM):** {pod}
**Description:** Number of pods that are terminating currently
--8<-- "observability-shared-metrics.md"

[Image diff suppressed: removed SVG diagram, 239 KiB]


@ -1,16 +1,17 @@
# Collecting Metrics in Knative
Knative supports different popular tools for collecting metrics:
Knative leverages [OpenTelemetry](https://opentelemetry.io/docs/what-is-opentelemetry/) for exporting metrics.
- [Prometheus](https://prometheus.io/)
- [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
[Grafana](https://grafana.com/oss/) dashboards are available for metrics collected directly with Prometheus.
We currently support the following export protocols:
- [OTel (OTLP) over HTTP or gRPC](https://opentelemetry.io/docs/languages/go/exporters/#prometheus-experimental)
- [Prometheus](https://opentelemetry.io/docs/languages/go/exporters/#prometheus-experimental)
You can also set up the OpenTelemetry Collector to receive metrics from Knative components and distribute them to other metrics providers that support OpenTelemetry.
!!! warning
You can't use OpenTelemetry Collector and Prometheus at the same time. The default metrics backend is Prometheus. You will need to remove `metrics.backend-destination` and `metrics.request-metrics-backend-destination` keys from the config-observability Configmap to enable Prometheus metrics.
!!! note
The following monitoring setup is for illustrative purposes. Support is best-effort and changes
are welcome in the [Knative Monitoring repository](https://github.com/knative-extensions/monitoring)
By default, metrics exporting is turned off.
## About the Prometheus Stack
@ -27,39 +28,22 @@ You can also set up the OpenTelemetry Collector to receive metrics from Knative
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n default -f values.yaml
# values.yaml contains at minimum the configuration below
```
helm install knative prometheus-community/kube-prometheus-stack \
--create-namespace \
--namespace observability \
-f https://raw.githubusercontent.com/knative-extensions/monitoring/main/promstack-values.yaml
!!! caution
You will need to ensure that the Helm chart has the following values configured; otherwise, the ServiceMonitors/PodMonitors will not work.
```yaml
kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[*]
    - deployments=[app.kubernetes.io/name,app.kubernetes.io/component,app.kubernetes.io/instance]
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```
1. Apply the ServiceMonitors/PodMonitors to collect metrics from Knative.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/servicemonitor.yaml
```
### Access the Prometheus instance locally
By default, the Prometheus instance is only exposed on a private service named `prometheus-kube-prometheus-prometheus`.
By default, the Prometheus instance is only exposed on a private service named `prometheus-operated`.
To access the console in your web browser:
1. Enter the command:
```bash
kubectl port-forward -n default svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl port-forward -n observability svc/prometheus-operated 9090:9090
```
1. Access the console in your browser via `http://localhost:9090`.
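While the port-forward is running, an illustrative sanity check from another terminal is to ask the Prometheus HTTP API whether any Knative scrape targets are up (the query only returns data once the monitors above are being scraped):
```bash
# Query the Prometheus API for scrape target health and filter for Knative jobs
curl -s 'http://localhost:9090/api/v1/query?query=up' | grep -i knative
```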
@ -73,7 +57,7 @@ To access the dashboards in your web browser:
1. Enter the command:
```bash
kubectl port-forward -n default svc/prometheus-grafana 3000:80
kubectl port-forward -n observability svc/knative-grafana 3000:80
```
1. Access the dashboards in your browser via `http://localhost:3000`.
@ -85,90 +69,3 @@ To access the dashboards in your web browser:
password: prom-operator
```
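If the defaults above were changed at install time, the credentials can be read back from the Grafana secret; the secret name follows the Helm release, so `knative-grafana` here is an assumption based on the `knative` release used earlier:
```bash
# Read the Grafana admin password from the chart-managed secret (name assumed from the 'knative' release)
kubectl get secret knative-grafana -n observability \
  -o jsonpath='{.data.admin-password}' | base64 --decode
```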
### Import Grafana dashboards
1. Grafana dashboards can be imported from the [`monitoring` repository](https://github.com/knative-extensions/monitoring/tree/main/grafana).
1. If you are using the Grafana Helm Chart with the Dashboard Sidecar enabled, you can load the dashboards by applying the following configmaps.
```bash
kubectl apply -f https://raw.githubusercontent.com/knative-extensions/monitoring/main/grafana/dashboards.yaml
```
!!! caution
You will need to ensure that the helm chart has following values configured, otherwise the dashboards loading will not work.
```yaml
grafana:
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
```
If you have an existing configmap and the dashboards loading doesn't work, add the `labelValue: true` attribute to the helm chart after the `searchNamespace: ALL` declaration.
## About OpenTelemetry
OpenTelemetry is a CNCF observability framework for cloud-native software, which provides a collection of tools, APIs, and SDKs.
You can use OpenTelemetry to instrument, generate, collect, and export telemetry data. This data includes metrics, logs, and traces, that you can analyze to understand the performance and behavior of Knative components.
OpenTelemetry allows you to easily export metrics to multiple monitoring services without needing to rebuild or reconfigure the Knative binaries.
## Understanding the collector
The collector provides a location where various Knative components can push metrics to be retained and collected by a monitoring service.
In the following example, you can configure a single collector instance using a ConfigMap and a Deployment.
!!! tip
For more complex deployments, you can automate some of these steps by using the [OpenTelemetry Operator](https://github.com/open-telemetry/opentelemetry-operator).
!!! caution
The Grafana dashboards at https://github.com/knative-extensions/monitoring/tree/main/grafana don't work with metrics scraped from OpenTelemetry Collector.
![Diagram of components reporting to collector, which is scraped by Prometheus](system-diagram.svg)
<!-- yuml.me UML rendering of:
[queue-proxy1]->[Collector]
[queue-proxy2]->[Collector]
[autoscaler]->[Collector]
[controller]->[Collector]
[Collector]<-scrape[Prometheus]
-->
## Set up the collector
1. Create a namespace for the collector to run in, by entering the following command:
```bash
kubectl create namespace metrics
```
The next step uses the `metrics` namespace for creating the collector.
1. Create a Deployment, Service, and ConfigMap for the collector by entering the following command:
```bash
kubectl apply -f https://raw.githubusercontent.com/knative/docs/main/docs/serving/observability/metrics/collector.yaml
```
1. Update the `config-observability` ConfigMaps in the Knative Serving and
Eventing namespaces, by entering the follow command:
```bash
kubectl patch --namespace knative-serving configmap/config-observability \
--type merge \
--patch '{"data":{"metrics.backend-destination":"opencensus","metrics.request-metrics-backend-destination":"opencensus","metrics.opencensus-address":"otel-collector.metrics:55678"}}'
kubectl patch --namespace knative-eventing configmap/config-observability \
--type merge \
--patch '{"data":{"metrics.backend-destination":"opencensus","metrics.opencensus-address":"otel-collector.metrics:55678"}}'
```
## Verify the collector setup
1. You can check that metrics are being forwarded by loading the Prometheus export port on the collector, by entering the following command:
```bash
kubectl port-forward --namespace metrics deployment/otel-collector 8889
```
1. Fetch `http://localhost:8889/metrics` to see the exported metrics.


@ -0,0 +1,115 @@
## Webhook Metrics
Webhook metrics report useful info about operations. For example, if a large number of operations fail, this could indicate an issue with a user-created resource.
### `http.server.request.duration`
Knative implements the [semantic conventions for HTTP Servers](https://opentelemetry.io/docs/specs/semconv/http/http-metrics/#http-server) using the OpenTelemetry [otel-go/otelhttp](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp) package.
Please refer to the [OpenTelemetry docs](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp) for details about the HTTP Server metrics it exports.
The following attributes are included with the metric
Name | Type | Description | Examples
-|-|-|-
`kn.webhook.type` | string | Specifies the type of webhook invoked | `admission`, `defaulting`, `validation`, `conversion` |
`kn.webhook.resource.group` | string | Specifies the resource Kubernetes group name |
`kn.webhook.resource.version` | string | Specifies the resource Kubernetes group version|
`kn.webhook.resource.kind` | string | Specifies the resource Kubernetes group kind |
`kn.webhook.subresource` | string | Specifies the subresource | "" (empty), `status`, `scale` |
`kn.webhook.operation.type` | string | Specifies the operation that invoked the webhook | `CREATE`, `UPDATE`, `DELETE` |
`kn.webhook.operation.status` | string | Specifies whether the operation was successful | `success`, `failed` |
### `kn.webhook.handler.duration`
**Instrument Type:** Histogram
**Unit (UCUM):** s
**Description:** The duration of task execution.
The following attributes are included with the metric
Name | Type | Description | Examples
-|-|-|-
`kn.webhook.type` | string | Specifies the type of webhook invoked | `admission`, `defaulting`, `validation`, `conversion` |
`kn.webhook.resource.group` | string | Specifies the resource Kubernetes group name |
`kn.webhook.resource.version` | string | Specifies the resource Kubernetes group version|
`kn.webhook.resource.kind` | string | Specifies the resource Kubernetes group kind |
`kn.webhook.subresource` | string | Specifies the subresource | "" (empty), `status`, `scale` |
`kn.webhook.operation.type` | string | Specifies the operation that invoked the webhook | `CREATE`, `UPDATE`, `DELETE` |
`kn.webhook.operation.status` | string | Specifies whether the operation was successful | `success`, `failed` |
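As an illustration only, you could look for failed webhook operations with a query like the one below; the metric and label names assume the default OTLP-to-Prometheus name translation (dots become underscores and the unit is appended), so verify the exact names in your Prometheus instance:
```bash
# Failed webhook invocations over the last 5 minutes, grouped by webhook type
# (metric/label names are assumptions based on the default OTLP -> Prometheus translation)
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (kn_webhook_type) (increase(kn_webhook_handler_duration_seconds_count{kn_webhook_operation_status="failed"}[5m]))'
```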
## Workqueue Metrics
Knative controllers expose [client-go workqueue metrics](https://pkg.go.dev/k8s.io/client-go/util/workqueue#MetricsProvider).
The following attributes are included with the metrics below:
Name | Type | Description |
-|-|-
`name` | string | Name of the work queue
### `kn.workqueue.depth`
**Instrument Type:** Int64UpDownCounter
**Unit (UCUM):** {item}
**Description:** Number of current items in the queue
### `kn.workqueue.adds`
**Instrument Type:** Int64Counter
**Unit (UCUM):** {item}
**Description:** Number of items added to the queue
### `kn.workqueue.queue.duration`
**Instrument Type:** Float64Histogram
**Unit (UCUM):** s
**Description:** How long an item stays in the workqueue before being requested
### `kn.workqueue.process.duration`
**Instrument Type:** Float64Histogram
**Unit (UCUM):** s
**Description:** How long in seconds processing an item from workqueue takes
### `kn.workqueue.unfinished_work`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** s
**Description:** How many seconds of in-progress work the reconciler has done that has not yet been observed by the work duration metric. Large values indicate stuck threads; you can estimate the number of stuck threads from the rate at which this value increases.
### `kn.workqueue.longest_running_processor`
**Instrument Type:** Float64Gauge
**Unit (UCUM):** s
**Description:** How long the longest worker thread has been running
### `kn.workqueue.retries`
**Instrument Type:** Int64Counter
**Unit (UCUM):** {item}
**Description:** Number of items re-added to the queue
## Go Runtime
Knative implements the [semantic conventions for Go runtime metrics](https://opentelemetry.io/docs/specs/semconv/runtime/go-metrics/) using the OpenTelemetry [otel-go/instrumentation/runtime](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/runtime) package.
Please refer to the [OpenTelemetry docs](https://opentelemetry.io/docs/specs/semconv/runtime/go-metrics/) for details about the Go runtime metrics it exports.