[docs] Clean up internal observability docs (#10454)

#### Description
Now that
[4246](https://github.com/open-telemetry/opentelemetry.io/pull/4246),
[4322](https://github.com/open-telemetry/opentelemetry.io/pull/4322),
and [4529](https://github.com/open-telemetry/opentelemetry.io/pull/4529)
have been merged, and the new [Internal
telemetry](https://opentelemetry.io/docs/collector/internal-telemetry/)
and
[Troubleshooting](https://opentelemetry.io/docs/collector/troubleshooting/)
pages are live, it's time to clean up the underlying Collector repo docs
so that the website is the single source of truth.

I've deleted any content that was moved to the website, and linked to
the relevant sections where possible. I've consolidated what content
remains in the observability.md file and left troubleshooting.md and
monitoring.md as stubs that point to the website.

I also searched the Collector repo for cross-references to these files
and adjusted links where appropriate.

~~Note that this PR is blocked by
[4731](https://github.com/open-telemetry/opentelemetry.io/pull/4731).~~
EDIT: #4731 is merged and no longer a blocker.

#### Link to tracking issue
Fixes #8886
Tiffany Hrabusa 2024-06-28 01:25:13 -07:00 committed by GitHub
parent fead8fc530
commit 2a19d55c4c
6 changed files with 117 additions and 509 deletions

View File

@@ -33,7 +33,7 @@
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
<a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
<a href="docs/monitoring.md">Monitoring</a>
<a href="https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
<a href="docs/security-best-practices.md">Security</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;

View File

@@ -1,70 +1,7 @@
# Monitoring
The Collector provides many metrics for monitoring itself. Some key
recommendations for alerting and monitoring are listed below.
To learn how to monitor the Collector using its own telemetry, see the [Internal
telemetry] page.
## Critical Monitoring
### Data Loss
Use the rates of `otelcol_processor_dropped_spans > 0` and
`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on
your requirements, set up a minimal time window before alerting so that small
losses that are not considered outages, or that are within the desired
reliability level, do not trigger notifications.
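As a concrete illustration, here is a minimal Prometheus alerting-rule sketch for this recommendation. It assumes the Collector's internal metrics are already scraped into Prometheus; the alert name, the 5-minute window, and the severity label are placeholders to adapt, not project recommendations.
```yaml
groups:
  - name: otelcol-data-loss
    rules:
      - alert: OtelcolDroppingData
        # Sustained non-zero drop rate; tune the window to your reliability
        # target so short blips don't trigger notifications.
        expr: >
          rate(otelcol_processor_dropped_spans[5m]) > 0
          or rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry Collector is dropping data"
```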
### Low on CPU Resources
This depends on the CPU metrics available for the deployment, e.g.
`kube_pod_container_resource_limits{resource="cpu", unit="core"}` on Kubernetes. Let's call it
`available_cores` below. The idea here is to have an upper bound on the number
of available cores and on the maximum ingestion rate per core that is
considered safe, let's call it `safe_rate`. This should trigger an increase of
resources or instances (or raise an alert, as appropriate) whenever
`(actual_rate/available_cores) > safe_rate`.
The `safe_rate` depends on the specific configuration being used.
// TODO: Provide reference `safe_rate` for a few selected configurations.
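A hedged sketch of how this could be expressed as a Prometheus alert, assuming kube-state-metrics is also scraped for the CPU limits (in practice you would filter the limits metric to the Collector's own pods); the `10000` spans/sec per core `safe_rate` is a made-up placeholder that must be replaced with a value measured for your own configuration.
```yaml
groups:
  - name: otelcol-capacity
    rules:
      - alert: OtelcolIngestRateAboveSafeRate
        # actual_rate / available_cores > safe_rate (placeholder: 10000 spans/sec/core).
        expr: >
          sum(rate(otelcol_receiver_accepted_spans[5m]))
          / sum(kube_pod_container_resource_limits{resource="cpu", unit="core"})
          > 10000
        for: 10m
        labels:
          severity: warning
```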
## Secondary Monitoring
### Queue Length
Most exporters offer a [queue/retry mechanism](../exporter/exporterhelper/README.md)
that is recommended as the retry mechanism for the Collector and as such should
be used in any production deployment.
The `otelcol_exporter_queue_capacity` metric indicates the capacity, in batches, of the retry queue. The `otelcol_exporter_queue_size` metric indicates the current size of the retry queue. Use these two metrics to check whether the queue capacity is sufficient for your workload.
The `otelcol_exporter_enqueue_failed_spans`, `otelcol_exporter_enqueue_failed_metric_points`, and `otelcol_exporter_enqueue_failed_log_records` metrics indicate the number of spans, metric points, and log records that failed to be added to the sending queue. This may be caused by a queue full of unsettled elements; you may need to decrease your sending rate or scale Collectors horizontally.
The queue/retry mechanism also supports logging for monitoring. Check
the logs for messages like `"Dropping data because sending_queue is full"`.
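For example, a sketch of an alert on queue utilization, assuming both queue metrics are available in Prometheus; the 80% threshold and 5-minute window are arbitrary starting points.
```yaml
groups:
  - name: otelcol-queue
    rules:
      - alert: OtelcolExporterQueueNearFull
        # Current queue size above 80% of capacity for 5 minutes.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
```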
### Receive Failures
Sustained rates of `otelcol_receiver_refused_spans` and
`otelcol_receiver_refused_metric_points` indicate too many errors returned to
clients. Depending on the deployment and the clients' resilience, this may
indicate data loss at the clients.
Sustained rates of `otelcol_exporter_send_failed_spans` and
`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not
able to export data as expected.
This doesn't imply data loss per se, since there could be retries, but a high
rate of failures could indicate issues with the network or with the backend
receiving the data.
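A similar hedged sketch covering both symptoms; the windows and severities are placeholders to adapt.
```yaml
groups:
  - name: otelcol-failures
    rules:
      - alert: OtelcolReceiverRefusingData
        # Sustained refusals returned to clients.
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
      - alert: OtelcolExporterSendFailures
        # Sustained export failures toward the backend.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
```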
## Data Flow
### Data Ingress
The `otelcol_receiver_accepted_spans` and
`otelcol_receiver_accepted_metric_points` metrics provide information about
the data ingested by the Collector.
### Data Egress
The `otelcol_exporter_sent_spans` and
`otelcol_exporter_sent_metric_points` metrics provide information about
the data exported by the Collector.
[Internal telemetry]:
https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector

View File

@@ -1,140 +1,134 @@
# OpenTelemetry Collector Observability
# OpenTelemetry Collector internal observability
## Goal
The [Internal telemetry] page on OpenTelemetry's website contains the
documentation for the Collector's internal observability, including:
The goal of this document is to provide a comprehensive description of the Collector's observability and of the changes needed to achieve the observability part of our [vision](vision.md).
- Which types of observability signals the Collector emits.
- How to enable and configure these signals.
- How to use this telemetry to monitor your Collector instance.
## What Needs Observation
If you need to troubleshoot the Collector, see [Troubleshooting].
The following elements of the Collector need to be observable.
Read on to learn about experimental features and the project's overall vision
for internal telemetry.
### Current Values
## Experimental trace telemetry
- Resource consumption: CPU, RAM (in the future also IO - if we implement persistent queues) and any other metrics that may be available to Go apps (e.g. garbage size, etc).
- Receiving data rate, broken down by receivers and by data type (traces/metrics).
- Exporting data rate, broken down by exporters and by data type (traces/metrics).
- Data drop rate due to throttling, broken down by data type.
- Data drop rate due to invalid data received, broken down by data type.
- Current throttling state: Not Throttled/Throttled by Downstream/Internally Saturated.
- Incoming connection count, broken down by receiver.
- Incoming connection rate (new connections per second), broken down by receiver.
- In-memory queue size (in bytes and in units). Note: measurements in bytes may be difficult / expensive to obtain and should be used cautiously.
- Persistent queue size (when supported).
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols).
“Rate” values must reflect the average rate of the last 10 seconds. Rates must be exposed in bytes/sec and units/sec (e.g. spans/sec).
Note: some of the current values and rates may be calculated as derivatives of cumulative values in the backend, so it is an open question whether we want to expose them separately or not.
### Cumulative Values
- Total received data, broken down by receivers and by data type (traces/metrics).
- Total exported data, broken down by exporters and by data type (traces/metrics).
- Total dropped data due to throttling, broken down by data type.
- Total dropped data due to invalid data received, broken down by data type.
- Total incoming connection count, broken down by receiver.
- Uptime since start.
### Trace or Log on Events
We want to generate the following events (log and/or send as a trace with additional data):
- Collector started/stopped.
- Collector reconfigured (if we support on-the-fly reconfiguration).
- Begin dropping due to throttling (include throttling reason, e.g. local saturation, downstream saturation, downstream unavailable, etc).
- Stop dropping due to throttling.
- Begin dropping due to invalid data (include sample/first invalid data).
- Stop dropping due to invalid data.
- Crash detected (differentiate clean stopping and crash, possibly include crash data if available).
For begin/stop events we need to define an appropriate hysteresis to avoid generating too many events. Note that begin/stop events cannot be detected in the backend simply as derivatives of current rates; the events include additional data that is not present in the current value.
### Host Metrics
The service should collect host resource metrics in addition to the service's own process metrics. This may help determine whether a problem observed in the service is caused by a different process on the same host.
## How We Expose Telemetry
By default, the Collector currently exposes its own telemetry in two ways:
- internal metrics are exposed via a Prometheus interface, which defaults to port `8888`
- logs are emitted to stdout
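For orientation, these defaults map roughly to the following `service::telemetry` settings. This is a sketch of the approximate implicit defaults at the time of writing, not something you need to add, and the exact defaults can differ between Collector versions.
```yaml
service:
  telemetry:
    logs:
      level: info        # console logging
    metrics:
      level: basic
      address: ":8888"   # Prometheus endpoint at /metrics
```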
Traces are not exposed by default. There is an effort underway to [change this][issue7532]. The work includes supporting
configuration of the OpenTelemetry SDK used to produce the Collector's internal telemetry. This feature is
currently behind two feature gates:
The Collector does not expose traces by default, but an effort is underway to
[change this][issue7532]. The work includes supporting configuration of the
OpenTelemetry SDK used to produce the Collector's internal telemetry. This
feature is behind two feature gates:
```bash
--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry
```
The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector to parse configuration
that aligns with the [OpenTelemetry Configuration] schema. The support for this schema is still
experimental, but it does allow telemetry to be exported via OTLP.
The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector
to parse any configuration that aligns with the [OpenTelemetry Configuration]
schema. Support for this schema is experimental, but it does allow telemetry to
be exported using OTLP.
The following configuration can be used in combination with the feature gates aforementioned
to emit internal metrics and traces from the Collector to an OTLP backend:
The following configuration can be used in combination with the aforementioned
feature gates to emit internal metrics and traces from the Collector to an OTLP
backend:
```yaml
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 5000
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend:4317
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend2:4317
```
See the configuration's [example][kitchen-sink] for additional configuration options.
See the [example configuration][kitchen-sink] for additional options.
Note that this configuration does not support emitting logs as there is no support for [logs] in
OpenTelemetry Go SDK at this time.
> This configuration does not support emitting logs as there is no support for
> [logs] in the OpenTelemetry Go SDK at this time.
You can also configure the Collector to send its own traces using the OTLP
exporter. Send the traces to an OTLP server running on the same Collector, so
they go through the configured pipelines. For example:
```yaml
service:
  telemetry:
    traces:
      processors:
        batch:
          exporter:
            otlp:
              protocol: grpc/protobuf
              endpoint: ${MY_POD_IP}:4317
```
## Goals of internal telemetry
The Collector's internal telemetry is an important part of fulfilling
OpenTelemetry's [project vision](vision.md). The following section explains the
priorities for making the Collector an observable service.
### Observable elements
The following aspects of the Collector need to be observable.
- [Current values]
  - Some of the current values and rates might be calculated as derivatives of
    cumulative values in the backend, so it's an open question whether to
    expose them separately or not.
- [Cumulative values]
- [Trace or log events]
  - For start or stop events, an appropriate hysteresis must be defined to
    avoid generating too many events. Note that start and stop events can't be
    detected in the backend simply as derivatives of current rates. The events
    include additional data that is not present in the current value.
- [Host metrics]
  - Host metrics can help users determine if the observed problem in a service
    is caused by a different process on the same host (see the sketch after
    this list).
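As a sketch of what host-level collection could look like, here is a minimal `hostmetrics` receiver configuration. Note that this receiver ships in the opentelemetry-collector-contrib distribution; the scraper list, interval, and `debug` exporter are illustrative stand-ins only.
```yaml
receivers:
  hostmetrics:
    collection_interval: 30s  # illustrative interval
    scrapers:
      cpu:
      memory:
      load:
exporters:
  debug:
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: []
      exporters: [debug]
```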
### Impact
We need to be able to assess the impact of these observability improvements on the core performance of the Collector.
The impact of these observability improvements on the core performance of the
Collector must be assessed.
### Configurable Level of Observability
### Configurable level of observability
Some of the metrics/traces can be high volume and may not be desirable to always observe. We should consider adding an observability verboseness “level” that allows configuring the Collector to send more or less observability data (or even finer granularity to allow turning on/off specific metrics).
Some metrics and traces can be high volume and users might not always want to
observe them. An observability verbosity “level” allows the Collector to be
configured to send more or less observability data, with finer granularity
available to turn specific metrics on or off.
The default level of observability must be defined in a way that has insignificant performance impact on the service.
The default level of observability must be defined in a way that has
insignificant performance impact on the service.
[Internal telemetry]: https://opentelemetry.io/docs/collector/internal-telemetry/
[Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/
[issue7532]: https://github.com/open-telemetry/opentelemetry-collector/issues/7532
[issue7454]: https://github.com/open-telemetry/opentelemetry-collector/issues/7454
[logs]: https://github.com/open-telemetry/opentelemetry-go/issues/3827
[OpenTelemetry Configuration]: https://github.com/open-telemetry/opentelemetry-configuration
[kitchen-sink]: https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml
[Current values]: https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
[Cumulative values]: https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
[Trace or log events]: https://opentelemetry.io/docs/collector/internal-telemetry/#events-observable-with-internal-logs
[Host metrics]: https://opentelemetry.io/docs/collector/internal-telemetry/#lists-of-internal-metrics

View File

@@ -1,328 +1,5 @@
# Troubleshooting
## Observability
To troubleshoot the Collector, see the [Troubleshooting] page.
The Collector offers multiple ways to measure its health as well as to
investigate issues.
### Logs
Logs can be helpful in identifying issues. Always start by checking the log
output and looking for potential issues.
The verbosity level defaults to `INFO` and can be adjusted.
Set the log level in the `service::telemetry::logs` section of the configuration:
```yaml
service:
  telemetry:
    logs:
      level: "debug"
```
### Metrics
Prometheus metrics are exposed locally on port `8888` and path `/metrics`. For
containerized environments it may be desirable to expose this port on a
public interface instead of just locally.
Set the address in the `service::telemetry::metrics` section of the configuration:
```yaml
service:
  telemetry:
    metrics:
      address: ":8888"
```
A Grafana dashboard for these metrics can be found
[here](https://grafana.com/grafana/dashboards/15983-opentelemetry-collector/).
You can adjust the metrics telemetry level using the `level` field. The following is a list of all possible values and their explanations:
- `none` indicates that no telemetry data should be collected.
- `basic` is the recommended level and covers the basics of the service telemetry.
- `normal` adds some other indicators on top of `basic`.
- `detailed` adds dimensions and views to the previous levels.
For example:
```yaml
service:
  telemetry:
    metrics:
      level: detailed
      address: ":8888"
```
Also note that a Collector can be configured to scrape its own metrics and send
them through its configured pipelines. For example:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otelcol'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop
exporters:
  debug:
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [debug]
```
### Traces
The OpenTelemetry Collector has the ability to send its own traces using the OTLP exporter. You can send the traces to an OTLP server running on the same Collector, so they go through the configured pipelines. For example:
```yaml
service:
  telemetry:
    traces:
      processors:
        batch:
          exporter:
            otlp:
              protocol: grpc/protobuf
              endpoint: ${MY_POD_IP}:4317
```
### zPages
The
[zpages](https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension/README.md)
extension, which is exposed locally on port `55679` when enabled, can be used to
check trace operations of receivers and exporters via `/debug/tracez`. The
`zpages` output may contain error logs that the Collector itself does not emit.
For containerized environments it may be desirable to expose this port on a
public interface instead of just locally. This can be configured via the
extensions configuration section. For example:
```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679
```
### Local exporters
[Local
exporters](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#general-information)
can be configured to inspect the data being processed by the Collector.
For live troubleshooting purposes, consider leveraging the `debug` exporter,
which can be used to confirm that data is being received, processed, and
exported by the Collector.
```yaml
receivers:
  zipkin:
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: []
      exporters: [debug]
```
Get a Zipkin payload to test. For example, create a file called `trace.json`
that contains:
```json
[
  {
    "traceId": "5982fe77008310cc80f1da5e10147519",
    "parentId": "90394f6bcffb5d13",
    "id": "67fae42571535f60",
    "kind": "SERVER",
    "name": "/m/n/2.6.1",
    "timestamp": 1516781775726000,
    "duration": 26000,
    "localEndpoint": {
      "serviceName": "api"
    },
    "remoteEndpoint": {
      "serviceName": "apip"
    },
    "tags": {
      "data.http_response_code": "201"
    }
  }
]
```
With the Collector running, send this payload to the Collector. For example:
```console
$ curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json
```
You should see a log entry like the following from the Collector:
```
2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2}
```
You can also configure the `debug` exporter so the entire payload is printed:
```yaml
exporters:
  debug:
    verbosity: detailed
```
With the modified configuration, if you re-run the test above, the log output should look like:
```
2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2}
2023-09-07T09:57:12.821-0700 info ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
-> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
Trace ID : 0c636f29e29816ea76e6a5b8cd6601cf
Parent ID : 1a08eba9395c5243
ID : 10cebe4b63d47cae
Name : okey-dokey
Kind : Internal
Start time : 2023-09-07 16:57:12.045933 +0000 UTC
End time : 2023-09-07 16:57:12.046058 +0000 UTC
Status code : Unset
Status message :
Attributes:
-> span.kind: Str(server)
-> net.peer.ip: Str(1.2.3.4)
-> peer.service: Str(telemetrygen)
```
### Health Check
The
[health_check](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md)
extension, which by default is available on all interfaces on port `13133`, can
be used to ensure the Collector is functioning properly.
```yaml
extensions:
  health_check:
service:
  extensions: [health_check]
```
It returns a response like the following:
```json
{
  "status": "Server available",
  "upSince": "2020-11-11T04:12:31.6847174Z",
  "uptime": "49.0132518s"
}
```
### pprof
The
[pprof](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/pprofextension/README.md)
extension, which by default is available locally on port `1777`, allows you to profile the
Collector as it runs. This is an advanced use-case that should not be needed in most circumstances.
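If you do need it, enabling the extension follows the same pattern as the others. A minimal sketch, assuming the documented default endpoint:
```yaml
extensions:
  pprof:
    endpoint: localhost:1777  # assumed default; adjust if you expose it differently
service:
  extensions: [pprof]
```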
## Common Issues
To see logs for the Collector:
On a Linux systemd system, logs can be found using `journalctl`:
`journalctl | grep otelcol`
or to find only errors:
`journalctl | grep otelcol | grep Error`
### Collector exit/restart
The Collector may exit/restart because:
- Memory pressure due to a missing or misconfigured
[memory_limiter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md)
processor (see the sketch after this list).
- Improperly sized for load.
- Improperly configured (for example, a queue size configured higher
than available memory).
- Infrastructure resource limits (for example Kubernetes).
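For the first cause, here is a minimal sketch of a `memory_limiter` placed at the start of a pipeline. The percentages and interval are illustrative starting points rather than tuned recommendations, and the `otlp` receiver and `debug` exporter are stand-ins for your own components.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # illustrative soft limit
    spike_limit_percentage: 25  # illustrative spike headroom
  batch:
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter should be the first processor in the pipeline.
      processors: [memory_limiter, batch]
      exporters: [debug]
```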
### Data being dropped
Data may be dropped for a variety of reasons, but most commonly because of:
- An improperly sized Collector, resulting in the Collector being unable to process and export the data as fast as it is received.
- An exporter destination that is unavailable or accepting the data too slowly.
To mitigate drops, it is highly recommended to configure the
[batch](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md)
processor. In addition, it may be necessary to configure the [queued retry
options](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#configuration)
on enabled exporters.
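A hedged sketch of what that combination might look like on an OTLP exporter. The numbers shown are illustrative values to adapt, not recommendations, and `backend:4317` is a placeholder endpoint.
```yaml
processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms
exporters:
  otlp:
    endpoint: backend:4317  # placeholder backend
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```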
### Receiving data not working
If you are unable to receive data, this is likely because:
- There is a network configuration issue
- The receiver configuration is incorrect
- The receiver is defined in the `receivers` section, but not enabled in any `pipelines`
- The client configuration is incorrect
Check the Collector logs as well as `zpages` for potential issues.
### Processing data not working
Most processing issues are a result of either a misunderstanding of how the
processor works or a misconfiguration of the processor.
Examples of misunderstanding include:
- The attributes processors only work on "tags" on spans; the span name is
handled by the span processor (see the sketch after this list).
- Processors for trace data (except tail sampling) work on individual spans.
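To make that distinction concrete, here is a sketch using the `attributes` and `span` processors (both from opentelemetry-collector-contrib). The attribute key, value, and naming rule are made-up examples.
```yaml
processors:
  # Works on span attributes ("tags"), not on the span name.
  attributes:
    actions:
      - key: environment   # hypothetical attribute
        value: production
        action: insert
  # Renaming a span is the span processor's job.
  span:
    name:
      from_attributes: [db.system, db.name]
      separator: "::"
```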
### Exporting data not working
If you are unable to export to a destination, this is likely because:
- There is a network configuration issue
- The exporter configuration is incorrect
- The destination is unavailable
Check the Collector logs as well as `zpages` for potential issues.
More often than not, exporting data does not work because of a network
configuration issue. This could be due to a firewall, DNS, or proxy
issue. Note that the Collector does have
[proxy support](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#proxy-support).
### Startup failing in Windows Docker containers (v0.90.1 and earlier)
The process may fail to start in a Windows Docker container with the following
error: `The service process could not connect to the service controller`. In
this case, the `NO_WINDOWS_SERVICE=1` environment variable should be set to force
the Collector to start as if it were running in an interactive terminal,
without attempting to run as a Windows service.
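For example, when running through Docker Compose, the variable could be set as follows. This is only a sketch; the service name and image tag are illustrative.
```yaml
services:
  otelcol:
    image: otel/opentelemetry-collector:0.90.1  # illustrative tag
    environment:
      - NO_WINDOWS_SERVICE=1
```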
### Null Maps in Configuration
If you experience issues during configuration resolution where sections from an earlier configuration, such as `processors:`, are removed, see the [confmap troubleshooting guide](../confmap/README.md#troubleshooting).
[Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/

View File

@@ -8,7 +8,7 @@ This is a living document that is expected to evolve over time.
Highly stable and performant under varying loads. Well-behaved under extreme load, with predictable, low resource consumption.
## Observable
Expose own operational metrics in a clear way. Be an exemplar of an observable service. Allow configuring the level of observability (more or fewer metrics, traces, logs, etc. reported). See [more details](observability.md).
Expose own operational metrics in a clear way. Be an exemplar of an observable service. Allow configuring the level of observability (more or fewer metrics, traces, logs, etc. reported). See [more details](https://opentelemetry.io/docs/collector/internal-telemetry/).
## Multi-Data
Support traces, metrics, logs and other relevant data types.

View File

@@ -18,7 +18,7 @@ Outputs telemetry data to the console for debugging purposes.
See also the [Troubleshooting][troubleshooting_docs] document for examples on using this exporter.
[troubleshooting_docs]: ../../docs/troubleshooting.md
[troubleshooting_docs]: https://opentelemetry.io/docs/collector/troubleshooting/#local-exporters
## Getting Started