opentelemetry-collector/processor/batchprocessor
Tyler Helmuth 77bb849aa0
[component] Refactor to use pipeline.ID and pipeline.Signal (#11204)
#### Description
Depends on
https://github.com/open-telemetry/opentelemetry-collector/pull/11209

This PR is a non-breaking implementation of
https://github.com/open-telemetry/opentelemetry-collector/pull/10947. It
adds a new module, `pipeline`, which houses a `pipeline.ID` and
`pipeline.Signal`. `pipeline.ID` is used to identify a pipeline within
the service. `pipeline.Signal` is used to identify the signal associated
with a pipeline.
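
For illustration, here is a minimal sketch of how the new types could be used from Go. It assumes the go.opentelemetry.io/collector/pipeline import path and the NewID/NewIDWithName constructors introduced by this work; treat the exact names as indicative rather than final.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pipeline"
)

func main() {
	// Sketch: constructor names assumed from the PR description.
	// Identify a pipeline by its signal alone, e.g. "traces".
	tracesID := pipeline.NewID(pipeline.SignalTraces)

	// Identify a named pipeline of the same signal, e.g. "traces/backend".
	backendID := pipeline.NewIDWithName(pipeline.SignalTraces, "backend")

	fmt.Println(tracesID.String())  // traces
	fmt.Println(backendID.String()) // traces/backend

	// The signal is recoverable from the pipeline ID.
	fmt.Println(backendID.Signal() == pipeline.SignalTraces) // true
}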

I do this work begrudgingly. As the PR shows, this is a huge refactor
when done in a non-breaking way; it will require 3 full releases and
doesn't benefit our [End Users or, in my opinion, our Component
Developers or Collector Library
Users](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CONTRIBUTING.md#target-audiences).
I view this refactor as a Nice-To-Have, not a requirement for Component
1.0.

#### Link to tracking issue
Works towards
https://github.com/open-telemetry/opentelemetry-collector/issues/9429
2024-09-23 07:38:59 -07:00
internal/metadata [mdatagen] Avoid public APIs with internal params (#11040) 2024-09-13 09:37:43 -07:00
testdata [chore] change config tests to unmarshal only the config for that component (#5895) 2022-08-11 06:57:13 -07:00
Makefile splitting batch/memorylimiter processors into their own modules (#6427) 2022-11-03 08:32:19 -07:00
README.md [mdatagen] support using a different github project in mdatagen README issues list (#10677) 2024-08-22 13:18:16 -07:00
batch_processor.go [processor] deprecate CreateSettings -> Settings (#10336) 2024-06-06 09:34:53 -07:00
batch_processor_test.go [chore]: enable require-error rule from testifylint (#11199) 2024-09-18 15:02:22 -07:00
config.go [chore] minor fix in the doc (#7808) 2023-06-01 11:43:57 -07:00
config_test.go [chore]: enable require-error rule from testifylint (#11199) 2024-09-18 15:02:22 -07:00
documentation.md [batchprocessor] Update metric units (#10658) 2024-08-13 10:50:27 -07:00
factory.go [processor] deprecate CreateSettings -> Settings (#10336) 2024-06-06 09:34:53 -07:00
factory_test.go [processor] deprecate CreateSettings -> Settings (#10336) 2024-06-06 09:34:53 -07:00
generated_component_telemetry_test.go [mdatagen] add LeveledMeter method (#10933) 2024-08-21 08:05:13 -07:00
generated_component_test.go [chore] small test improvements (#11211) 2024-09-18 13:47:25 -07:00
generated_package_test.go [mdatagen] generate goleak package test (#9959) 2024-04-17 13:10:34 -07:00
go.mod [component] Refactor to use pipeline.ID and pipeline.Signal (#11204) 2024-09-23 07:38:59 -07:00
go.sum fix(deps): update module google.golang.org/grpc to v1.66.2 (#11187) 2024-09-18 10:33:58 +02:00
metadata.yaml [service] deprecate TelemetrySettings.MeterProvider (#10912) 2024-08-29 12:05:54 -07:00
metrics.go Remove obsreportconfig package, reduce dependencies (#11148) 2024-09-11 14:07:13 -07:00
splitlogs.go [chore] use license shortform (#7694) 2023-05-18 13:11:17 -07:00
splitlogs_test.go move internal/testdata to pdata/testdata (#9885) 2024-04-08 08:36:57 -07:00
splitmetrics.go [chore] use license shortform (#7694) 2023-05-18 13:11:17 -07:00
splitmetrics_test.go [chore]: enable bool-compare rule from testifylint (#10993) 2024-08-30 12:00:42 +02:00
splittraces.go [chore] use license shortform (#7694) 2023-05-18 13:11:17 -07:00
splittraces_test.go move internal/testdata to pdata/testdata (#9885) 2024-04-08 08:36:57 -07:00

README.md

Batch Processor

Status
Stability: beta (traces, metrics, logs)
Distributions: core, contrib, k8s

The batch processor accepts spans, metrics, or logs and places them into batches. Batching helps compress the data more effectively and reduces the number of outgoing connections required to transmit it. This processor supports both size- and time-based batching.

It is highly recommended to configure the batch processor on every collector. The batch processor should be defined in the pipeline after the memory_limiter as well as any sampling processors. This is because batching should happen after any data drops such as sampling.
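
For example, a traces pipeline that follows this recommendation might look like the sketch below (the probabilistic_sampler processor is illustrative and comes from the contrib distribution):

service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch comes after memory_limiter and any sampling processors
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp]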

Please refer to config.go for the config spec.

The following configuration options can be modified:

  • send_batch_size (default = 8192): Number of spans, metric data points, or log records after which a batch will be sent regardless of the timeout. send_batch_size acts as a trigger and does not affect the size of the batch. If you need to enforce batch size limits sent to the next component in the pipeline see send_batch_max_size.
  • timeout (default = 200ms): Time duration after which a batch will be sent regardless of size. If set to zero, send_batch_size is ignored, as data will be sent immediately, subject only to send_batch_max_size.
  • send_batch_max_size (default = 0): The upper limit of the batch size. 0 means no upper limit on the batch size. This property ensures that larger batches are split into smaller units. It must be greater than or equal to send_batch_size. See the combined sketch after this list for how it interacts with send_batch_size.
  • metadata_keys (default = empty): When set, this processor will create one batcher instance per distinct combination of values in the client.Metadata.
  • metadata_cardinality_limit (default = 1000): When metadata_keys is not empty, this setting limits the number of unique combinations of metadata key values that will be processed over the lifetime of the process.
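
As a sketch of how the two size settings interact (the values below are illustrative): a batch is emitted once 8192 items accumulate or 200ms elapses, whichever comes first, and any batch larger than 10000 items is split before being sent to the next component.

processors:
  batch:
    # illustrative values
    send_batch_size: 8192
    send_batch_max_size: 10000
    timeout: 200ms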

See notes about metadata batching below.

Examples:

This configuration contains one default batch processor and a second with custom settings. The batch/2 processor will buffer up to 10000 spans, metric data points, or log records for up to 10 seconds without splitting data items to enforce a maximum batch size.

processors:
  batch:
  batch/2:
    send_batch_size: 10000
    timeout: 10s

This configuration will enforce a maximum batch size limit of 10000 spans, metric data points, or log records without introducing any artificial delays.

processors:
  batch:
    send_batch_max_size: 10000
    timeout: 0s

Refer to config.yaml for detailed examples on using the processor.

Batching and client metadata

Batching by metadata enables support for multi-tenant OpenTelemetry Collector pipelines with batching over groups of data having the same authorization metadata. For example:

processors:
  batch:
    # batch data by tenant-id
    metadata_keys:
    - tenant_id

    # limit to 10 batcher processes before raising errors
    metadata_cardinality_limit: 10

Receivers should be configured with include_metadata: true so that metadata keys are available to the processor.
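
For example, an OTLP receiver that forwards client metadata to the pipeline might be configured like the sketch below (the choice of protocols is illustrative):

receivers:
  otlp:
    protocols:
      grpc:
        # makes client metadata visible to downstream processors
        include_metadata: true
      http:
        include_metadata: true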

Note that each distinct combination of metadata triggers the allocation of a new background task in the Collector that runs for the lifetime of the process, and each background task holds one pending batch of up to send_batch_size records. Batching by metadata can therefore substantially increase the amount of memory dedicated to batching.

The maximum number of distinct combinations is limited to the configured metadata_cardinality_limit, which defaults to 1000 to limit memory impact.

Users of the batching processor configured with metadata keys should consider using an Auth extension to validate the relevant metadata-key values.

The number of batch processors currently in use is exported as the otelcol_processor_batch_metadata_cardinality metric.