Commit Graph

14 Commits

Author SHA1 Message Date
Pablo Baeyens f846829309
[chore] Document scalability, performance and resiliency requirements for stable components (#12994)
#### Description

This updates the stability requirements to mandate certain benchmarking, testing, and documentation, so that scalability, resiliency, and performance expectations are documented to at least a minimum level.

To operationalize this, I will work on:
- Adding automated context propagation tests for components for which it
is easy to do so (connectors and processors)
- Adding a new section to the table with a link to the latest
benchmarking results.

#### Link to tracking issue

Fixes #11866
Fixes #11868
Fixes #11593
2025-05-12 11:54:44 +00:00
Pablo Baeyens 7adca809a6
[chore] Add testing requirements for stable components (#12971)
#### Description

Adds wording regarding testing requirements for stable components. The
intent is for the lifecycle tests to be handled via mdatagen.

This follows the work done in
open-telemetry/opentelemetry-collector-contrib/issues/39543, which now
gives us test coverage per component.

#### Link to tracking issue
Fixes #11867
2025-05-07 08:15:47 +00:00
Pablo Baeyens 28ca163a92
[docs/component-stability.md] Add criteria for graduating between stability levels (#11864)
#### Description

Code ownership and maintenance of components continues to be an issue,
with varying levels of support across contrib. As we approach 1.0 and
the ability to mark components as stable, we want to make sure that
components that we deem as 'stable' have a healthy community around
them. We have three datapoints that we can leverage here: how many
codeowners a component has, how diverse these are in terms of employers
and how actively the codeowners have been responding to issues/PRs in
the recent past.

We need criteria that:
1. Are reasonable predictors of component health over the short/medium term
2. Are not too onerous on the code owners

Some notes:
1. Some beta components do not meet the criteria listed in this PR. This
will remain the case for some components even after the transition. This
PR makes no claim as to what should happen to these components'
stability (so, de facto, they will stay as is).
2. The OTLP receiver and exporters do not meet these criteria today
because they don't have listed code owners. We can solve this either by
carving out an exception or by listing code owners.
3. We need automation and templates to enforce this.

#### Link to tracking issue
Fixes #11850

---------

Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
2025-04-14 09:00:01 +00:00
Christos Markou a9c5de2b65
[mdatagen] Add deprecation date and migration note fields for deprecated components (#12464)
#### Description

This PR adds deprecation date and migration note fields for deprecated
components as described at
https://github.com/open-telemetry/opentelemetry-collector/issues/12359.

Example metadata file:

```yaml
status:
  class: receiver
  stability:
    development: [logs]
    beta: [traces]
    stable: [metrics]
    deprecated: [profiles]
  deprecation:
    profiles:
      migration: "no migration needed"
      date: "2025-02-05"
```

Example README.md:

| Status        |           |
| ------------- |-----------|
| Stability     | [deprecated]: profiles   |
|               | [development]: logs   |
|               | [beta]: traces   |
|               | [stable]: metrics   |
| Deprecation of profiles | [Date]: 2025-02-05   |
|                      | [Migration Note]: no migration needed   |

[deprecated]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecated
[development]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#development
[beta]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#beta
[stable]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#stable
[Date]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecation-information
[Migration Note]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecation-information

I'd appreciate any suggestions if there is a better way to represent
this information in the markdown table.

#### Link to tracking issue

Fixes
https://github.com/open-telemetry/opentelemetry-collector/issues/12359

#### Testing

Added

#### Documentation

Added


/cc @atoulme who filed the feature request issue originally.

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
2025-03-04 19:12:54 +00:00
Jade Guiton 1bb0469b16
[service] Move batchprocessor metrics to normal level and update level guidelines (#12525)
#### Description

This PR:
- requires "level: normal" before outputting batch processor metrics
(one specific metric remains restricted to "level: detailed")
- clarifies the wording of the telemetry level guidelines and
documentation, and adds said guidelines to the requirements for stable
components.
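
For context, the metrics verbosity level is set in the Collector's service telemetry configuration; a minimal sketch of a config that keeps these metrics enabled (exact defaults may vary by Collector version):

```yaml
service:
  telemetry:
    metrics:
      # "normal" now includes the batch processor metrics;
      # "basic" suppresses them.
      level: normal
```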

Some rationale for these changes can be found in the tracking issue and
[this
comment](https://github.com/open-telemetry/opentelemetry-collector/issues/7890#issuecomment-2684652956).

#### Link to tracking issue
Resolves #7890

#### To be discussed

Should we add a feature gate for this, in case a user relies on "level:
basic" outputting batch processor metrics? This feels like a niche use
case, so considering the "alpha" stability level of these metrics, I
don't think it's really necessary.

Considering batch processor metrics had already been switched to
"normal" once (#9767), but were turned back to basic at some later point
(not sure when), we might also want to add tests to avoid further
regressions (especially as the handling of telemetry levels is bound to
change further with #11754).

---------

Co-authored-by: Dmitrii Anoshin <anoshindx@gmail.com>
2025-03-04 15:57:23 +00:00
Pablo Baeyens 54c3cd8596
[chore][docs/component-stability.md] Add documentation requirements for components based on their stability level (#11871)
#### Description

Adds requirements for documentation for different stability levels.

I expect many of these will be done through automation over time :)

#### Link to tracking issue

Fixes #11852
2025-01-22 16:59:29 +00:00
Jade Guiton f70a4b158d
[chore] Reword component stability doc (#12161)
#### Description

This PR slightly changes the wording of the "Stability levels and
versioning" doc (`docs/component-stability.md`), which I found a bit
confusing, in order to:
- Emphasize the important fact that stability levels for a component are
defined _per signal_. At the moment this is only alluded to at the
beginning and assumed in the last section. Moreover, things like the
"Unmaintained" level may give the impression that stability levels
always apply to an entire component.
- More cleanly separate the part about behavior changes from the part
about API changes in the "Versioning" section.

This should not change the content or interpretation of the document.
2025-01-22 16:22:04 +00:00
Pablo Baeyens 50104db5f3
[chore][docs/component-stability.md] Add a 'Moving between stability levels' section (#11937)
#### Description

Split off from #11864, this describes how graduation would work without
any additional criteria.

Rendered diagram:


```mermaid
stateDiagram-v2
    state Maintained {
    InDevelopment --> Alpha
    Alpha --> Beta
    Beta --> Stable
    }
    InDevelopment: In Development
    Maintained --> Unmaintained
    Unmaintained --> Maintained
    Maintained --> Deprecated
    Deprecated --> Maintained: (should be rare)
```

---------

Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
2024-12-19 11:26:26 +00:00
Pablo Baeyens 26f0fcfe90
[chore] [docs/component-stability.md] Document relationship between versioning and stability (#11863)
#### Description

Documents relationship between component stability and versioning.

#### Link to tracking issue
Fixes #11851
2024-12-16 16:06:01 +00:00
Jade Guiton 8ac40a01a5
Define observability requirements for stable components (#11772)
## Description

This PR defines observability requirements for components at the
"Stable" stability levels. The goal is to ensure that Collector
pipelines are properly observable, to help in debugging configuration
issues.

#### Approach

- The requirements are deliberately not too specific, in order to be
adaptable to each specific component, and so as to not over-burden
component authors.
- After discussing it with @mx-psi, this list of requirements explicitly
includes things that may end up being emitted automatically as part of
the Pipeline Instrumentation RFC (#11406), with only a note at the
beginning explaining that not everything may need to be implemented
manually.

Feel free to share if you don't think this is the right approach for
these requirements.

#### Link to tracking issue
Resolves #11581

## Important note regarding the Pipeline Instrumentation RFC

I included this paragraph in the part about error count metrics:
> The goal is to be able to easily pinpoint the source of data loss in
the Collector pipeline, so this should either:
>   - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an
external service, or propagated from downstream Collector components.

The [Pipeline Instrumentation
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
(hereafter abbreviated "PI"), once implemented, should allow monitoring
component errors via the `outcome` attribute, which is either `success`
or `failure`, depending on whether the `Consumer` API call returned an
error.

Note that this does not work for receivers, or allow differentiating
between different types of errors; for that reason, I believe additional
component-specific error metrics will often still be required, but it
would be nice to cover as many cases as possible automatically.

However, at the moment, errors are (usually) propagated upstream through
the chain of `Consume` calls, so in case of error the `failure` state
will end up applied to all components upstream of the actual source of
the error. This means the PI metrics do not fit the first bullet point.

Moreover, I would argue that even post-processing the PI metrics does
not reliably allow distinguishing the ultimate source of errors (the
second bullet point). One simple idea is to compute
`consumed.items{outcome:failure} - produced.items{outcome:failure}` to
get the number of errors originating in a component. But this only works
if output items map one-to-one to input items: if a processor or
connector outputs fewer items than it consumes (because it aggregates
them, or translates to a different signal type), this formula will
return false positives. If these false positives are mixed with real
errors from the component and/or from downstream, the situation becomes
impossible to analyze by just looking at the metrics.
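
To make the false-positive concern concrete, here is a toy sketch with made-up numbers (not from any real pipeline): an aggregating processor consumes 10 items, emits 1 aggregated item, and the downstream exporter rejects that item, propagating the error back up the `Consume` chain.

```python
# Hypothetical PI-style counters for an aggregating processor.
# The downstream exporter fails on the single aggregated output, and the
# error propagates upstream, so both counters land under outcome=failure.
consumed_failure = 10  # all 10 inputs counted as outcome=failure
produced_failure = 1   # the 1 aggregated output, also outcome=failure

# The proposed post-processing formula for "errors originating here":
internal_errors = consumed_failure - produced_failure

# The processor itself caused no errors, yet the formula reports 9,
# purely because outputs don't map one-to-one to inputs.
assert internal_errors == 9
```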

For these reasons, I believe we should do one of four things:
1. Change the way we use the `Consumer` API to no longer propagate
errors, making the PI metric outcomes more precise.
We could catch errors in whatever wrapper we already use to emit the PI
metrics, log them for posterity, and simply not propagate them.
Note that some components already more or less do this, such as the
`batchprocessor`, but this option may in principle break components
which rely on downstream errors (for retry purposes for example).
2. Keep propagating errors, but modify or extend the RFC to require
distinguishing between internal and propagated errors (maybe add a third
`outcome` value, or add another attribute).
This could be implemented by somehow propagating additional state from
one `Consume` call to another, allowing us to establish the first
appearance of a given error value in the pipeline.
3. Loosen this requirement so that the PI metrics suffice in their
current state.
4. Leave everything as-is and make component authors implement their own
somewhat redundant error count metrics.

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
2024-12-16 09:16:23 +00:00
Tyler Helmuth 4ed80bbc4d
[chore] add text about unmaintained vendor components (#11616)
#### Description

This PR adds language to our definition of `unmaintained` that allows
vendor-specific components to become unmaintained if they lose an active
code owner from the contributing vendor. This is necessary to prevent a
vendor-specific component from being maintained only by the community
instead of by the vendor. Since vendors had privileged access when
getting components accepted into contrib, they must be held to a higher
standard for maintaining those components.

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
2024-12-03 16:16:19 -08:00
Alex Boten 164c28a60e
update time period before removing an unmaintained component (#11664)
After reviewing the 6-month period, it seems too long for gathering
feedback on whether an unmaintained component should be removed. I'm
proposing shortening it to 3 months.

---------

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
2024-11-13 09:38:22 -08:00
Pablo Baeyens 37184b0b3f
[chore][docs] Move configuration changes guidelines to component stability doc (#11572)
#### Description

Works towards #11553. Unifies information about component stability in a
single document.

#### Link to tracking issue
Fixes #11571
2024-10-30 15:55:05 +01:00
Pablo Baeyens ad37bac82f
[chore][docs] Move component stability to a separate document (#11561)
#### Description

The goal is to work towards #11553 in this new document. This only
copies the contents of README.md verbatim.

#### Link to tracking issue
Fixes #11560
2024-10-30 09:50:11 +01:00