Commit Graph

14 Commits

Author SHA1 Message Date
Pablo Baeyens f846829309
[chore] Document scalability, performance and resiliency requirements for stable components (#12994)
#### Description

This updates the stability requirements to mandate certain benchmarking, testing, and documentation, so that scalability, resiliency, and performance expectations are documented to at least a minimum level.

To operationalize this, I will work on:
- Adding automated context propagation tests for components for which it
is easy to do so (connectors and processors)
- Adding a new section to the table with a link to the latest
benchmarking results.

#### Link to tracking issue

Fixes #11866
Fixes #11868
Fixes #11593
2025-05-12 11:54:44 +00:00
Pablo Baeyens 7adca809a6
[chore] Add testing requirements for stable components (#12971)
#### Description

Adds wording regarding testing requirements for stable components. The
intent is for the lifecycle tests to be handled via mdatagen.

This follows the work done in
open-telemetry/opentelemetry-collector-contrib/issues/39543, which now
gives us test coverage per component.

#### Link to tracking issue
Fixes #11867
2025-05-07 08:15:47 +00:00
Pablo Baeyens 28ca163a92
[docs/component-stability.md] Add criteria for graduating between stability levels (#11864)
#### Description

Code ownership and maintenance of components continues to be an issue,
with varying levels of support across contrib. As we approach 1.0 and
the ability to mark components as stable, we want to make sure that
components that we deem as 'stable' have a healthy community around
them. We have three datapoints that we can leverage here: how many
codeowners a component has, how diverse these are in terms of employers
and how actively the codeowners have been responding to issues/PRs in
the recent past.

We need criteria that:
1. Are reasonable predictors of component health over the short/medium term
2. Are not too onerous on the code owners

Some notes:
1. Some beta components do not meet the criteria listed in this PR. This
will remain the case for some components even after the transition. This
PR makes no claim as to what should happen to these components'
stability (so, de facto, they will stay as is).
2. The OTLP receiver and exporters do not meet these criteria today
because they don't have listed code owners. We can solve this either by
carving out an exception or by listing code owners.
3. We need automation and templates to enforce this.

#### Link to tracking issue
Fixes #11850

---------

Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
2025-04-14 09:00:01 +00:00
Christos Markou a9c5de2b65
[mdatagen] Add deprecation date and migration note fields for deprecated components (#12464)
#### Description

This PR adds deprecation date and migration note fields for deprecated
components as described at
https://github.com/open-telemetry/opentelemetry-collector/issues/12359.

Example metadata file:

```yaml
status:
  class: receiver
  stability:
    development: [logs]
    beta: [traces]
    stable: [metrics]
    deprecated: [profiles]
  deprecation:
    profiles:
      migration: "no migration needed"
      date: "2025-02-05"
```

Example README.md:

| Status        |           |
| ------------- |-----------|
| Stability     | [deprecated]: profiles   |
|               | [development]: logs   |
|               | [beta]: traces   |
|               | [stable]: metrics   |
| Deprecation of profiles | [Date]: 2025-02-05   |
|                      | [Migration Note]: no migration needed   |

[deprecated]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecated
[development]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#development
[beta]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#beta
[stable]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#stable
[Date]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecation-information
[Migration Note]:
https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#deprecation-information

I'd appreciate any suggestions if there is a better way to represent
this information in the markdown table.

#### Link to tracking issue

Fixes
https://github.com/open-telemetry/opentelemetry-collector/issues/12359

#### Testing

Added

#### Documentation

Added


/cc @atoulme who filed the feature request issue originally.

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
2025-03-04 19:12:54 +00:00
Jade Guiton 1bb0469b16
[service] Move batchprocessor metrics to normal level and update level guidelines (#12525)
#### Description

This PR:
- requires "level: normal" before outputting batch processor metrics
(one specific metric remains restricted to "level: detailed")
- clarifies the wording of the telemetry level guidelines and
documentation, and adds said guidelines to the requirements for stable
components.
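
For context, the metrics verbosity level is set in the Collector's service telemetry configuration; a minimal sketch of a config that keeps these metrics enabled (exact defaults may vary by Collector version):

```yaml
service:
  telemetry:
    metrics:
      # "normal" now includes the batch processor metrics;
      # "basic" suppresses them.
      level: normal
```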

Some rationale for these changes can be found in the tracking issue and
[this
comment](https://github.com/open-telemetry/opentelemetry-collector/issues/7890#issuecomment-2684652956).

#### Link to tracking issue
Resolves #7890

#### To be discussed

Should we add a feature gate for this, in case a user relies on "level:
basic" outputting batch processor metrics? This feels like a niche use
case, so considering the "alpha" stability level of these metrics, I
don't think it's really necessary.

Considering batch processor metrics had already been switched to
"normal" once (#9767), but were turned back to basic at some later point
(not sure when), we might also want to add tests to avoid further
regressions (especially as the handling of telemetry levels is bound to
change further with #11754).

---------

Co-authored-by: Dmitrii Anoshin <anoshindx@gmail.com>
2025-03-04 15:57:23 +00:00
Pablo Baeyens 54c3cd8596
[chore][docs/component-stability.md] Add documentation requirements for components based on their stability level (#11871)
#### Description

Adds requirements for documentation for different stability levels.

I expect many of these will be done through automation over time :)

#### Link to tracking issue

Fixes #11852
2025-01-22 16:59:29 +00:00
Jade Guiton f70a4b158d
[chore] Reword component stability doc (#12161)
#### Description

This PR slightly changes the wording of the "Stability levels and
versioning" doc (`docs/component-stability.md`), which I found a bit
confusing, in order to:
- Emphasize the important fact that stability levels for a component are
defined _per signal_. At the moment this is only alluded to at the
beginning and assumed in the last section. Moreover, things like the
"Unmaintained" level may give the impression that stability levels
always apply to an entire component.
- More cleanly separate the part about behavior changes from the part
about API changes in the "Versioning" section.

This should not change the content or interpretation of the document.
2025-01-22 16:22:04 +00:00
Pablo Baeyens 50104db5f3
[chore][docs/component-stability.md] Add a 'Moving between stability levels' section (#11937)
#### Description

Split off from #11864, this describes how graduation would work without
any additional criteria.

Rendered diagram:


```mermaid
stateDiagram-v2
    state Maintained {
    InDevelopment --> Alpha
    Alpha --> Beta
    Beta --> Stable
    }
    InDevelopment: In Development
    Maintained --> Unmaintained
    Unmaintained --> Maintained
    Maintained --> Deprecated
    Deprecated --> Maintained: (should be rare)
```

---------

Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
2024-12-19 11:26:26 +00:00
Pablo Baeyens 26f0fcfe90
[chore] [docs/component-stability.md] Document relationship between versioning and stability (#11863)
#### Description

Documents relationship between component stability and versioning.

#### Link to tracking issue
Fixes #11851
2024-12-16 16:06:01 +00:00
Jade Guiton 8ac40a01a5
Define observability requirements for stable components (#11772)
## Description

This PR defines observability requirements for components at the
"Stable" stability levels. The goal is to ensure that Collector
pipelines are properly observable, to help in debugging configuration
issues.

#### Approach

- The requirements are deliberately not too specific, in order to be
adaptable to each specific component, and so as to not over-burden
component authors.
- After discussing it with @mx-psi, this list of requirements explicitly
includes things that may end up being emitted automatically as part of
the Pipeline Instrumentation RFC (#11406), with only a note at the
beginning explaining that not everything may need to be implemented
manually.

Feel free to share if you don't think this is the right approach for
these requirements.

#### Link to tracking issue
Resolves #11581

## Important note regarding the Pipeline Instrumentation RFC

I included this paragraph in the part about error count metrics:
> The goal is to be able to easily pinpoint the source of data loss in
the Collector pipeline, so this should either:
>   - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an
external service, or propagated from downstream Collector components.

The [Pipeline Instrumentation
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
(hereafter abbreviated "PI"), once implemented, should allow monitoring
component errors via the `outcome` attribute, which is either `success`
or `failure`, depending on whether the `Consumer` API call returned an
error.

Note that this does not work for receivers, or allow differentiating
between different types of errors; for that reason, I believe additional
component-specific error metrics will often still be required, but it
would be nice to cover as many cases as possible automatically.

However, at the moment, errors are (usually) propagated upstream through
the chain of `Consume` calls, so in case of error the `failure` state
will end up applied to all components upstream of the actual source of
the error. This means the PI metrics do not fit the first bullet point.

Moreover, I would argue that even post-processing the PI metrics does
not reliably allow distinguishing the ultimate source of errors (the
second bullet point). One simple idea is to compute
`consumed.items{outcome:failure} - produced.items{outcome:failure}` to
get the number of errors originating in a component. But this only works
if output items map one-to-one to input items: if a processor or
connector outputs fewer items than it consumes (because it aggregates
them, or translates to a different signal type), this formula will
return false positives. If these false positives are mixed with real
errors from the component and/or from downstream, the situation becomes
impossible to analyze by just looking at the metrics.
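
To make the false-positive concern concrete, here is a toy sketch with made-up numbers (not from any real pipeline): an aggregating processor consumes 10 items, emits 1 aggregated item, and the downstream exporter rejects that item, propagating the error back up the `Consume` chain.

```python
# Hypothetical PI-style counters for an aggregating processor.
# The downstream exporter fails on the single aggregated output, and the
# error propagates upstream, so both counters land under outcome=failure.
consumed_failure = 10  # all 10 inputs counted as outcome=failure
produced_failure = 1   # the 1 aggregated output, also outcome=failure

# The proposed post-processing formula for "errors originating here":
internal_errors = consumed_failure - produced_failure

# The processor itself caused no errors, yet the formula reports 9,
# purely because outputs don't map one-to-one to inputs.
assert internal_errors == 9
```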

For these reasons, I believe we should do one of four things:
1. Change the way we use the `Consumer` API to no longer propagate
errors, making the PI metric outcomes more precise.
We could catch errors in whatever wrapper we already use to emit the PI
metrics, log them for posterity, and simply not propagate them.
Note that some components already more or less do this, such as the
`batchprocessor`, but this option may in principle break components
which rely on downstream errors (for retry purposes for example).
2. Keep propagating errors, but modify or extend the RFC to require
distinguishing between internal and propagated errors (maybe add a third
`outcome` value, or add another attribute).
This could be implemented by somehow propagating additional state from
one `Consume` call to another, allowing us to establish the first
appearance of a given error value in the pipeline.
3. Loosen this requirement so that the PI metrics suffice in their
current state.
4. Leave everything as-is and make component authors implement their own
somewhat redundant error count metrics.

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
2024-12-16 09:16:23 +00:00
Tyler Helmuth 4ed80bbc4d
[chore] add text about unmaintained vendor components (#11616)
#### Description

This PR adds language to our definition of `unmaintained` that allows
vendor-specific components to become unmaintained if they lose an active
code owner from the contributing vendor. This is necessary to prevent a
vendor-specific component from being maintained only by the community
instead of by the vendor. Since vendors had privileged access when
getting components accepted into contrib, they must be held to a higher
standard for maintaining those components.

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
2024-12-03 16:16:19 -08:00
Alex Boten 164c28a60e
update time period before removing an unmaintained component (#11664)
After reviewing the 6-month period, it seems too long for gathering
feedback on whether an unmaintained component should be removed. I'm
proposing shortening it to 3 months.

---------

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
2024-11-13 09:38:22 -08:00
Pablo Baeyens 37184b0b3f
[chore][docs] Move configuration changes guidelines to component stability doc (#11572)
#### Description

Works towards #11553. Unifies information about component stability in a
single document.

#### Link to tracking issue
Fixes #11571
2024-10-30 15:55:05 +01:00
Pablo Baeyens ad37bac82f
[chore][docs] Move component stability to a separate document (#11561)
#### Description

The goal is to work towards #11553 in this new document. This only
copies the contents of README.md verbatim.

#### Link to tracking issue
Fixes #11560
2024-10-30 09:50:11 +01:00