<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Code ownership and maintenance of components continue to be an issue,
with varying levels of support across contrib. As we approach 1.0 and
the ability to mark components as stable, we want to make sure that
components that we deem as 'stable' have a healthy community around
them. We have three datapoints that we can leverage here: how many
codeowners a component has, how diverse they are in terms of employers,
and how actively the codeowners have been responding to issues/PRs in
the recent past.
We need criteria that
1. Are reasonable predictors of the component health over the
short/medium term
2. Are not too onerous on the code owners
Some notes:
1. Some beta components do not meet the criteria listed on the PR. This
will be the case even after the transition for some components. This PR
makes no claim as to what should happen to these components' stability
(so, de facto, they will stay as is).
2. The OTLP receiver and exporters do not meet these criteria today
because they don't have listed code owners. We can solve this either by
carving out an exception or by listing code owners.
3. We need automation and templates to enforce this.
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #11850
---------
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
A few weeks ago, I mentioned to the Collector leads about my intention
to resign as maintainer/approver. My current focus on building
OllyGarden isn't leaving much room to be an approver or maintainer.
The plan right now is to ramp up again as approver/maintainer in the
future once time allows.
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Added cspell to check spelling in `.md` and `.yaml` files.
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #9287
<!--Describe what testing was performed and which tests were added.-->
#### Testing
<!--Describe the documentation added.-->
#### Documentation
<!--Please delete paragraphs that you did not use before submitting.-->
---------
Signed-off-by: Yuri Oliveira <yurimsa@gmail.com>
There was no mention of disabling the merge queue, which is needed if we
want to merge a commit directly (instead of squashing it).
Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
This PR changes the release workflow to autofill the release notes from
`CHANGELOG.md` and `CHANGELOG-API.md` into the generated GH release.
It makes use of `awk` and `sed` to build the release notes step by step
from the changelog files.
The [default chloggen
template](c43cb0331c/chloggen/internal/chlog/summary.tmpl)
was added and a `<!--preview-version-->` tag was added to easily filter
out the changelog of just the latest version.
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #10191
<!--Describe what testing was performed and which tests were added.-->
#### Testing
Tested on my fork.
Release with autofilled changelog:
https://github.com/mowies/opentelemetry-collector/releases/tag/v0.121.0
Workflow that did it:
https://github.com/mowies/opentelemetry-collector/actions/runs/13899615357/job/38888008499
<!--Describe the documentation added.-->
#### Documentation
The release checklist was updated accordingly.
<!--Please delete paragraphs that you did not use before submitting.-->
---------
Signed-off-by: Moritz Wiesinger <moritz.wiesinger@dynatrace.com>
#### Description
Once
[contrib#38534](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/38534)
is merged, the manual changes that were necessary in step 1 of releasing
contrib should now be included in step 2 (the Prepare Release CI
workflow). This PR updates the release doc to remove step 1.
#### Link to tracking issue
Updates #12294
#### Description
A very minor whitespace issue was preventing the list from formatting
correctly on one .md doc page. This fixes that _very minor_ issue.
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Simplifies description of automated release steps. While there is some
value in having the description of the automated steps somewhere, I
think this runs the risk of getting outdated and us having to look at
the code directly, so I would rather just remove it from here and
improve the comments/code of the automation over time. See
open-telemetry/opentelemetry-collector-releases/pull/856 for one
improvement of this kind.
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Updates #12533
#### Description
This PR:
- requires "level: normal" before outputting batch processor metrics (in
addition to one specific metric which was already restricted to "level:
detailed")
- clarifies wording in the telemetry level guidelines and documentation,
and adds said guidelines to the requirements for stable components.
Some rationale for these changes can be found in the tracking issue and
[this
comment](https://github.com/open-telemetry/opentelemetry-collector/issues/7890#issuecomment-2684652956).
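The gating described above amounts to a simple level comparison; a minimal sketch, using an illustrative `Level` type and helper rather than the Collector's actual `configtelemetry` API:

```go
package main

import "fmt"

// Level mirrors the Collector's telemetry verbosity levels; the type
// and the gating helper below are illustrative only.
type Level int

const (
	LevelNone Level = iota
	LevelBasic
	LevelNormal
	LevelDetailed
)

// recordIf reports whether a metric gated at minLevel should be
// emitted under the configured level. Under this change, most batch
// processor metrics would be gated at LevelNormal.
func recordIf(configured, minLevel Level) bool {
	return configured >= minLevel
}

func main() {
	fmt.Println(recordIf(LevelBasic, LevelNormal))    // false: batch metrics suppressed at "basic"
	fmt.Println(recordIf(LevelNormal, LevelNormal))   // true
	fmt.Println(recordIf(LevelNormal, LevelDetailed)) // false: the "detailed"-only metric stays off
}
```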
#### Link to tracking issue
Resolves #7890
#### To be discussed
Should we add a feature gate for this, in case a user relies on "level:
basic" outputting batch processor metrics? This feels like a niche use
case, so considering the "alpha" stability level of these metrics, I
don't think it's really necessary.
Considering batch processor metrics had already been switched to
"normal" once (#9767), but were turned back to basic at some later point
(not sure when), we might also want to add tests to avoid further
regressions (especially as the handling of telemetry levels is bound to
change further with #11754).
---------
Co-authored-by: Dmitrii Anoshin <anoshindx@gmail.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Add guidelines for naming Go modules
Note the 3-week window between .128 and .129, as we'll likely have OTel
Community Day on the week of June 23. An alternative to that is to have
the release on June 23 and assign to someone who knows already that they
won't be there anyway.
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
<!-- Issue number if applicable -->
Reworks breaking changes section to include information about our
approach to feature gates.
---------
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Fixes the `github_project` field in `metadata.yaml`, adds a
`go:generate` instruction to confmap, and adds a banner to the README.
### This PR
- adds the githubgen tool as a dependency in internal/tools
- uses githubgen to generate codeowners and issue template files
- updates lots of metadata files by
- taking the existing codeowners file and feeding the info from there
back into the component metadata.yaml files or creating new
metadata.yaml files where none existed yet
- adds distributions.yaml as a basis for the mostly already existing
`distributions:` keys in metadata.yaml files (needed for githubgen to
work correctly)
- adds relevant make commands to make the githubgen tool usage mostly
transparent to users
This change is a prerequisite to be able to ping codeowners reliably
with automated tooling as a next step.
Part of #11562
---------
Signed-off-by: Moritz Wiesinger <moritz.wiesinger@dynatrace.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
<!-- Issue number if applicable -->
Use h2 (hN-1) titles for h2 (hN-1) sections instead of h3 (hN)
### Context
The [Pipeline Component Telemetry
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
was recently accepted (#11406). The document states the following
regarding error monitoring:
> For both [consumed and produced] metrics, an `outcome` attribute with
possible values `success` and `failure` should be automatically
recorded, corresponding to whether or not the corresponding function
call returned an error. Specifically, consumed measurements will be
recorded with `outcome` as `failure` when a call from the previous
component to the `ConsumeX` function returns an error, and `success`
otherwise. Likewise, produced measurements will be recorded with
`outcome` as `failure` when a call to the next consumer's `ConsumeX`
function returns an error, and `success` otherwise.
[Observability requirements for stable pipeline
components](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#observability-requirements)
were also recently merged (#11772). The document states the following
regarding error monitoring:
> The goal is to be able to easily pinpoint the source of data loss in
the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an
external service, or propagated from downstream Collector components.
Because errors are typically propagated across `ConsumeX` calls in a
pipeline (except for components with an internal queue like
`processor/batch`), the error observability mechanism proposed by the
RFC implies that Pipeline Telemetry will record failures for every
component interface upstream of the component that actually emitted the
error, which does not match the goals set out in the observability
requirements, and makes it much harder to tell from the emitted
telemetry which component the errors are coming from.
### Description
This PR amends the Pipeline Component Telemetry RFC with the following:
- restrict the `outcome=failure` value to cases where the error comes
from the very next component (the component on which `ConsumeX` was
called);
- add a third possible value for the `outcome` attribute: `rejected`,
for cases where an error observed at an interface comes from further
downstream (the component did not "fail", but its output was
"rejected");
- propose a mechanism to determine which of the two values should be
used.
The current proposal for the mechanism is for the pipeline
instrumentation layer to wrap errors in an unexported `downstream`
struct, which upstream layers could check for with `errors.As` to check
whether the error has already been "attributed" to a component. This is
the same mechanism currently used for tracking permanent vs. retryable
errors. Please check the diff for details.
### Possible alternatives
There are a few alternatives to this amendment, which were discussed as
part of the observability requirements PR:
- loosen the observability requirements for stable components to not
require distinguishing internal errors from downstream ones → makes it
harder to identify the source of an error;
- modify the way we use the `Consumer` API to no longer propagate errors
upstream → prevents proper propagation of backpressure through the
pipeline (although this is likely already a problem with the `batch`
processor);
- let component authors make their own custom telemetry to solve the
problem → higher barrier to entry, especially for people wanting to
open-source existing components.
---------
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
<!--Describe the documentation added.-->
#### Documentation
This is a documentation-only change to fix some typos in the Pipeline
Component Telemetry RFC doc.
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
<!-- Issue number if applicable -->
Proposal for creating a new `collector-release-approvers` group.
Announced at:
- #otel-collector-dev on 2024-10-30:
https://cloud-native.slack.com/archives/C07CCCMRXBK/p1730307025302339
- Collector SIG on 2024-11-05 (TBD)
The stakeholders for this PR are:
- @open-telemetry/collector-approvers
- @open-telemetry/collector-contrib-approvers
---------
Co-authored-by: Andrzej Stencel <andrzej.stencel@elastic.co>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
<!-- Issue number if applicable -->
Adds requirements for documentation for different stability levels.
I expect many of these will be done through automation over time :)
#### Link to tracking issue
Fixes #11852
#### Description
This PR slightly changes the wording of the "Stability levels and
versioning" doc (`docs/component-stability.md`), which I found a bit
confusing, in order to:
- Emphasize the important fact that stability levels for a component are
defined _per signal_. At the moment this is only alluded to at the
beginning and assumed in the last section. Moreover, things like the
"Unmaintained" level may give the impression that stability levels
always apply to an entire component.
- More cleanly separate the part about behavior changes from the part
about API changes in the "Versioning" section.
This should not change the content or interpretation of the document.
This was done in the [specification
repo](6c626defb7),
and allows us to use a GitHub action instead of installing npm packages as
part of the build process (which kept bringing security warnings back)
Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
updates documentation to include changes in
https://github.com/open-telemetry/opentelemetry-collector-releases/pull/684
<!--Describe what testing was performed and which tests were added.-->
#### Testing
run locally and via workflows in jackgopack4 fork
<!--Describe the documentation added.-->
#### Documentation
updates to release.md in docs folder
<!--Please delete paragraphs that you did not use before submitting.-->
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
- Documents when to split into separate modules, including general rules
as well as specific conventions we are currently using
- Rephrases the wording on #11836 to add it into a general list.
- Documents how to split into separate modules.
<!-- Issue number if applicable -->
#### Link to tracking issue
Follows #11836, Fixes #11436, Fixes #11623
---------
Co-authored-by: Jade Guiton <jade.guiton@datadoghq.com>
#### Description
If more PRs are merged after the release PR commit, make sure to
checkout the release branch to the release PR commit rather than the
mainline head.
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Split off from #11864, describes how the graduation would work without
any additional criteria.
Rendered diagram:
```mermaid
stateDiagram-v2
state Maintained {
InDevelopment --> Alpha
Alpha --> Beta
Beta --> Stable
}
InDevelopment: In Development
Maintained --> Unmaintained
Unmaintained --> Maintained
Maintained --> Deprecated
Deprecated --> Maintained: (should be rare)
```
---------
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
#### Description
In #10058 I mentioned:
> There is a tangentially related issue with PermanentErrors and the
underlying finite state machine that governs transitions between
statuses. Currently, a PermanentError is a final state. That is, once a
component enters this state, no further transitions are allowed. In
light of the work I did on the alternative health check extension, I
believe we should allow a transition from PermanentError to Stopping to
consistently prioritize lifecycle events for components. This transition
also makes sense from a practical perspective. A component in a
PermanentError state is one that has been started and is running,
although in a likely degraded state. The collector will call shutdown on
the component (when the collector is shutting down) and we should allow
the status to reflect that.
This PR makes the suggested change and updates the documentation to
reflect that. As this is an internal change, I have not included a
changelog. Also note, we can close #10058 after this as we've already
removed status aggregation from core during the recent component status
refactor.
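The relaxed state machine can be sketched as a transition table; the statuses shown are an illustrative subset, not the actual `componentstatus` API:

```go
package main

import "fmt"

// Status is an illustrative subset of component statuses.
type Status int

const (
	StatusOK Status = iota
	StatusPermanentError
	StatusStopping
	StatusStopped
)

// transitions encodes the allowed moves. The change described above
// adds PermanentError -> Stopping, so a shutdown call is reflected in
// the status even for components that hit a permanent error while
// running. Stopped remains final.
var transitions = map[Status][]Status{
	StatusOK:             {StatusPermanentError, StatusStopping},
	StatusPermanentError: {StatusStopping}, // previously a final state
	StatusStopping:       {StatusStopped},
}

func canTransition(from, to Status) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusPermanentError, StatusStopping)) // true with this change
	fmt.Println(canTransition(StatusStopped, StatusOK))              // false: Stopped is final
}
```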
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #10058
<!--Describe what testing was performed and which tests were added.-->
#### Testing
Unit tests.
<!--Describe the documentation added.-->
#### Documentation
Updated docs/component-status.md and associated diagram.
<!--Please delete paragraphs that you did not use before submitting.-->
Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
Co-authored-by: Antoine Toulme <atoulme@splunk.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Documents relationship between component stability and versioning.
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #11851
## Description
This PR defines observability requirements for components at the
"Stable" stability levels. The goal is to ensure that Collector
pipelines are properly observable, to help in debugging configuration
issues.
#### Approach
- The requirements are deliberately not too specific, in order to be
adaptable to each specific component, and so as to not over-burden
component authors.
- After discussing it with @mx-psi, this list of requirements explicitly
includes things that may end up being emitted automatically as part of
the Pipeline Instrumentation RFC (#11406), with only a note at the
beginning explaining that not everything may need to be implemented
manually.
Feel free to share if you don't think this is the right approach for
these requirements.
#### Link to tracking issue
Resolves #11581
## Important note regarding the Pipeline Instrumentation RFC
I included this paragraph in the part about error count metrics:
> The goal is to be able to easily pinpoint the source of data loss in
the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an
external service, or propagated from downstream Collector components.
The [Pipeline Instrumentation
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
(hereafter abbreviated "PI"), once implemented, should allow monitoring
component errors via the `outcome` attribute, which is either `success`
or `failure`, depending on whether the `Consumer` API call returned an
error.
Note that this does not work for receivers, or allow differentiating
between different types of errors; for that reason, I believe additional
component-specific error metrics will often still be required, but it
would be nice to cover as many cases as possible automatically.
However, at the moment, errors are (usually) propagated upstream through
the chain of `Consume` calls, so in case of error the `failure` state
will end up applied to all components upstream of the actual source of
the error. This means the PI metrics do not fit the first bullet point.
Moreover, I would argue that even post-processing the PI metrics does
not reliably allow distinguishing the ultimate source of errors (the
second bullet point). One simple idea is to compute
`consumed.items{outcome:failure} - produced.items{outcome:failure}` to
get the number of errors originating in a component. But this only works
if output items map one-to-one to input items: if a processor or
connector outputs fewer items than it consumes (because it aggregates
them, or translates to a different signal type), this formula will
return false positives. If these false positives are mixed with real
errors from the component and/or from downstream, the situation becomes
impossible to analyze by just looking at the metrics.
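A small sketch with made-up numbers illustrating the false positive: a hypothetical connector aggregates 10 input items into 1 output item, all failures are propagated from downstream, yet the subtraction formula infers internal errors that do not exist:

```go
package main

import "fmt"

// inferredInternal applies the naive post-processing formula
// consumed.items{outcome:failure} - produced.items{outcome:failure}.
func inferredInternal(consumedFailure, producedFailure int) int {
	return consumedFailure - producedFailure
}

func main() {
	// 5 batches fail downstream. Each batch had 10 input items but
	// only 1 (aggregated) output item. The component itself emitted
	// zero errors.
	consumedFailure := 50 // 5 failed batches x 10 input items
	producedFailure := 5  // the same 5 failed batches, 1 output item each

	fmt.Println(inferredInternal(consumedFailure, producedFailure)) // 45 "internal errors" inferred; actual number is 0
}
```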
For these reasons, I believe we should do one of four things:
1. Change the way we use the `Consumer` API to no longer propagate
errors, making the PI metric outcomes more precise.
We could catch errors in whatever wrapper we already use to emit the PI
metrics, log them for posterity, and simply not propagate them.
Note that some components already more or less do this, such as the
`batchprocessor`, but this option may in principle break components
which rely on downstream errors (for retry purposes for example).
2. Keep propagating errors, but modify or extend the RFC to require
distinguishing between internal and propagated errors (maybe add a third
`outcome` value, or add another attribute).
This could be implemented by somehow propagating additional state from
one `Consume` call to another, allowing us to establish the first
appearance of a given error value in the pipeline.
3. Loosen this requirement so that the PI metrics suffice in their
current state.
4. Leave everything as-is and make component authors implement their own
somewhat redundant error count metrics.
---------
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Provide a caption for a link in the README.
<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #
<!--Describe what testing was performed and which tests were added.-->
#### Testing
<!--Describe the documentation added.-->
#### Documentation
<!--Please delete paragraphs that you did not use before submitting.-->
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
<!-- Issue number if applicable -->
Adds post-release steps including release retro and schedule updating.
#### Link to tracking issue
Fixes #11858
---------
Co-authored-by: Yang Song <songy23@users.noreply.github.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
Fixes #11859