[service/internal] Allow components to transition from PermanentError to Stopping (#10958)

#### Description
In #10058 I mentioned:

> There is a tangentially related issue with PermanentErrors and the
underlying finite state machine that governs transitions between
statuses. Currently, a PermanentError is a final state. That is, once a
component enters this state, no further transitions are allowed. In
light of the work I did on the alternative health check extension, I
believe we should allow a transition from PermanentError to Stopping to
consistently prioritize lifecycle events for components. This transition
also make sense from a practical perspective. A component in a
PermanentError state is one that has been started and is running,
although in a likely degraded state. The collector will call shutdown on
the component (when the collector is shutting down) and we should allow
the status to reflect that.

This PR makes the suggested change and updates the documentation to
reflect that. As this is an internal change, I have not included a
changelog. Also note, we can close #10058 after this as we've already
removed status aggregation from core during the recent component status
refactor.

<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #10058

<!--Describe what testing was performed and which tests were added.-->
#### Testing
units

<!--Describe the documentation added.-->
#### Documentation
Updated docs/component-status.md and associated diagram.

<!--Please delete paragraphs that you did not use before submitting.-->

Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
Co-authored-by: Antoine Toulme <atoulme@splunk.com>
This commit is contained in:
Matthew Wear 2024-12-17 05:02:44 -08:00 committed by GitHub
parent 4fc50a85ed
commit 58a5ffc888
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 16 additions and 7 deletions

View File

@ -36,7 +36,7 @@ There is a finite state machine underlying the status reporting API that governs
![State Diagram](img/component-status-state-diagram.png)
The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError and FatalError are permanent runtime states. A component in these states cannot make any further state transitions.
The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError is a permanent runtime state. A component in a PermanentError state cannot transtion to OK or RecoverableError, but it can transition to Stopping. FatalError is a final state. A component in a FatalError state cannot make any further state transitions.
![Status Event Generation](img/component-status-event-generation.png)
@ -61,6 +61,7 @@ Under most circumstances, a component does not need to report explicit status du
**Runtime**
![Runtime State Diagram](img/component-status-runtime-states.png)
During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.
We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component type basis and we'll update this document as we make decisions. See [this issue](https://github.com/open-telemetry/opentelemetry-collector/issues/9957) for current thoughts and discussions.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 82 KiB

After

Width:  |  Height:  |  Size: 101 KiB

View File

@ -351,6 +351,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
instanceIDs[eStErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),
componentstatus.NewPermanentErrorEvent(assert.AnError),
componentstatus.NewEvent(componentstatus.StatusStopping),
componentstatus.NewEvent(componentstatus.StatusStopped),
},
},
startupErr: assert.AnError,
@ -362,6 +364,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
instanceIDs[rStErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),
componentstatus.NewPermanentErrorEvent(assert.AnError),
componentstatus.NewEvent(componentstatus.StatusStopping),
componentstatus.NewEvent(componentstatus.StatusStopped),
},
instanceIDs[eNoErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),

View File

@ -70,8 +70,10 @@ func newFSM(onTransition onTransitionFunc) *fsm {
componentstatus.StatusFatalError: {},
componentstatus.StatusStopping: {},
},
componentstatus.StatusPermanentError: {},
componentstatus.StatusFatalError: {},
componentstatus.StatusPermanentError: {
componentstatus.StatusStopping: {},
},
componentstatus.StatusFatalError: {},
componentstatus.StatusStopping: {
componentstatus.StatusRecoverableError: {},
componentstatus.StatusPermanentError: {},

View File

@ -75,17 +75,19 @@ func TestStatusFSM(t *testing.T) {
expectedErrorCount: 2,
},
{
name: "PermanentError is terminal",
name: "PermanentError is stoppable",
reportedStatuses: []componentstatus.Status{
componentstatus.StatusStarting,
componentstatus.StatusOK,
componentstatus.StatusPermanentError,
componentstatus.StatusOK,
componentstatus.StatusStopping,
},
expectedStatuses: []componentstatus.Status{
componentstatus.StatusStarting,
componentstatus.StatusOK,
componentstatus.StatusPermanentError,
componentstatus.StatusStopping,
},
expectedErrorCount: 1,
},
@ -154,7 +156,7 @@ func TestValidSeqsToStopped(t *testing.T) {
}
for _, ev := range events {
name := fmt.Sprintf("transition from: %s to: %s invalid", ev.Status(), componentstatus.StatusStopped)
name := fmt.Sprintf("transition from: %s to: %s", ev.Status(), componentstatus.StatusStopped)
t.Run(name, func(t *testing.T) {
fsm := newFSM(func(*componentstatus.Event) {})
if ev.Status() != componentstatus.StatusStarting {
@ -165,9 +167,9 @@ func TestValidSeqsToStopped(t *testing.T) {
err := fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopped))
require.ErrorIs(t, err, errInvalidStateTransition)
// stopping -> stopped is allowed for non-fatal, non-permanent errors
// stopping -> stopped is allowed for non-fatal errors
err = fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopping))
if ev.Status() == componentstatus.StatusPermanentError || ev.Status() == componentstatus.StatusFatalError {
if ev.Status() == componentstatus.StatusFatalError {
require.ErrorIs(t, err, errInvalidStateTransition)
} else {
require.NoError(t, err)