[service/internal] Allow components to transition from PermanentError to Stopping (#10958)
#### Description In #10058 I mentioned: > There is a tangentially related issue with PermanentErrors and the underlying finite state machine that governs transitions between statuses. Currently, a PermanentError is a final state. That is, once a component enters this state, no further transitions are allowed. In light of the work I did on the alternative health check extension, I believe we should allow a transition from PermanentError to Stopping to consistently prioritize lifecycle events for components. This transition also make sense from a practical perspective. A component in a PermanentError state is one that has been started and is running, although in a likely degraded state. The collector will call shutdown on the component (when the collector is shutting down) and we should allow the status to reflect that. This PR makes the suggested change and updates the documentation to reflect that. As this is an internal change, I have not included a changelog. Also note, we can close #10058 after this as we've already removed status aggregation from core during the recent component status refactor. <!-- Issue number if applicable --> #### Link to tracking issue Fixes #10058 <!--Describe what testing was performed and which tests were added.--> #### Testing units <!--Describe the documentation added.--> #### Documentation Updated docs/component-status.md and associated diagram. <!--Please delete paragraphs that you did not use before submitting.--> Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com> Co-authored-by: Antoine Toulme <atoulme@splunk.com>
This commit is contained in:
parent
4fc50a85ed
commit
58a5ffc888
|
|
@ -36,7 +36,7 @@ There is a finite state machine underlying the status reporting API that governs
|
|||
|
||||

|
||||
|
||||
The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError and FatalError are permanent runtime states. A component in these states cannot make any further state transitions.
|
||||
The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError is a permanent runtime state. A component in a PermanentError state cannot transtion to OK or RecoverableError, but it can transition to Stopping. FatalError is a final state. A component in a FatalError state cannot make any further state transitions.
|
||||
|
||||

|
||||
|
||||
|
|
@ -61,6 +61,7 @@ Under most circumstances, a component does not need to report explicit status du
|
|||
**Runtime**
|
||||
|
||||

|
||||
|
||||
During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.
|
||||
|
||||
We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component type basis and we'll update this document as we make decisions. See [this issue](https://github.com/open-telemetry/opentelemetry-collector/issues/9957) for current thoughts and discussions.
|
||||
|
|
|
|||
Binary file not shown.
|
Before Width: | Height: | Size: 82 KiB After Width: | Height: | Size: 101 KiB |
|
|
@ -351,6 +351,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
|
|||
instanceIDs[eStErr]: {
|
||||
componentstatus.NewEvent(componentstatus.StatusStarting),
|
||||
componentstatus.NewPermanentErrorEvent(assert.AnError),
|
||||
componentstatus.NewEvent(componentstatus.StatusStopping),
|
||||
componentstatus.NewEvent(componentstatus.StatusStopped),
|
||||
},
|
||||
},
|
||||
startupErr: assert.AnError,
|
||||
|
|
@ -362,6 +364,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
|
|||
instanceIDs[rStErr]: {
|
||||
componentstatus.NewEvent(componentstatus.StatusStarting),
|
||||
componentstatus.NewPermanentErrorEvent(assert.AnError),
|
||||
componentstatus.NewEvent(componentstatus.StatusStopping),
|
||||
componentstatus.NewEvent(componentstatus.StatusStopped),
|
||||
},
|
||||
instanceIDs[eNoErr]: {
|
||||
componentstatus.NewEvent(componentstatus.StatusStarting),
|
||||
|
|
|
|||
|
|
@ -70,8 +70,10 @@ func newFSM(onTransition onTransitionFunc) *fsm {
|
|||
componentstatus.StatusFatalError: {},
|
||||
componentstatus.StatusStopping: {},
|
||||
},
|
||||
componentstatus.StatusPermanentError: {},
|
||||
componentstatus.StatusFatalError: {},
|
||||
componentstatus.StatusPermanentError: {
|
||||
componentstatus.StatusStopping: {},
|
||||
},
|
||||
componentstatus.StatusFatalError: {},
|
||||
componentstatus.StatusStopping: {
|
||||
componentstatus.StatusRecoverableError: {},
|
||||
componentstatus.StatusPermanentError: {},
|
||||
|
|
|
|||
|
|
@ -75,17 +75,19 @@ func TestStatusFSM(t *testing.T) {
|
|||
expectedErrorCount: 2,
|
||||
},
|
||||
{
|
||||
name: "PermanentError is terminal",
|
||||
name: "PermanentError is stoppable",
|
||||
reportedStatuses: []componentstatus.Status{
|
||||
componentstatus.StatusStarting,
|
||||
componentstatus.StatusOK,
|
||||
componentstatus.StatusPermanentError,
|
||||
componentstatus.StatusOK,
|
||||
componentstatus.StatusStopping,
|
||||
},
|
||||
expectedStatuses: []componentstatus.Status{
|
||||
componentstatus.StatusStarting,
|
||||
componentstatus.StatusOK,
|
||||
componentstatus.StatusPermanentError,
|
||||
componentstatus.StatusStopping,
|
||||
},
|
||||
expectedErrorCount: 1,
|
||||
},
|
||||
|
|
@ -154,7 +156,7 @@ func TestValidSeqsToStopped(t *testing.T) {
|
|||
}
|
||||
|
||||
for _, ev := range events {
|
||||
name := fmt.Sprintf("transition from: %s to: %s invalid", ev.Status(), componentstatus.StatusStopped)
|
||||
name := fmt.Sprintf("transition from: %s to: %s", ev.Status(), componentstatus.StatusStopped)
|
||||
t.Run(name, func(t *testing.T) {
|
||||
fsm := newFSM(func(*componentstatus.Event) {})
|
||||
if ev.Status() != componentstatus.StatusStarting {
|
||||
|
|
@ -165,9 +167,9 @@ func TestValidSeqsToStopped(t *testing.T) {
|
|||
err := fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopped))
|
||||
require.ErrorIs(t, err, errInvalidStateTransition)
|
||||
|
||||
// stopping -> stopped is allowed for non-fatal, non-permanent errors
|
||||
// stopping -> stopped is allowed for non-fatal errors
|
||||
err = fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopping))
|
||||
if ev.Status() == componentstatus.StatusPermanentError || ev.Status() == componentstatus.StatusFatalError {
|
||||
if ev.Status() == componentstatus.StatusFatalError {
|
||||
require.ErrorIs(t, err, errInvalidStateTransition)
|
||||
} else {
|
||||
require.NoError(t, err)
|
||||
|
|
|
|||
Loading…
Reference in New Issue