Error Conditions and Reporting
Elafros uses the standard Kubernetes API pattern for reporting configuration errors and the current state of the system by writing the report in the status section. There are two mechanisms commonly used in status:
- Conditions represent true/false statements about the current state of the resource.
- Other fields may provide status on the most recently retrieved state of the system as it relates to the resource (example: number of replicas or traffic assignments).
Both of these mechanisms often include additional data from the controller such as observedGeneration (to determine whether the controller has seen the latest updates to the spec).
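As a rough illustration of how a client might use observedGeneration, the sketch below compares it with metadata.generation. The helper name status_is_current and the dict-based representation of a resource are assumptions made for this example; they are not part of the Elafros API.

```python
def status_is_current(resource: dict) -> bool:
    """Return True when the controller has observed the latest spec.

    Assumes the resource is a plain dict parsed from an API response, with
    metadata.generation and status.observedGeneration present as integers.
    """
    generation = resource.get("metadata", {}).get("generation")
    observed = resource.get("status", {}).get("observedGeneration")
    # A missing observedGeneration is treated as "controller has not reported yet".
    return generation is not None and observed == generation


# Hypothetical resource, shaped like the Configuration examples in this document.
config = {
    "metadata": {"generation": 1234},
    "status": {"observedGeneration": 1234, "conditions": []},
}
print(status_is_current(config))
```

A client could use a check like this to decide whether the conditions it reads describe the spec it just wrote or an earlier one.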
Conditions
Conditions provide an easy mechanism for client user interfaces to indicate the current state of resources to a user. Elafros resources should follow the k8s API conventions for condition and the patterns described in this section.
Elafros condition type
Each resource should define a small number of success conditions as types. This should bias towards fewer than 5 high-level progress categories which are separate and meaningful for customers. For a Revision, these might be BuildSucceeded, ResourcesAvailable and ContainerHealthy.
Where it makes sense, resources should define a top-level "happy state" condition type which indicates that the resource is set up correctly and ready to serve.
- For long-running resources, this condition type should be Ready.
- For objects which run to completion, the condition type should be Succeeded.
Elafros condition status
Each condition's status should be one of:
- Unknown when the controller is actively working to achieve the condition.
- False when the reconciliation has failed. This should be a terminal failure state until user action occurs.
- True when the reconciliation has succeeded. Once all transition conditions have succeeded, the "happy state" condition should be set to True.
Type names should be chosen such that these interpretations are clear:
- BuildSucceeded works because True = success and False = failure.
- BuildCompleted does not, because False could mean "in-progress".
Conditions may also be omitted entirely if reconciliation has been skipped. When all conditions have succeeded, the "happy state" should clear other conditions for output legibility. Until the "happy state" is set, conditions should be persisted for the benefit of UI tools representing progress on the outcome.
Conditions with a status of False will also supply additional details about the failure in the "Reason" and "Message" sections.
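The status semantics above could be consumed by a UI or CLI roughly as follows. This is a minimal sketch, assuming conditions arrive as a list of dicts; the function name describe_condition and the returned labels are invented for illustration.

```python
def describe_condition(conditions: list, cond_type: str) -> str:
    """Map the status of one condition type to a coarse progress label."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            # str() covers YAML parsers that turn True/False into booleans.
            status = str(cond.get("status"))
            if status == "True":
                return "succeeded"
            if status == "False":
                # Terminal until the user acts; reason carries a terse explanation.
                return "failed: {}".format(cond.get("reason", ""))
            return "in progress"
    # Omitted condition: reconciliation was skipped or conditions were cleared.
    return "not reported"


# Hypothetical Revision conditions using the type names suggested above.
conds = [
    {"type": "BuildSucceeded", "status": "True"},
    {"type": "ContainerHealthy", "status": "False", "reason": "ContainerMissing"},
    {"type": "ResourcesAvailable", "status": "Unknown"},
]
for name in ("BuildSucceeded", "ContainerHealthy", "ResourcesAvailable", "Ready"):
    print(name, "->", describe_condition(conds, name))
```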
Elafros condition reason and message
The fields reason and message should be considered to have unlimited cardinality, unlike type and status.
If a resource has a "happy state" type, it will surface the reason and message from the first failing sub-condition.
The values that reason takes on (while CamelCase words) should be treated as opaque. Clients should not act on these values programmatically; instead, treat reason as a terse explanation of the state for end users and message as the long-form version of it.
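The roll-up of sub-conditions into a "happy state" condition might look like the sketch below. It follows the rule stated here (copy reason and message from the first failing sub-condition); the function name rollup_happy_state and the choice to report Unknown while any sub-condition is unresolved are assumptions of this example, not a description of the actual controller.

```python
def rollup_happy_state(sub_conditions: list, happy_type: str = "Ready") -> dict:
    """Derive the top-level condition from a list of sub-condition dicts."""
    for cond in sub_conditions:
        if cond["status"] == "False":
            # Surface the details of the first failing sub-condition.
            return {
                "type": happy_type,
                "status": "False",
                "reason": cond.get("reason", ""),
                "message": cond.get("message", ""),
            }
    if sub_conditions and all(c["status"] == "True" for c in sub_conditions):
        return {"type": happy_type, "status": "True"}
    # Something is still being reconciled (or nothing has been reported yet).
    return {"type": happy_type, "status": "Unknown"}


subs = [
    {"type": "BuildSucceeded", "status": "True"},
    {"type": "ContainerHealthy", "status": "False",
     "reason": "ContainerMissing", "message": "Unable to fetch image"},
]
print(rollup_happy_state(subs))
```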
Example scenarios
Example user and system error scenarios are included below along with how the status is presented to CLI and UI tools via the API.
Deployment-Related Failures
The following scenarios will generally occur when attempting to deploy changes to the software stack by updating the Service or Configuration resources to cause a new Revision to be created.
Revision failed to become Ready
If the latest Revision fails to become Ready for any reason within some reasonable timeframe, the Configuration and Service should signal this with the LatestRevisionReady status, copying the reason and the message from the Ready condition on the Revision.
GET /apis/elafros.dev/v1alpha1/namespaces/default/configurations/my-service
...
status:
  latestReadyRevisionName: abc
  latestCreatedRevisionName: bcd  # Hasn't become "Ready"
  conditions:
  - type: LatestRevisionReady
    status: False
    reason: BuildFailed
    message: "Build Step XYZ failed with error message: $LASTLOGLINE"
GET /apis/elafros.dev/v1alpha1/namespaces/default/services/my-service
...
status:
  latestReadyRevisionName: abc
  latestCreatedRevisionName: bcd  # Hasn't become "Ready"
  conditions:
  - type: Ready
    status: True  # If an earlier version is serving
  - type: LatestRevisionReady
    status: False
    reason: BuildFailed
    message: "Build Step XYZ failed with error message: $LASTLOGLINE"
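The copy from the Revision's Ready condition into LatestRevisionReady could be sketched as below. The function name latest_revision_ready and the dict layout are hypothetical, and the "reasonable timeframe" handling is left out; the sketch only shows the copying of status, reason and message.

```python
def latest_revision_ready(revision: dict) -> dict:
    """Build a LatestRevisionReady condition from a Revision's Ready condition."""
    conditions = revision.get("status", {}).get("conditions", [])
    ready = next((c for c in conditions if c.get("type") == "Ready"), None)
    if ready is None:
        # The Revision has not reported readiness yet.
        return {"type": "LatestRevisionReady", "status": "Unknown"}
    cond = {"type": "LatestRevisionReady", "status": ready["status"]}
    # reason and message are copied verbatim when present.
    for key in ("reason", "message"):
        if key in ready:
            cond[key] = ready[key]
    return cond


revision_bcd = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False", "reason": "BuildFailed",
             "message": "Build Step XYZ failed with error message: $LASTLOGLINE"},
        ]
    }
}
print(latest_revision_ready(revision_bcd))
```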
Build failed
If the Build steps failed while creating a Revision, you can examine the Failed condition on the Build or the BuildSucceeded condition on the Revision (which copies the value from the build referenced by spec.buildName). In addition, the Build resource (but not the Revision) should have a status field to link to the log output of the build.
GET /apis/build.dev/v1alpha1/namespaces/default/builds/build-1acub3
...
status:
  # Link to log stream; could be ELK or Stackdriver, for example
  buildLogsLink: "http://logging.infra.mycompany.com/...?filter=..."
  conditions:
  - type: Failed
    status: True
    reason: BuildStepFailed  # could also be SourceMissing, etc
    message: "Step XYZ failed with error message: $LASTLOGLINE"
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
  conditions:
  - type: Ready
    status: False
    reason: BuildFailed
    message: "Build Step XYZ failed with error message: $LASTLOGLINE"
  - type: BuildSucceeded
    status: False
    reason: BuildStepFailed
    message: "Step XYZ failed with error message: $LASTLOGLINE"
Resource exhausted while creating a revision
Since a Revision is only metadata, the Revision will be created, but will have a condition indicating the underlying failure, possibly indicating the failed underlying resource. In a multitenant environment, the customer might not have access or visibility into the underlying resources in the hosting environment.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
  conditions:
  - type: Ready
    status: False
    reason: NoDeployment
    message: "The controller could not create a deployment named ela-abc-e13ac."
  - type: ResourcesProvisioned
    status: False
    reason: NoDeployment
    message: "The controller could not create a deployment named ela-abc-e13ac."
Container image not present in repository
Revisions might be created while a Build is still creating the container image or uploading it to the repository. If the build is being performed by a CRD in the cluster, the spec.buildName attribute will be set (and see the Build failed example). In other cases when the build is not supplied, the container image referenced might not be present in the registry (either because of a typo or because it was deleted). In this case, the Ready condition will be set to False with a reason of ContainerMissing. This condition could be corrected if the image becomes available at a later time. Elafros could also make a defensive copy of the container image to avoid having to surface this error if the original docker image is deleted.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
  conditions:
  - type: Ready
    status: False
    reason: ContainerMissing
    message: "Unable to fetch image 'gcr.io/...': <literal error>"
  - type: ContainerHealthy
    status: False
    reason: ContainerMissing
    message: "Unable to fetch image 'gcr.io/...': <literal error>"
Container image fails at startup on Revision
Particularly for development cases with interpreted languages like Node or Python, syntax errors might only be caught at container startup time. For this reason, implementations should start a copy of the container on deployment, before marking the container Ready. If this container fails to start, the Ready condition will be set to False, the reason will be set to ExitCode%d with the exit code of the application, and the termination message from the container will be provided. (Containers will be run with the default terminationMessagePath and a terminationMessagePolicy of FallbackToLogsOnError.) Additionally, the Revision status.logUrl should be present, which provides the address of an endpoint which can be used to fetch the logs for the failed process.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
  logUrl: "http://logging.infra.mycompany.com/...?filter=revision_uid=a1e34&..."
  conditions:
  - type: Ready
    status: False
    reason: ExitCode127
    message: "Container failed with: SyntaxError: Unexpected identifier"
  - type: ContainerHealthy
    status: False
    reason: ExitCode127
    message: "Container failed with: SyntaxError: Unexpected identifier"
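To make the ExitCode%d convention concrete, the following sketch builds the ContainerHealthy condition from an exit code and a termination message. The function name startup_failure_condition is hypothetical, and the "Container failed with:" prefix is simply copied from the example above rather than taken from any specification.

```python
def startup_failure_condition(exit_code: int, termination_message: str) -> dict:
    """Build a failing ContainerHealthy condition for a container that exited."""
    return {
        "type": "ContainerHealthy",
        "status": "False",
        # The reason embeds the numeric exit code, e.g. ExitCode127.
        "reason": "ExitCode%d" % exit_code,
        "message": "Container failed with: " + termination_message,
    }


cond = startup_failure_condition(127, "SyntaxError: Unexpected identifier")
print(cond["reason"])
print(cond["message"])
```

Because reason values are meant to be opaque, a client would display ExitCode127 as is instead of parsing the number back out.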
Deployment progressing slowly/stuck
See the Kubernetes documentation for how this is handled for Deployments. For Revisions, we will start by assuming a single timeout for deployment (rather than configurable), and report that the Revision was not Ready, with a reason ProgressDeadlineExceeded. Note that we will only report ProgressDeadlineExceeded if we could not determine another reason (such as quota failures, missing build, or container execution failures). Since container setup time also affects the ability of 0 to 1 autoscaling, the Ready failure with ProgressDeadlineExceeded reason should be considered a terminal condition, even if Kubernetes might attempt to make progress even after the deadline.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
  conditions:
  - type: Ready
    status: False
    reason: ProgressDeadlineExceeded
    message: "Did not pass readiness checks in 120 seconds."
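The precedence rule described above (a more specific reason wins, and ProgressDeadlineExceeded is only a fallback) can be sketched as follows. The function name ready_after_timeout and its parameters are assumptions for illustration; how a controller would actually detect the more specific reason is outside this sketch.

```python
from typing import Optional


def ready_after_timeout(elapsed_seconds: float,
                        deadline_seconds: float,
                        other_reason: Optional[str] = None,
                        other_message: str = "") -> dict:
    """Compute the Ready condition for a Revision that is still deploying."""
    if other_reason:
        # A specific failure (quota, missing build, container exit) takes priority.
        return {"type": "Ready", "status": "False",
                "reason": other_reason, "message": other_message}
    if elapsed_seconds < deadline_seconds:
        # Still within the single, non-configurable deployment timeout.
        return {"type": "Ready", "status": "Unknown"}
    return {
        "type": "Ready",
        "status": "False",
        "reason": "ProgressDeadlineExceeded",
        "message": "Did not pass readiness checks in %d seconds." % deadline_seconds,
    }


print(ready_after_timeout(30, 120))
print(ready_after_timeout(121, 120))
```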
Routing-Related Failures
The following scenarios are most likely to occur when attempting to roll out a change by shifting traffic to a new Revision. Some of these conditions can also occur under normal operations due to (for example) operator error causing live resources to be deleted.
Traffic not assigned
If some percentage of traffic cannot be assigned to a live (materialized or scaled-to-zero) Revision, the Route will report the Ready condition as False. The Service will mirror this status in its Ready condition. For example, for a newly-created Service where the first Revision is unable to serve:
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/my-service
...
status:
  domain: my-service.default.mydomain.com
  traffic:
  - revisionName: "Not found"
    percent: 100
  conditions:
  - type: Ready
    status: False
    reason: RevisionMissing
    message: "The configuration 'abc' does not have a LatestReadyRevision."
GET /apis/elafros.dev/v1alpha1/namespaces/default/services/my-service
...
status:
  latestCreatedRevisionName: abc
  # no latestReadyRevisionName, because abc failed
  domain: my-service.default.mydomain.com
  conditions:
  - type: Ready
    status: False
    reason: RevisionMissing
    message: "The configuration 'abc' does not have a LatestReadyRevision."
  - type: LatestRevisionReady
    status: False
    reason: ExitCode127
    message: "Container failed with: SyntaxError: Unexpected identifier"
Revision not found by Route
If a Revision is referenced in a Route's spec.traffic, and the Revision cannot be found, the AllTrafficAssigned condition will be marked as False with a reason of RevisionMissing, and the Revision will be omitted from the Route's status.traffic.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/my-service
...
status:
  traffic:
  - revisionName: abc
    name: current
    percent: 100
  conditions:
  - type: Ready
    status: False
    reason: RevisionMissing
    message: "Revision 'qyzz' referenced in traffic not found"
  - type: AllTrafficAssigned
    status: False
    reason: RevisionMissing
    message: "Revision 'qyzz' referenced in traffic not found"
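One way to picture the rule above is the sketch below, which filters spec.traffic against the set of Revisions that exist and produces the AllTrafficAssigned condition. The function name resolve_traffic is hypothetical, and the sketch deliberately does not redistribute the percentages of omitted targets, since this document does not say how a real controller handles that.

```python
def resolve_traffic(spec_traffic: list, existing_revisions: set) -> dict:
    """Compute status.traffic and AllTrafficAssigned for a Route (simplified)."""
    traffic = []
    missing = []
    for target in spec_traffic:
        name = target["revisionName"]
        if name in existing_revisions:
            traffic.append(dict(target))
        else:
            # Missing Revisions are omitted from status.traffic.
            missing.append(name)
    if missing:
        condition = {
            "type": "AllTrafficAssigned",
            "status": "False",
            "reason": "RevisionMissing",
            "message": "Revision '%s' referenced in traffic not found" % missing[0],
        }
    else:
        condition = {"type": "AllTrafficAssigned", "status": "True"}
    return {"traffic": traffic, "conditions": [condition]}


spec = [
    {"revisionName": "abc", "percent": 50},
    {"revisionName": "qyzz", "percent": 50},
]
print(resolve_traffic(spec, {"abc"}))
```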
Configuration not found by Route
If a Route references the latestReadyRevisionName of a Configuration and the Configuration cannot be found, the AllTrafficAssigned condition will be marked as False with a reason of ConfigurationMissing, and the Revision will be omitted from the Route's status.traffic.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/my-service
...
status:
  traffic: []
  conditions:
  - type: Ready
    status: False
    reason: ConfigurationMissing
    message: "Configuration 'abc' referenced in traffic not found"
  - type: AllTrafficAssigned
    status: False
    reason: ConfigurationMissing
    message: "Configuration 'abc' referenced in traffic not found"
Latest Revision of a Configuration deleted
If the most recent Revision is deleted, the Configuration will set LatestRevisionReady to False.
If the deleted Revision was also the most recent to become ready, the Configuration will also clear the latestReadyRevisionName. Additionally, if the Configuration in this case is referenced by a Route, the Route will set the AllTrafficAssigned condition to False with reason RevisionMissing, as above.
GET /apis/elafros.dev/v1alpha1/namespaces/default/configurations/my-service
...
metadata:
  generation: 1234  # only updated when spec changes
  ...
spec:
  ...
status:
  latestCreatedRevisionName: abc
  conditions:
  - type: LatestRevisionReady
    status: False
    reason: RevisionMissing
    message: "The latest Revision appears to have been deleted."
  observedGeneration: 1234
Traffic shift progressing slowly/stuck
Similar to deployment slowness, if the transfer of traffic (either via gradual or abrupt rollout) takes longer than a certain timeout to complete/update, the RolloutInProgress condition will remain at True, but the reason will be set to ProgressDeadlineExceeded.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/my-service
...
status:
  traffic:
  - revisionName: abc
    percent: 75
  - revisionName: def
    percent: 25
  conditions:
  - type: Ready
    status: False
    reason: ProgressDeadlineExceeded
    # reason is a short status, message provides error details
    message: "Unable to update traffic split for more than 120 seconds."