10 KiB
Error Conditions and Reporting
Elafros uses the standard Kubernetes API pattern for reporting
configuration errors and current state of the system by writing the
report in the status
section. There are two mechanisms commonly used
in status:
-
conditions represent true/false statements about the current state of the resource.
-
other fields may provide status on the most recently retrieved state of the system as it relates to the resource (example: number of replicas or traffic assignments).
Both of these mechanisms often include additional data from the
controller such as observedGeneration
(to determine whether the
controller has seen the latest updates to the spec). Example user and
system error scenarios are included below along with how the status is
presented to CLI and UI tools via the API.
- Revision failed to become Ready
- Build failed
- Revision not found by Route
- Configuration not found by Route
- Latest Revision of a Configuration deleted
- Resource exhausted while creating a revision
- Deployment progressing slowly/stuck
- Traffic shift progressing slowly/stuck
- Container image not present in repository
- Container image fails at startup on Revision
Revision failed to become Ready
If the latest Revision fails to become Ready
for any reason within some reasonable
timeframe, the Configuration should signal this
with the LatestRevisionReady
status, copying the reason and the message
from the Ready
condition on the Revision.
...
status:
latestReadyRevisionName: abc
latestCreatedRevisionName: bcd # Hasn't become "Ready"
conditions:
- type: LatestRevisionReady
status: False
reason: ContainerMissing
message: "Unable to start because container is missing and build failed."
Build failed
If the Build steps failed while creating a Revision, you can examine
the Failed
condition on the Build or the BuildFailed
condition on
the Revision (which copies the value from the build referenced by
spec.buildName
). In addition, the Build resource (but not the
Revision) should have a status field to link to the log output of the
build.
GET /apis/build.dev/v1alpha1/namespaces/default/builds/build-1acub3
...
status:
# Link to log stream; could be ELK or Stackdriver, for example
buildLogsLink: "http://logging.infra.mycompany.com/...?filter=..."
conditions:
- type: Failed
status: True
reason: BuildStepFailed # could also be SourceMissing, etc
# reason is a short status, message provides error details
message: "Step XYZ failed with error message: $LASTLOGLINE"
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
conditions:
- type: Ready
status: False
reason: ContainerMissing
message: "Unable to start because container is missing and build failed."
- type: BuildFailed
status: True
reason: BuildStepFailed
# reason is a short status, message provides error details
message: "Step XYZ failed with error message: $LASTLOGLINE"
Revision not found by Route
If a Revision is referenced in the Route's spec.rollout.traffic
, the
corresponding entry in the status.traffic
list will be set to "Not
found", and the TrafficDropped
condition will be marked as True,
with a reason of RevisionMissing
.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/abc
...
status:
traffic:
- revisionName: abc
name: current
percent: 100
- revisionName: "Not found"
name: next
percent: 0
conditions:
- type: RolloutInProgress
status: False
- type: TrafficDropped
status: True
reason: RevisionMissing
# reason is a short status, message provides error details
message: "Revision 'qyzz' referenced in rollout.traffic not found"
Configuration not found by Route
If a Route references the latestReadyRevisionName
of a Configuration
and the Configuration cannot be found, the corresponding entry in
status.traffic
list will be set to "Not found", and the
TrafficDropped
condition will be marked as True with a reason of
ConfigurationMissing
.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/abc
...
status:
traffic:
- revisionName: "Not found"
percent: 100
conditions:
- type: RolloutInProgress
status: False
- type: TrafficDropped
status: True
reason: ConfigurationMissing
# reason is a short status, message provides error details
message: "Revision 'my-service' referenced in rollout.traffic not found"
Latest Revision of a Configuration deleted
If the most recent (or most recently ready) Revision is deleted, the
Configuration will clear the latestReadyRevisionName
. If the
Configuration is referenced by a Route, the Route will set the
TrafficDropped
condition with reason RevisionMissing
, as above.
GET /apis/elafros.dev/v1alpha1/namespaces/default/configurations/my-service
...
metadata:
generation: 1234 # only updated when spec changes
...
spec:
...
status:
latestCreatedRevision: abc
conditions:
- type: LatestRevisionReady
status: False
reason: RevisionMissing
message: "The latest Revision appears to have been deleted."
observedGeneration: 1234
Resource exhausted while creating a revision
Since a Revision is only metadata, the Revision will be created, but will have a condition indicating the underlying failure, possibly indicating the failed underlying resource. In a multitenant environment, the customer might not have have access or visibility into the underlying resources in the hosting environment.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
conditions:
- type: Ready
status: False
reason: NoDeployment
message: "The controller could not create a deployment named ela-abc-e13ac."
Deployment progressing slowly/stuck
See
the kubernetes documentation for how this is handled for Deployments. For
Revisions, we will start by assuming a single timeout for deployment
(rather than configurable), and report that the Revision was not
Ready, with a reason ProgressDeadlineExceeded
. Note that we will
only report ProgressDeadlineExceeded
if we could not determine
another reason (such as quota failures, missing build, or container
execution failures).
Kubernetes controllers will continue attempting to make progress
(possibly at a less-aggressive rate) when they encounter a case where
the desired status cannot match the actual status, so if the
underlying deployment is slow, it might eventually finish after
reporting ProgressDeadlineExceeded
.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
conditions:
- type: Ready
status: False
reason: ProgressDeadlineExceeded
message: "Unable to create pods for more than 120 seconds."
Traffic shift progressing slowly/stuck
Similar to deployment slowness, if the transfer of traffic (either via
gradual or abrupt rollout) takes longer than a certain timeout to
complete/update, the RolloutInProgress
condition will remain at
True, but the reason will be set to ProgressDeadlineExceeded
.
GET /apis/elafros.dev/v1alpha1/namespaces/default/routes/abc
...
status:
traffic:
- revisionName: abc
percent: 75
- revisionName: def
percent: 25
conditions:
- type: RolloutInProgress
status: True
reason: ProgressDeadlineExceeded
# reason is a short status, message provides error details
message: "Unable to update traffic split for more than 120 seconds."
Container image not present in repository
Revisions might be created while a Build is still creating the container image or uploading it to the repository. If the build is being performed by a CRD in the cluster, the spec.buildName attribute will be set (and see the Build failed example). In other cases when the build is not supplied, the container image referenced might not be present in the registry (either because of a typo or because it was deleted). In this case, the Ready condition will be set to False with a reason of ContainerMissing. This condition could be corrected if the image becomes available at a later time. We can also make a defensive copy of the container image to avoid this error due to deleted source container.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
conditions:
- type: Ready
status: False
reason: ContainerMissing
message: "Unable to fetch image 'gcr.io/...': <literal error>"
- type: Failed
status: True
reason: ContainerMissing
message: "Unable to fetch image 'gcr.io/...': <literal error>"
Container image fails at startup on Revision
Particularly for development cases with interpreted languages like
Node or Python, syntax errors or the like might only be caught at
container startup time. For this reason, implementations may choose to
start a single copy of the container on deployment, before making the
container Ready. If the initial container fails to start, the Ready
condition will be set to False and the reason will be set to
ExitCode:%d
with the exit code of the application, and the last line
of output in the message. Additionally, the Revision will include a
logsUrl
which provides the address of an endpoint which can be used to
fetch the logs for the failed process.
GET /apis/elafros.dev/v1alpha1/namespaces/default/revisions/abc
...
status:
logUrl: "http://logging.infra.mycompany.com/...?filter=revision=abc&..."
conditions:
- type: Ready
status: False
reason: ExitCode:127
message: "Container failed with: SyntaxError: Unexpected identifier"