Merge pull request #33536 from alculquicondor/job-failures

Accurate explanation for the calculation of number of failures in Job
Kubernetes Prow Robot 2022-06-15 21:42:48 -07:00 committed by GitHub
commit b6e815beb4
1 changed files with 14 additions and 4 deletions

@@ -253,9 +253,19 @@ due to a logical error in configuration etc.
To do so, set `.spec.backoffLimit` to specify the number of retries before
considering a Job as failed. The back-off limit is set by default to 6. Failed
Pods associated with the Job are recreated by the Job controller with an
-exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The
-back-off count is reset when a Job's Pod is deleted or successful without any
-other Pods for the Job failing around that time.
+exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.
+The number of retries is calculated in two ways:
+- The number of Pods with `.status.phase = "Failed"`.
+- When using `restartPolicy = "OnFailure"`, the number of retries in all the
+containers of Pods with `.status.phase` equal to `Pending` or `Running`.
+If either of the calculations reaches the `.spec.backoffLimit`, the Job is
+considered failed.
+When the [`JobTrackingWithFinalizers`](#job-tracking-with-finalizers) feature is
+disabled, the number of failed Pods is only based on Pods that are still present
+in the API.
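
The two calculations described in the new text can be sketched as follows. This is only an illustrative model of the wording in this change, not the Job controller's actual code: the `Pod` class, its `container_restart_counts` field, and the function `job_exceeds_backoff_limit` are hypothetical names, and the `>=` comparison is a literal reading of "reaches" that may not match the controller's exact check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pod:
    """Hypothetical, simplified stand-in for a Pod's status."""
    phase: str  # "Pending", "Running", "Succeeded", or "Failed"
    # Assumed: one restart count per container in the Pod.
    container_restart_counts: List[int] = field(default_factory=list)


def job_exceeds_backoff_limit(
    pods: List[Pod],
    backoff_limit: int = 6,          # documented default for .spec.backoffLimit
    restart_policy: str = "Never",
) -> bool:
    """Return True if either retry calculation reaches the back-off limit."""
    # Calculation 1: number of Pods whose phase is "Failed".
    failed_pods = sum(1 for pod in pods if pod.phase == "Failed")

    # Calculation 2: with restartPolicy "OnFailure", container retries
    # summed over all Pods that are still Pending or Running.
    container_retries = 0
    if restart_policy == "OnFailure":
        container_retries = sum(
            sum(pod.container_restart_counts)
            for pod in pods
            if pod.phase in ("Pending", "Running")
        )

    # Literal reading of "reaches": assumed to mean greater than or equal.
    return failed_pods >= backoff_limit or container_retries >= backoff_limit
```

Under these assumptions, six `Failed` Pods trip the default limit regardless of restart policy, while a single `Running` Pod whose containers restarted six times in total trips it only when the restart policy is `OnFailure`.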
{{< note >}}
If your job has `restartPolicy = "OnFailure"`, keep in mind that your Pod running the Job