Merge pull request #33536 from alculquicondor/job-failures

Accurate explanation for the calculation of number of failures in Job
2022-06-15 21:42:48 -07:00 · 2022-06-15 21:42:48 -07:00 · b6e815beb4
parent a7c6e4d61d 63272f9368
commit b6e815beb4
1 changed files with 14 additions and 4 deletions
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@ -253,9 +253,19 @@ due to a logical error in configuration etc.
 To do so, set `.spec.backoffLimit` to specify the number of retries before
 considering a Job as failed. The back-off limit is set by default to 6. Failed
 Pods associated with the Job are recreated by the Job controller with an
-exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The
-back-off count is reset when a Job's Pod is deleted or successful without any
-other Pods for the Job failing around that time.
+exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.
+
+The number of retries is calculated in two ways:
+- The number of Pods with `.status.phase = "Failed"`.
+- When using `restartPolicy = "OnFailure"`, the number of retries in all the
+  containers of Pods with `.status.phase` equal to `Pending` or `Running`.
+
+If either of the calculations reaches the `.spec.backoffLimit`, the Job is
+considered failed.
+
+When the [`JobTrackingWithFinalizers`](#job-tracking-with-finalizers) feature is
+disabled, the number of failed Pods is only based on Pods that are still present
+in the API.

 {{< note >}}
 If your job has `restartPolicy = "OnFailure"`, keep in mind that your Pod running the Job