moving 'Pod failure policy' under 'Handling Pod and container failures' (#40999)
parent 10b1aa817e
commit 1793e17a74
@@ -358,6 +358,100 @@ will be terminated once the job backoff limit has been reached. This can make de
from failed Jobs is not lost inadvertently.
{{< /note >}}

### Pod failure policy {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.26" state="beta" >}}

{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate so that you can detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes {{< skew currentVersion >}}.
{{< /note >}}
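
For example, one way to turn on both feature gates in a local test cluster created with
[kind](https://kind.sigs.k8s.io/) is a cluster configuration like the following. This is a minimal
sketch for a kind-based setup; how you enable feature gates depends on how your cluster is deployed:

```yaml
# Illustrative kind cluster configuration that enables both feature gates.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobPodFailurePolicy: true
  PodDisruptionConditions: true
```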

A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.

In some situations, you may want to have better control over handling Pod
failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
  that they don't count towards the `.spec.backoffLimit` limit of retries.

You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a `podFailurePolicy`:

{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
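
A sketch of such a manifest follows; the Job name, container image, and the
`completions`/`parallelism` values are illustrative and may differ from the example file:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example   # illustrative name
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never               # required when using a Pod failure policy
      containers:
      - name: main
        image: docker.io/library/bash:5  # illustrative image
        command: ["bash"]
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob            # fail the entire Job ...
      onExitCodes:
        containerName: main      # ... when the 'main' container ...
        operator: In
        values: [42]             # ... exits with code 42
    - action: Ignore             # don't count disruptions towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
```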

In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with the 42 exit
code. The following are the rules for the `main` container specifically:

- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code means that the container failed, and hence the entire
  Pod. The Pod will be re-created if the total number of restarts is
  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.

{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}

The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with the condition `DisruptionTarget`, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.

{{< note >}}
If the Job fails, either because of the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}

These are some requirements and semantics of the API (a short illustrative fragment follows the list):
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
  are ignored. When no rule matches the Pod failure, the default
  handling applies.
- you may want to restrict a rule to a specific container by specifying its name
  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
  applies to all containers. When specified, it should match one of the container
  or `initContainer` names in the Pod template.
- you may specify the action taken when a Pod failure policy is matched by
  `spec.podFailurePolicy.rules[*].action`. Possible values are:
  - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
    all running Pods should be terminated.
  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
    should not be incremented and a replacement Pod should be created.
  - `Count`: use to indicate that the Pod should be handled in the default way.
    The counter towards the `.spec.backoffLimit` should be incremented.
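
To illustrate these fields, here is a sketch of a `.spec.podFailurePolicy` fragment; the container
name and the exit codes are made up for this example:

```yaml
podFailurePolicy:
  rules:
  # A rule restricted to one container by name.
  - action: FailJob
    onExitCodes:
      containerName: main   # must match a container or initContainer name in the Pod template
      operator: In          # In or NotIn
      values: [42]
  # A rule without containerName applies to all containers in the Pod.
  - action: Count           # default handling: increment the .spec.backoffLimit counter
    onExitCodes:
      operator: In
      values: [137]
  # Ignore Pod disruptions so they don't count towards .spec.backoffLimit.
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
```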

{{< note >}}
When you use a `podFailurePolicy`, the job controller only matches Pods in the
`Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
(`Failed` or `Succeeded`) are considered still terminating. This implies that
terminating Pods retain a [tracking finalizer](#job-tracking-with-finalizers)
until they reach a terminal phase.
Since Kubernetes 1.27, the kubelet transitions deleted Pods to a terminal phase
(see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
ensures that deleted Pods have their finalizers removed by the Job controller.
{{< /note >}}
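
For example, a terminating Pod that the Job controller still tracks could carry a fragment like this
in its metadata; the Pod name and timestamp are illustrative, while `batch.kubernetes.io/job-tracking`
is the finalizer used by the Job controller:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-pod-failure-policy-example-xp9qt   # illustrative Pod name
  deletionTimestamp: "2023-06-01T12:00:00Z"    # the Pod has been marked for deletion
  finalizers:
  - batch.kubernetes.io/job-tracking           # removed by the Job controller once
                                               # the Pod reaches a terminal phase
status:
  phase: Running                               # not yet terminal (Failed or Succeeded)
```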
## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.

@@ -725,100 +819,6 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.26" state="stable" >}}