---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
This document shows you how to use the Pod failure policy, in combination with the default Pod backoff failure policy, to improve control over the handling of container- or Pod-level failures within a {{<glossary_tooltip text="Job" term_id="job">}}.
The definition of Pod failure policy may help you to:
- better utilize the computational resources by avoiding unnecessary Pod retries.
- avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}}, {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}} or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
{{% heading "prerequisites" %}}
You should already be familiar with the basic use of Job.
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
{{< note >}}
As the features are in alpha, prepare your Kubernetes cluster with the two
feature gates enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}
## Using Pod failure policy to avoid unnecessary Pod retries
With the following example, you can learn how to use Pod failure policy to avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable software bug.
First, create a Job based on the config:
{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}
by running:

```shell
kubectl create -f job-pod-failure-policy-failjob.yaml
```
After around 30s the entire Job should be terminated. Inspect the status of the Job by running:

```shell
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```
In the Job status, you can see a Job `Failed` condition with the `reason` field
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:

```
Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0.
```
For comparison, if the Pod failure policy were disabled, it would take 6 retries of the Pod, taking at least 2 minutes.
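The matching rule can be sketched like this (this is a sketch, not the verbatim contents of the referenced manifest; the image and command are assumptions, while the container name `main`, the exit code 42, the `FailJob` action at index 0, and the 6-retry backoff come from the output above):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  backoffLimit: 6          # without a Pod failure policy, up to 6 retries
  podFailurePolicy:
    rules:
    - action: FailJob      # rule at index 0: terminate the Job immediately
      onExitCodes:
        containerName: main
        operator: In
        values: [42]       # exit code indicating a non-retriable software bug
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5   # assumed image
        command: ["bash"]                 # assumed command simulating the bug
        args:
        - -c
        - sleep 30 && exit 42
```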
### Cleaning up
Delete the Job you created:

```shell
kubectl delete jobs/job-pod-failure-policy-failjob
```
The cluster automatically cleans up the Pods.
## Using Pod failure policy to ignore Pod disruptions
With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.
{{< caution >}}
Timing is important for this example, so you may want to read the steps before
executing them. In order to trigger a Pod disruption it is important to drain
the node while the Pod is running on it (within 90s of the Pod being scheduled).
{{< /caution >}}
- Create a Job based on the config:

  {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

  by running:

  ```shell
  kubectl create -f job-pod-failure-policy-ignore.yaml
  ```

- Run this command to check the `nodeName` the Pod is scheduled to:

  ```shell
  nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
  ```

- Drain the node to evict the Pod before it completes (within 90s):

  ```shell
  kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
  ```

- Inspect the `.status.failed` to check the counter for the Job is not incremented:

  ```shell
  kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
  ```

- Uncordon the node:

  ```shell
  kubectl uncordon nodes/$nodeName
  ```
The Job resumes and succeeds.
For comparison, if the Pod failure policy were disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
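The kind of rule that makes this example ignore disruptions can be sketched as follows (a sketch, not necessarily the exact contents of the referenced manifest; the image and command are assumptions, while the `backoffLimit` of 0 comes from the text above, and the rule relies on the `DisruptionTarget` Pod condition added by the `PodDisruptionConditions` feature gate):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  backoffLimit: 0          # any ordinary Pod failure terminates the Job
  podFailurePolicy:
    rules:
    - action: Ignore       # a disruption does not count towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget   # set on Pods evicted due to disruption
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5   # assumed image
        command: ["bash"]                 # assumed workload
        args:
        - -c
        - sleep 90 && exit 0    # leaves time to drain the node mid-run
```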
### Cleaning up
Delete the Job you created:

```shell
kubectl delete jobs/job-pod-failure-policy-ignore
```
The cluster automatically cleans up the Pods.
## Alternatives
You could rely solely on the Pod backoff failure policy, by specifying the
Job's `.spec.backoffLimit` field. However, in many situations it is
problematic to find a balance between setting a low value for `.spec.backoffLimit`
to avoid unnecessary Pod retries, yet one high enough to make sure the Job is
not terminated by Pod disruptions.