Doc for Alpha feature PodSchedulingReadiness

Wei Huang 2022-11-02 10:22:35 -07:00
parent b8fc810198
commit 21a7c4cc7e
5 changed files with 132 additions and 0 deletions

@ -28,6 +28,7 @@ of terminating one or more Pods on Nodes.
* [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
* [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
## Pod Disruption

@ -0,0 +1,110 @@
---
title: Pod Scheduling Readiness
content_type: concept
weight: 40
---
<!-- overview -->
{{< feature-state for_k8s_version="v1.26" state="alpha" >}}
Pods have traditionally been considered ready for scheduling as soon as they are created. The
Kubernetes scheduler does its due diligence to find nodes on which to place all pending Pods.
In practice, however, some Pods can remain in a "missing essential resources" state for a long time.
These Pods churn the scheduler (and downstream integrators such as Cluster Autoscaler)
unnecessarily.
By specifying or removing a Pod's `.spec.schedulingGates`, you can control when the Pod is ready
to be considered for scheduling.
<!-- body -->
## Configuring Pod schedulingGates
The `schedulingGates` field contains a list of entries, each identified by a string `name`. Each entry is
treated as a criterion that the Pod must satisfy before it is considered schedulable. This field can be
initialized only when a Pod is created (either by the client, or by mutation during admission). After creation,
individual scheduling gates can be removed in any order, but adding a new scheduling gate is disallowed.
{{<mermaid>}}
stateDiagram-v2
s1: pod created
s2: pod scheduling gated
s3: pod scheduling ready
s4: pod running
if: empty scheduling gates?
state if <<choice>>
[*] --> s1
s1 --> if
s2 --> if: scheduling gate removed
if --> s2: no
if --> s3: yes
s3 --> s4
s4 --> [*]
{{< /mermaid >}}
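The removal-only rule and the state diagram above can be modelled with a short shell sketch. This
illustrates the rule as described on this page; it is not the API server's validation logic. The
function names `gates_update_allowed` and `gates_empty`, and the representation of gates as a
space-separated list of names, are assumptions made for the example.

```bash
# Illustrative model only: gate lists are space-separated names such as "foo bar".

# Succeeds when every gate in the new list already exists in the old list,
# that is, when the update only removes gates (or changes nothing).
gates_update_allowed() {
  local old="$1" new="$2" gate
  for gate in $new; do           # unquoted on purpose: split on whitespace
    case " $old " in
      *" $gate "*) ;;            # gate existed before, so keeping it is fine
      *) return 1 ;;             # gate is new, and additions are rejected
    esac
  done
  return 0
}

# Succeeds when no gates remain, which is when the Pod becomes
# eligible to be considered for scheduling.
gates_empty() {
  set -- $1                      # split the list into positional parameters
  [ "$#" -eq 0 ]
}
```

In this model, going from `"foo bar"` to `"bar"` is allowed, while going from `"foo bar"` to
`"foo baz"` is rejected because `baz` would be an addition.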
## Usage example
To mark a Pod as not ready for scheduling, create it with one or more scheduling gates, like this:
{{< codenew file="pods/pod-with-scheduling-gates.yaml" >}}
After the Pod's creation, you can check its state using:
```bash
kubectl get pod test-pod
```
The output shows that the Pod is in the `SchedulingGated` state:
```bash
NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          7s
```
You can also check its `schedulingGates` field by running:
```bash
kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
```
The output is:
```bash
[{"name":"foo"},{"name":"bar"}]
```
To inform the scheduler that this Pod is ready for scheduling, remove its `schedulingGates` entirely
by re-applying a modified manifest:
{{< codenew file="pods/pod-without-scheduling-gates.yaml" >}}
You can check whether the `schedulingGates` field has been cleared by running:
```bash
kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
```
The output is expected to be empty. You can then check the Pod's latest status by running:
```bash
kubectl get pod test-pod -o wide
```
Because test-pod doesn't request any CPU or memory resources, its state is expected to
transition from `SchedulingGated` to `Running`:
```bash
NAME       READY   STATUS    RESTARTS   AGE   IP         NODE
test-pod   1/1     Running   0          15s   10.0.0.4   node-2
```
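The example above clears every gate at once. Because gates can also be removed individually, one
possible alternative is a JSON Patch (RFC 6902) that removes a single list entry by index. The
sketch below only builds the patch document. `gate_remove_patch` is a hypothetical helper name, and
the sketch assumes that the index you pass matches the gate's current position in
`.spec.schedulingGates`.

```bash
# Hypothetical helper: print a JSON Patch that removes the scheduling gate
# at the given zero-based index of .spec.schedulingGates.
gate_remove_patch() {
  printf '[{"op":"remove","path":"/spec/schedulingGates/%d"}]\n' "$1"
}
```

You could then pass the result to `kubectl patch`, for example
`kubectl patch pod test-pod --type=json -p "$(gate_remove_patch 0)"`. With the manifest from this
page, that would remove `foo` and leave `bar`, so the Pod would be expected to stay in
`SchedulingGated` until `bar` is removed as well.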
## Observability
The `scheduler_pending_pods` metric comes with a new `queue` label value, `"gated"`, which distinguishes
Pods that the scheduler tried to schedule but found unschedulable from Pods that are explicitly marked
as not ready for scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the metric result.
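If you only have the scheduler's metrics as Prometheus text output, a small filter can pick out the
gated samples. This is a minimal sketch: `gated_pending_pods` is a hypothetical helper name, and how
you obtain the metrics text depends on how your cluster exposes the scheduler's metrics endpoint.

```bash
# Hypothetical helper: read Prometheus text-format metrics on standard input
# and print only the scheduler_pending_pods samples labelled queue="gated".
gated_pending_pods() {
  grep -E '^scheduler_pending_pods\{[^}]*queue="gated"[^}]*\}'
}
```

Pipe the metrics text into `gated_pending_pods`; comment lines and samples for other queues are
dropped.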
## {{% heading "whatsnext" %}}
* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details

@ -152,6 +152,7 @@ For a reference to old feature gates that are removed, please refer to
| `PodDeletionCost` | `true` | Beta | 1.22 | |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodHasNetworkCondition` | `false` | Alpha | 1.25 | |
| `PodSchedulingReadiness` | `false` | Alpha | 1.26 | |
| `ProbeTerminationGracePeriod` | `false` | Alpha | 1.21 | 1.21 |
| `ProbeTerminationGracePeriod` | `false` | Beta | 1.22 | 1.24 |
| `ProbeTerminationGracePeriod` | `true` | Beta | 1.25 | |
@ -652,6 +653,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodSchedulingReadiness`: Enable setting the `schedulingGates` field to control a Pod's [scheduling readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness).
- `PodSecurity`: Enables the `PodSecurity` admission plugin.
- `PreferNominatedNode`: This flag tells the scheduler whether the nominated
nodes will be checked first before looping through all the other nodes in

@ -0,0 +1,11 @@
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: foo
  - name: bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6

@ -0,0 +1,8 @@
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6