Add scheduler concept guide (#14637)
* Add scheduler concept page
* Rename scheduling overview
* Fix non-ASCII colon symbols
* Reword scheduler concept page
* Move scheduler performance tuning into scheduling
* Signpost from overview to kube-scheduler, etc
* Add whatsnext section to scheduler concept
* Restructure scheduling concept

  Now that there's a concept page for scheduling, some of the details in the
  performance tuning page can have a better home.
* Omit link to (unwritten) scheduling extensions page
* Drop deprecated / superseded filtering rules
This commit is contained in:
parent 6b0711ae01, commit d550d112af
@@ -111,5 +111,8 @@ A [Cluster-level logging](/docs/concepts/cluster-administration/logging/) mechan
saving container logs to a central log store with search/browsing interface.

{{% /capture %}}

{{% capture whatsnext %}}
* Learn about [Nodes](/docs/concepts/architecture/nodes/)
* Learn about [kube-scheduler](/docs/concepts/scheduling/kube-scheduler/)
* Read etcd's official [documentation](https://etcd.io/docs/)
{{% /capture %}}
@@ -0,0 +1,5 @@
---
title: "Scheduling"
weight: 90
---
@@ -0,0 +1,186 @@
---
title: Kubernetes Scheduler
content_template: templates/concept
weight: 60
---

{{% capture overview %}}

In Kubernetes, _scheduling_ refers to making sure that {{< glossary_tooltip text="Pods" term_id="pod" >}}
are matched to {{< glossary_tooltip text="Nodes" term_id="node" >}} so that
{{< glossary_tooltip term_id="kubelet" >}} can run them.

{{% /capture %}}

{{% capture body %}}
## Scheduling overview {#scheduling}

A scheduler watches for newly created Pods that have no Node assigned. For
every Pod that the scheduler discovers, the scheduler becomes responsible
for finding the best Node for that Pod to run on. The scheduler reaches
this placement decision by taking into account the scheduling principles
described below.

If you want to understand why Pods are placed onto a particular Node,
or if you're planning to implement a custom scheduler yourself, this
page will help you learn about scheduling.
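For example, a Pod created from a manifest like this minimal sketch (the name
and image are placeholders) starts out with no `.spec.nodeName`; the scheduler
picks a Node for it and records that decision in the field when it binds the Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx          # placeholder name
spec:
  # no nodeName is set here; the scheduler fills in .spec.nodeName
  # when it binds this Pod to a Node
  containers:
  - name: nginx
    image: nginx       # placeholder image
```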
## kube-scheduler

[kube-scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/)
is the default scheduler for Kubernetes and runs as part of the
{{< glossary_tooltip text="control plane" term_id="control-plane" >}}.
kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.
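If you do run an additional scheduler, you choose the scheduler for each Pod
through `.spec.schedulerName`. The following is a minimal sketch; the scheduler
name `my-custom-scheduler` is a hypothetical example, and Pods that omit the
field are handled by the default scheduler:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example                        # placeholder name
spec:
  schedulerName: my-custom-scheduler   # hypothetical custom scheduler name
  containers:
  - name: app
    image: nginx                       # placeholder image
```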
For every newly created Pod or other unscheduled Pods, kube-scheduler
selects an optimal Node for them to run on. However, every container in
a Pod has different resource requirements, and every Pod also has
different requirements. Therefore, existing Nodes need to be filtered
according to the specific scheduling requirements.

In a cluster, Nodes that meet the scheduling requirements for a Pod
are called _feasible_ Nodes. If none of the Nodes are suitable, the Pod
remains unscheduled until the scheduler is able to place it.

The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the highest
score among the feasible ones to run the Pod. The scheduler then notifies
the API server about this decision in a process called _binding_.

Factors that need to be taken into account for scheduling decisions include
individual and collective resource requirements, hardware / software /
policy constraints, affinity and anti-affinity specifications, data
locality, inter-workload interference, and so on.
## Scheduling with kube-scheduler {#kube-scheduler-implementation}

kube-scheduler selects a Node for the Pod in a 2-step operation:

1. Filtering
2. Scoring
The _filtering_ step finds the set of Nodes where it's feasible to
schedule the Pod. For example, the `PodFitsResources` filter checks whether a
candidate Node has enough available resources to meet a Pod's specific
resource requests. After this step, the node list contains any suitable
Nodes; often, there will be more than one. If the list is empty, that
Pod isn't (yet) schedulable.
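As a concrete illustration (a minimal sketch; the name, image and request
values are placeholder assumptions), a Pod like the one below only passes
`PodFitsResources` on Nodes that still have at least 500 millicores of CPU
and 128 MiB of memory unrequested:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo      # placeholder name
spec:
  containers:
  - name: app
    image: nginx           # placeholder image
    resources:
      requests:
        cpu: "500m"        # the filter compares requests, not actual usage
        memory: "128Mi"
```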
In the _scoring_ step, the scheduler ranks the remaining Nodes to choose
the most suitable Pod placement. The scheduler assigns a score to each Node
that survived filtering, basing this score on the active scoring rules.

Finally, kube-scheduler assigns the Pod to the Node with the highest ranking.
If there is more than one Node with equal scores, kube-scheduler selects
one of these at random.
### Default policies

kube-scheduler has a default set of scheduling policies.

### Filtering
- `PodFitsHostPorts`: Checks if a Node has free ports (the network protocol kind)
  for the Pod ports the Pod is requesting.

- `PodFitsHost`: Checks if a Pod specifies a specific Node by its hostname.

- `PodFitsResources`: Checks if the Node has free resources (e.g., CPU and memory)
  to meet the requirements of the Pod.

- `PodMatchNodeSelector`: Checks if a Pod's Node {{< glossary_tooltip term_id="selector" >}}
  matches the Node's {{< glossary_tooltip text="label(s)" term_id="label" >}}
  (see the sketch after this list).

- `NoVolumeZoneConflict`: Evaluates if the {{< glossary_tooltip text="Volumes" term_id="volume" >}}
  that a Pod requests are available on the Node, given the failure zone restrictions for
  that storage.

- `NoDiskConflict`: Evaluates if a Pod can fit on a Node due to the volumes it requests,
  and those that are already mounted.

- `MaxCSIVolumeCount`: Decides how many {{< glossary_tooltip text="CSI" term_id="csi" >}}
  volumes should be attached, and whether that's over a configured limit.

- `CheckNodeMemoryPressure`: If a Node is reporting memory pressure, and there's no
  configured exception, the Pod won't be scheduled there.

- `CheckNodePIDPressure`: If a Node is reporting that process IDs are scarce, and
  there's no configured exception, the Pod won't be scheduled there.

- `CheckNodeDiskPressure`: If a Node is reporting storage pressure (a filesystem that
  is full or nearly full), and there's no configured exception, the Pod won't be
  scheduled there.

- `CheckNodeCondition`: Nodes can report that they have a completely full filesystem,
  that networking isn't available, or that kubelet is otherwise not ready to run Pods.
  If such a condition is set for a Node, and there's no configured exception, the Pod
  won't be scheduled there.

- `PodToleratesNodeTaints`: Checks if a Pod's {{< glossary_tooltip text="tolerations" term_id="toleration" >}}
  can tolerate the Node's {{< glossary_tooltip text="taints" term_id="taint" >}}
  (see the sketch after this list).

- `CheckVolumeBinding`: Evaluates if a Pod can fit due to the volumes it requests.
  This applies for both bound and unbound
  {{< glossary_tooltip text="PVCs" term_id="persistent-volume-claim" >}}.
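The node selector and toleration checks above map directly onto fields in the
Pod spec. The following is a minimal sketch (the `disktype: ssd` label and the
`example-key` taint are placeholder assumptions): `PodMatchNodeSelector` only
passes Nodes labelled `disktype=ssd`, and `PodToleratesNodeTaints` additionally
allows Nodes that carry a matching `NoSchedule` taint:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selector-demo    # placeholder name
spec:
  nodeSelector:
    disktype: ssd        # assumed Node label; checked by PodMatchNodeSelector
  tolerations:
  - key: "example-key"   # assumed taint key; checked by PodToleratesNodeTaints
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx         # placeholder image
```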
### Scoring

- `SelectorSpreadPriority`: Spreads Pods across hosts, considering Pods that
  belong to the same {{< glossary_tooltip text="Service" term_id="service" >}},
  {{< glossary_tooltip term_id="statefulset" >}} or
  {{< glossary_tooltip term_id="replica-set" >}}.

- `InterPodAffinityPriority`: Computes a sum by iterating through the elements
  of `weightedPodAffinityTerm` and adding the `weight` to the sum if the corresponding
  PodAffinityTerm is satisfied for that node; the node(s) with the highest sum
  are the most preferred.

- `LeastRequestedPriority`: Favors Nodes with fewer requested resources. In other
  words, the more Pods that are placed on a Node, and the more resources those
  Pods use, the lower the ranking this policy will give.

- `MostRequestedPriority`: Favors Nodes with most requested resources. This policy
  will fit the scheduled Pods onto the smallest number of Nodes needed to run your
  overall set of workloads.

- `RequestedToCapacityRatioPriority`: Creates a `requestedToCapacity`-based
  `ResourceAllocationPriority` using a default resource scoring function shape.

- `BalancedResourceAllocation`: Favors Nodes with balanced resource usage.

- `NodePreferAvoidPodsPriority`: Prioritizes Nodes according to the Node annotation
  `scheduler.alpha.kubernetes.io/preferAvoidPods`. You can use this to hint that
  two different Pods shouldn't run on the same Node.

- `NodeAffinityPriority`: Prioritizes Nodes according to node affinity scheduling
  preferences indicated in `PreferredDuringSchedulingIgnoredDuringExecution`
  (see the sketch after this list). You can read more about this in
  [Assigning Pods to Nodes](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/).

- `TaintTolerationPriority`: Prepares the priority list for all the Nodes, based on
  the number of intolerable taints on the Node. This policy adjusts a Node's rank
  taking that list into account.

- `ImageLocalityPriority`: Favors Nodes that already have the
  {{< glossary_tooltip text="container images" term_id="image" >}} for that
  Pod cached locally.

- `ServiceSpreadingPriority`: For a given Service, this policy aims to make sure that
  the Pods for the Service run on different Nodes. It favors scheduling onto Nodes
  that don't have Pods for the Service already assigned there. The overall outcome is
  that the Service becomes more resilient to a single Node failure.

- `CalculateAntiAffinityPriorityMap`: This policy helps implement
  [pod anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity).

- `EqualPriorityMap`: Gives an equal weight of one to all Nodes.
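The node affinity preferences that `NodeAffinityPriority` scores are expressed
in the Pod spec. The following is a minimal sketch (the label key, values and
weight are placeholder assumptions): Nodes labelled `disktype=ssd` score higher
for this Pod, but the Pod can still be scheduled elsewhere:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo              # placeholder name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1                  # placeholder weight (1-100)
        preference:
          matchExpressions:
          - key: disktype          # assumed Node label key
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx                   # placeholder image
```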
{{% /capture %}}

{{% capture whatsnext %}}
* Read about [scheduler performance tuning](/docs/concepts/scheduling/scheduler-perf-tuning/)
* Read the [reference documentation](/docs/reference/command-line-tools-reference/kube-scheduler/) for kube-scheduler
* Learn about [configuring multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)
{{% /capture %}}
@@ -10,13 +10,19 @@ weight: 70

{{< feature-state for_k8s_version="1.14" state="beta" >}}

Kube-scheduler is the Kubernetes default scheduler. It is responsible for
placement of Pods on Nodes in a cluster. Nodes in a cluster that meet the
scheduling requirements of a Pod are called "feasible" Nodes for the Pod. The
scheduler finds feasible Nodes for a Pod and then runs a set of functions to
score the feasible Nodes and picks a Node with the highest score among the
feasible ones to run the Pod. The scheduler then notifies the API server about
this decision in a process called "Binding".
[kube-scheduler](/docs/concepts/scheduling/kube-scheduler/#kube-scheduler)
is the Kubernetes default scheduler. It is responsible for placement of Pods
on Nodes in a cluster.

Nodes in a cluster that meet the scheduling requirements of a Pod are
called _feasible_ Nodes for the Pod. The scheduler finds feasible Nodes
for a Pod and then runs a set of functions to score the feasible Nodes,
picking a Node with the highest score among the feasible ones to run
the Pod. The scheduler then notifies the API server about this decision
in a process called _Binding_.

This page explains performance tuning optimizations that are relevant for
large Kubernetes clusters.

{{% /capture %}}
@@ -37,7 +43,7 @@ size of the cluster if it is not specified in the configuration. It uses a
linear formula which yields 50% for a 100-node cluster. The formula yields 10%
for a 5000-node cluster. The lower bound for the automatic value is 5%. In other
words, the scheduler always scores at least 5% of the cluster no matter how
large the cluster is, unless the user provides the config option with a value
large the cluster is, unless the user provides the config option with a value
smaller than 5.

Below is an example configuration that sets `percentageOfNodesToScore` to 50%.
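As a rough, hedged sketch, a configuration file that sets this option might look
like the following, assuming a `KubeSchedulerConfiguration` whose exact
`apiVersion` depends on your Kubernetes release:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha1   # assumed; check your release
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
percentageOfNodesToScore: 50
```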
@@ -103,6 +103,7 @@
/docs/concepts/clusters/logging/ /docs/concepts/cluster-administration/logging/ 301
/docs/concepts/configuration/container-command-arg/ /docs/tasks/inject-data-application/define-command-argument-container/ 301
/docs/concepts/configuration/container-command-args/ /docs/tasks/inject-data-application/define-command-argument-container/ 301
/docs/concepts/configuration/scheduler-perf-tuning/ /docs/concepts/scheduling/scheduler-perf-tuning/ 301
/docs/concepts/ecosystem/thirdpartyresource/ /docs/tasks/access-kubernetes-api/extend-api-third-party-resource/ 301
/docs/concepts/jobs/cron-jobs/ /docs/concepts/workloads/controllers/cron-jobs/ 301
/docs/concepts/jobs/run-to-completion-finite-workloads/ /docs/concepts/workloads/controllers/jobs-run-to-completion/ 301