---
reviewers:
- davidopp
- kevin-wangzefeng
- bsalamat
title: Assigning Pods to Nodes
content_type: concept
weight: 20
---

<!-- overview -->

You can constrain a {{< glossary_tooltip text="Pod" term_id="pod" >}} so that it can only run on a particular set of
{{< glossary_tooltip text="node(s)" term_id="node" >}}.
There are several ways to do this and the recommended approaches all use
[label selectors](/docs/concepts/overview/working-with-objects/labels/) to facilitate the selection.
Generally such constraints are unnecessary, as the scheduler will automatically do a reasonable placement
(for example, spreading your Pods across nodes so as not to place Pods on a node with insufficient free resources).
However, there are some circumstances where you may want to control which node
the Pod deploys to, for example, to ensure that a Pod ends up on a node with an SSD attached to it, or to co-locate Pods from two different
services that communicate a lot into the same availability zone.

<!-- body -->

You can use any of the following methods to choose where Kubernetes schedules
specific Pods:

* [nodeSelector](#nodeselector) field matching against [node labels](#built-in-node-labels)
* [Affinity and anti-affinity](#affinity-and-anti-affinity)
* [nodeName](#nodename) field

## Node labels {#built-in-node-labels}

Like many other Kubernetes objects, nodes have
[labels](/docs/concepts/overview/working-with-objects/labels/). You can [attach labels manually](/docs/tasks/configure-pod-container/assign-pods-nodes/#add-a-label-to-a-node).
Kubernetes also populates a standard set of labels on all nodes in a cluster. See [Well-Known Labels, Annotations and Taints](/docs/reference/labels-annotations-taints/)
for a list of common node labels.

{{<note>}}
The value of these labels is cloud provider specific and is not guaranteed to be reliable.
For example, the value of `kubernetes.io/hostname` may be the same as the node name in some environments
and a different value in other environments.
{{</note>}}

### Node isolation/restriction

Adding labels to nodes allows you to target Pods for scheduling on specific
nodes or groups of nodes. You can use this functionality to ensure that specific
Pods only run on nodes with certain isolation, security, or regulatory
properties.

If you use labels for node isolation, choose label keys that the {{<glossary_tooltip text="kubelet" term_id="kubelet">}}
cannot modify. This prevents a compromised node from setting those labels on
itself so that the scheduler schedules workloads onto the compromised node.

The [`NodeRestriction` admission plugin](/docs/reference/access-authn-authz/admission-controllers/#noderestriction)
prevents the kubelet from setting or modifying labels with a
`node-restriction.kubernetes.io/` prefix.

To make use of that label prefix for node isolation:

1. Ensure you are using the [Node authorizer](/docs/reference/access-authn-authz/node/) and have _enabled_ the `NodeRestriction` admission plugin.
2. Add labels with the `node-restriction.kubernetes.io/` prefix to your nodes, and use those labels in your [node selectors](#nodeselector),
   as shown in the sketch after this list.
   For example, `example.com.node-restriction.kubernetes.io/fips=true` or `example.com.node-restriction.kubernetes.io/pci-dss=true`.
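
A Pod can then target those nodes with a `nodeSelector`. The following minimal sketch assumes the illustrative `example.com.node-restriction.kubernetes.io/fips` label from the step above; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fips-only-workload        # placeholder name
spec:
  nodeSelector:
    # Illustrative key from the steps above; kubelets cannot set labels with
    # this prefix on their own Node objects when NodeRestriction is enabled.
    example.com.node-restriction.kubernetes.io/fips: "true"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8   # placeholder image
```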

## nodeSelector

`nodeSelector` is the simplest recommended form of node selection constraint.
You can add the `nodeSelector` field to your Pod specification and specify the
[node labels](#built-in-node-labels) you want the target node to have.
Kubernetes only schedules the Pod onto nodes that have each of the labels you
specify.
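
As a minimal sketch, assuming you have labelled a node with `disktype=ssd` (an illustrative label), a Pod that must be placed on such a node could look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd   # assumed node label, e.g. applied with `kubectl label nodes <node-name> disktype=ssd`
```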

See [Assign Pods to Nodes](/docs/tasks/configure-pod-container/assign-pods-nodes) for more
information.

## Affinity and anti-affinity

`nodeSelector` is the simplest way to constrain Pods to nodes with specific
labels. Affinity and anti-affinity expand the types of constraints you can
define. Some of the benefits of affinity and anti-affinity include:

* The affinity/anti-affinity language is more expressive. `nodeSelector` only
  selects nodes with all the specified labels. Affinity/anti-affinity gives you
  more control over the selection logic.
* You can indicate that a rule is *soft* or *preferred*, so that the scheduler
  still schedules the Pod even if it can't find a matching node.
* You can constrain a Pod using labels on other Pods running on the node (or other topological domain),
  instead of just node labels, which allows you to define rules for which Pods
  can be co-located on a node.

The affinity feature consists of two types of affinity:

* *Node affinity* functions like the `nodeSelector` field but is more expressive and
  allows you to specify soft rules.
* *Inter-pod affinity/anti-affinity* allows you to constrain Pods against labels
  on other Pods.

### Node affinity

Node affinity is conceptually similar to `nodeSelector`, allowing you to constrain which nodes your
Pod can be scheduled on based on node labels. There are two types of node
affinity:

* `requiredDuringSchedulingIgnoredDuringExecution`: The scheduler can't
  schedule the Pod unless the rule is met. This functions like `nodeSelector`,
  but with a more expressive syntax.
* `preferredDuringSchedulingIgnoredDuringExecution`: The scheduler tries to
  find a node that meets the rule. If a matching node is not available, the
  scheduler still schedules the Pod.

{{<note>}}
In the preceding types, `IgnoredDuringExecution` means that if the node labels
change after Kubernetes schedules the Pod, the Pod continues to run.
{{</note>}}

You can specify node affinities using the `.spec.affinity.nodeAffinity` field in
your Pod spec.

For example, consider the following Pod spec:

{{< codenew file="pods/pod-with-node-affinity.yaml" >}}

In this example, the following rules apply:

* The node *must* have a label with the key `kubernetes.io/os` and
  the value `linux`.
* The node *preferably* has a label with the key `another-node-label-key` and
  the value `another-node-label-value`.

You can use the `operator` field to specify a logical operator for Kubernetes to use when
interpreting the rules. You can use `In`, `NotIn`, `Exists`, `DoesNotExist`,
`Gt` and `Lt`.
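
As an illustrative fragment (not part of the example file above), a term combining `Exists` and `Gt` could look like the following; the label keys are assumptions made up for this sketch, and `Gt`/`Lt` compare the label value as an integer:

```yaml
# Fragment of a nodeSelectorTerm: all listed expressions must match the node.
matchExpressions:
- key: example.com/gpu              # assumed label key; the node only needs to carry this label
  operator: Exists
- key: example.com/gpu-generation   # assumed label key; its value is compared numerically
  operator: Gt
  values:
  - "5"
```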

`NotIn` and `DoesNotExist` allow you to define node anti-affinity behavior.
Alternatively, you can use [node taints](/docs/concepts/scheduling-eviction/taint-and-toleration/)
to repel Pods from specific nodes.

{{<note>}}
If you specify both `nodeSelector` and `nodeAffinity`, *both* must be satisfied
for the Pod to be scheduled onto a node.

If you specify multiple `nodeSelectorTerms` associated with `nodeAffinity`
types, then the Pod can be scheduled onto a node if one of the specified `nodeSelectorTerms` can be
satisfied.

If you specify multiple `matchExpressions` associated with a single term in `nodeSelectorTerms`,
then the Pod can be scheduled onto a node only if all the `matchExpressions` are
satisfied.
{{</note>}}

See [Assign Pods to Nodes using Node Affinity](/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/)
for more information.

#### Node affinity weight

You can specify a `weight` between 1 and 100 for each instance of the
`preferredDuringSchedulingIgnoredDuringExecution` affinity type. When the
scheduler finds nodes that meet all the other scheduling requirements of the Pod, the
scheduler iterates through every preferred rule that the node satisfies and adds the
value of the `weight` for that expression to a sum.

The final sum is added to the score of other priority functions for the node.
Nodes with the highest total score are prioritized when the scheduler makes a
scheduling decision for the Pod.

For example, consider the following Pod spec:

{{< codenew file="pods/pod-with-affinity-anti-affinity.yaml" >}}

If there are two possible nodes that match the
`requiredDuringSchedulingIgnoredDuringExecution` rule, one with the
`label-1:key-1` label and another with the `label-2:key-2` label, the scheduler
considers the `weight` of each node and adds the weight to the other scores for
that node, and schedules the Pod onto the node with the highest final score.
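
To make the scoring concrete, a pair of preferred terms shaped like the following sketch (weights and label keys are illustrative, mirroring the prose above) would add 1 to the score of a node labelled `label-1=key-1` and 50 to the score of a node labelled `label-2=key-2`, so the latter node wins when everything else is equal:

```yaml
# Fragment of .spec.affinity.nodeAffinity (illustrative weights and labels).
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
  preference:
    matchExpressions:
    - key: label-1
      operator: In
      values:
      - key-1
- weight: 50
  preference:
    matchExpressions:
    - key: label-2
      operator: In
      values:
      - key-2
```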

{{<note>}}
If you want Kubernetes to successfully schedule the Pods in this example, you
must have existing nodes with the `kubernetes.io/os=linux` label.
{{</note>}}

#### Node affinity per scheduling profile

{{< feature-state for_k8s_version="v1.20" state="beta" >}}

When configuring multiple [scheduling profiles](/docs/reference/scheduling/config/#multiple-profiles), you can associate
a profile with a node affinity, which is useful if a profile only applies to a specific set of nodes.
To do so, add an `addedAffinity` to the `args` field of the [`NodeAffinity` plugin](/docs/reference/scheduling/config/#scheduling-plugins)
in the [scheduler configuration](/docs/reference/scheduling/config/). For example:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration

profiles:
  - schedulerName: default-scheduler
  - schedulerName: foo-scheduler
    pluginConfig:
      - name: NodeAffinity
        args:
          addedAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: scheduler-profile
                  operator: In
                  values:
                  - foo
```

The `addedAffinity` is applied to all Pods that set `.spec.schedulerName` to `foo-scheduler`, in addition to the
NodeAffinity specified in the PodSpec.
That is, in order to match the Pod, nodes need to satisfy `addedAffinity` and
the Pod's `.spec.affinity.nodeAffinity`.
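
For example, a Pod opts in to that profile by naming it in `.spec.schedulerName`; this minimal sketch reuses the `foo-scheduler` profile from the configuration above (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: foo-scheduler   # the addedAffinity above now applies to this Pod
  containers:
  - name: nginx
    image: nginx
```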

Since the `addedAffinity` is not visible to end users, its behavior might be
unexpected to them. Use node labels that have a clear correlation to the
scheduler profile name.

{{< note >}}
The DaemonSet controller, which [creates Pods for DaemonSets](/docs/concepts/workloads/controllers/daemonset/#scheduled-by-default-scheduler),
does not support scheduling profiles. When the DaemonSet controller creates
Pods, the default Kubernetes scheduler places those Pods and honors any
`nodeAffinity` rules configured on those Pods.
{{< /note >}}

### Inter-pod affinity and anti-affinity

Inter-pod affinity and anti-affinity allow you to constrain which nodes your
Pods can be scheduled on based on the labels of **Pods** already running on that
node, instead of the node labels.

Inter-pod affinity and anti-affinity rules take the form "this
Pod should (or, in the case of anti-affinity, should not) run in an X if that X
is already running one or more Pods that meet rule Y", where X is a topology
domain like node, rack, cloud provider zone or region, or similar and Y is the
rule Kubernetes tries to satisfy.

You express these rules (Y) as [label selectors](/docs/concepts/overview/working-with-objects/labels/#label-selectors)
with an optional associated list of namespaces. Pods are namespaced objects in
Kubernetes, so Pod labels also implicitly have namespaces. Any label selectors
for Pod labels should specify the namespaces in which Kubernetes should look for those
labels.

You express the topology domain (X) using a `topologyKey`, which is the key for
the node label that the system uses to denote the domain. For examples, see
[Well-Known Labels, Annotations and Taints](/docs/reference/labels-annotations-taints/).

{{< note >}}
Inter-pod affinity and anti-affinity require a substantial amount of
processing which can slow down scheduling in large clusters significantly. We do
not recommend using them in clusters larger than several hundred nodes.
{{< /note >}}

{{< note >}}
Pod anti-affinity requires nodes to be consistently labelled, in other words,
every node in the cluster must have an appropriate label matching `topologyKey`.
If some or all nodes are missing the specified `topologyKey` label, it can lead
to unintended behavior.
{{< /note >}}

#### Types of inter-pod affinity and anti-affinity

Similar to [node affinity](#node-affinity), there are two types of Pod affinity and
anti-affinity:

* `requiredDuringSchedulingIgnoredDuringExecution`
* `preferredDuringSchedulingIgnoredDuringExecution`

For example, you could use
`requiredDuringSchedulingIgnoredDuringExecution` affinity to tell the scheduler to
co-locate Pods of two services in the same cloud provider zone because they
communicate with each other a lot. Similarly, you could use
`preferredDuringSchedulingIgnoredDuringExecution` anti-affinity to spread Pods
from a service across multiple cloud provider zones.

To use inter-pod affinity, use the `affinity.podAffinity` field in the Pod spec.
For inter-pod anti-affinity, use the `affinity.podAntiAffinity` field in the Pod
spec.

#### Pod affinity example {#an-example-of-a-pod-that-uses-pod-affinity}

Consider the following Pod spec:

{{< codenew file="pods/pod-with-pod-affinity.yaml" >}}

This example defines one Pod affinity rule and one Pod anti-affinity rule. The
Pod affinity rule uses the "hard"
`requiredDuringSchedulingIgnoredDuringExecution`, while the anti-affinity rule
uses the "soft" `preferredDuringSchedulingIgnoredDuringExecution`.

The affinity rule says that the scheduler can only schedule a Pod onto a node if
the node is in the same zone as one or more existing Pods with the label
`security=S1`. More precisely, the scheduler must place the Pod on a node that has the
`topology.kubernetes.io/zone=V` label, as long as there is at least one node in
that zone that currently has one or more Pods with the Pod label `security=S1`.

The anti-affinity rule says that the scheduler should try to avoid scheduling
the Pod onto a node that is in the same zone as one or more Pods with the label
`security=S2`. More precisely, the scheduler should try to avoid placing the Pod on a node that has the
`topology.kubernetes.io/zone=R` label if there are other nodes in the
same zone currently running Pods with the `security=S2` Pod label.

See the
[design doc](https://git.k8s.io/community/contributors/design-proposals/scheduling/podaffinity.md)
for many more examples of Pod affinity and anti-affinity.

You can use the `In`, `NotIn`, `Exists` and `DoesNotExist` values in the
`operator` field for Pod affinity and anti-affinity.

In principle, the `topologyKey` can be any allowed label key with the following
exceptions for performance and security reasons:

* For Pod affinity and anti-affinity, an empty `topologyKey` field is not allowed for either `requiredDuringSchedulingIgnoredDuringExecution`
  or `preferredDuringSchedulingIgnoredDuringExecution`.
* For `requiredDuringSchedulingIgnoredDuringExecution` Pod anti-affinity rules,
  the admission controller `LimitPodHardAntiAffinityTopology` limits
  `topologyKey` to `kubernetes.io/hostname`. You can modify or disable the
  admission controller if you want to allow custom topologies.

In addition to `labelSelector` and `topologyKey`, you can optionally specify a list
of namespaces which the `labelSelector` should match against using the
`namespaces` field at the same level as `labelSelector` and `topologyKey`.
If omitted or empty, `namespaces` defaults to the namespace of the Pod where the
affinity/anti-affinity definition appears.

#### Namespace selector
{{< feature-state for_k8s_version="v1.22" state="beta" >}}

You can also select matching namespaces using `namespaceSelector`, which is a label query over the set of namespaces.
The affinity term is applied to namespaces selected by both `namespaceSelector` and the `namespaces` field.
Note that an empty `namespaceSelector` ({}) matches all namespaces, while a null or empty `namespaces` list and
null `namespaceSelector` match the namespace of the Pod where the rule is defined.
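
As a minimal sketch, a Pod affinity term that only considers Pods in namespaces labelled `team=backend` (an assumed namespace label) could look like this:

```yaml
# Fragment of .spec.affinity.podAffinity (illustrative labels).
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - store
  namespaceSelector:
    matchLabels:
      team: backend              # assumed namespace label
  topologyKey: topology.kubernetes.io/zone
```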

{{<note>}}
This feature is beta and enabled by default. You can disable it via the
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
`PodAffinityNamespaceSelector` in both kube-apiserver and kube-scheduler.
{{</note>}}

#### More practical use-cases

Inter-pod affinity and anti-affinity can be even more useful when they are used with higher
level collections such as ReplicaSets, StatefulSets, Deployments, etc. These
rules allow you to configure that a set of workloads should
be co-located in the same defined topology; for example, the same node.

Take, for example, a three-node cluster running a web application with an
in-memory cache like redis. You could use inter-pod affinity and anti-affinity
to co-locate the web servers with the cache as much as possible.

In the following example Deployment for the redis cache, the replicas get the label `app=store`. The
`podAntiAffinity` rule tells the scheduler to avoid placing multiple replicas
with the `app=store` label on a single node. This places each cache replica on a
separate node.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
```

The following Deployment for the web servers creates replicas with the label `app=web-store`. The
Pod affinity rule tells the scheduler to place each replica on a node that has a
Pod with the label `app=store`. The Pod anti-affinity rule tells the scheduler
to avoid placing multiple `app=web-store` servers on a single node.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
```

Creating the two preceding Deployments results in the following cluster layout,
where each web server is co-located with a cache, on three separate nodes.

| node-1 | node-2 | node-3 |
|:--------------------:|:-------------------:|:------------------:|
| *webserver-1* | *webserver-2* | *webserver-3* |
| *cache-1* | *cache-2* | *cache-3* |

See the [ZooKeeper tutorial](/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure)
for an example of a StatefulSet configured with anti-affinity for high
availability, using the same technique as this example.

## nodeName

`nodeName` is a more direct form of node selection than affinity or
`nodeSelector`. `nodeName` is a field in the Pod spec. If the `nodeName` field
is not empty, the scheduler ignores the Pod and the kubelet on the named node
tries to place the Pod on that node. Using `nodeName` overrules using
`nodeSelector` or affinity and anti-affinity rules.

Some of the limitations of using `nodeName` to select nodes are:

- If the named node does not exist, the Pod will not run, and in
  some cases may be automatically deleted.
- If the named node does not have the resources to accommodate the
  Pod, the Pod will fail and its reason will indicate why,
  for example OutOfmemory or OutOfcpu.
- Node names in cloud environments are not always predictable or
  stable.

Here is an example of a Pod spec using the `nodeName` field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01
```

The above Pod will only run on the node `kube-01`.

## {{% heading "whatsnext" %}}

* Read more about [taints and tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).
* Read the design docs for [node affinity](https://git.k8s.io/community/contributors/design-proposals/scheduling/nodeaffinity.md)
  and for [inter-pod affinity/anti-affinity](https://git.k8s.io/community/contributors/design-proposals/scheduling/podaffinity.md).
* Learn about how the [topology manager](/docs/tasks/administer-cluster/topology-manager/) takes part in node-level
  resource allocation decisions.
* Learn how to use [nodeSelector](/docs/tasks/configure-pod-container/assign-pods-nodes/).
* Learn how to use [affinity and anti-affinity](/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/).