---
assignees:
- davidopp
- kevin-wangzefeng
---
# Constraining pods to run on particular nodes

You can constrain a [pod](/docs/user-guide/pods/) to only be able to run on particular [nodes](/docs/admin/node/) or to prefer to
run on particular nodes. There are several ways to do this, and they all use
[label selectors](/docs/user-guide/labels/) to make the selection.
Generally such constraints are unnecessary, as the scheduler will automatically do a reasonable placement
(e.g. spread your pods across nodes, not place the pod on a node with insufficient free resources, etc.),
but there are some circumstances where you may want more control over where a pod lands, e.g. to ensure
that a pod ends up on a machine with an SSD attached to it, or to co-locate pods from two different
services that communicate a lot in the same availability zone.

You can find all the files for these examples [in our docs
repo here](https://github.com/kubernetes/kubernetes.github.io/tree/{{page.docsbranch}}/docs/user-guide/node-selection).

## nodeSelector

`nodeSelector` is the simplest form of constraint.
`nodeSelector` is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible
to run on a node, the node must have each of the indicated key-value pairs as labels (it can have
additional labels as well). The most common usage is one key-value pair.

Let's walk through an example of how to use `nodeSelector`.

### Step Zero: Prerequisites

This example assumes that you have a basic understanding of Kubernetes pods and that you have [turned up a Kubernetes cluster](https://github.com/kubernetes/kubernetes#documentation).

### Step One: Attach label to the node

Run `kubectl get nodes` to get the names of your cluster's nodes. Pick out the one that you want to add a label to.

Then, to add a label to the node you've chosen, run `kubectl label nodes <node-name> <label-key>=<label-value>`. For example, if my node name is 'kubernetes-foo-node-1.c.a-robinson.internal' and my desired label is 'disktype=ssd', then I can run `kubectl label nodes kubernetes-foo-node-1.c.a-robinson.internal disktype=ssd`.

If this fails with an "invalid command" error, you're likely using an older version of kubectl that doesn't have the `label` command. In that case, see the [previous version](https://github.com/kubernetes/kubernetes/blob/a053dbc313572ed60d89dae9821ecab8bfd676dc/examples/node-selection/README.md) of this guide for instructions on how to manually set labels on a node.

Also, note that label keys must be in the form of DNS labels (as described in the [identifiers doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/identifiers.md)), meaning that they are not allowed to contain any upper-case letters.

You can verify that it worked by running `kubectl get nodes --show-labels` and checking that the node now has the label.

### Step Two: Add a nodeSelector field to your pod configuration

Take whatever pod config file you want to run, and add a nodeSelector section to it, like this. For example, if this is my pod config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
```

Then add a nodeSelector like so:

{% include code.html language="yaml" file="pod.yaml" ghlink="/docs/user-guide/node-selection/pod.yaml" %}

When you then run `kubectl create -f pod.yaml`, the pod will get scheduled on the node that you attached the label to! You can verify that it worked by running `kubectl get pods -o wide` and looking at the "NODE" that the pod was assigned to.
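
The included `pod.yaml` isn't reproduced inline here; as a rough sketch, it is the same pod config as above with a `nodeSelector` map added under `spec`, assuming the `disktype=ssd` label from Step One (the included file is the authoritative version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
  # nodeSelector goes under spec, as a sibling of containers.
  # The pod will only schedule onto nodes labeled disktype=ssd.
  nodeSelector:
    disktype: ssd
```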

## Interlude: built-in node labels

In addition to labels you [attach yourself](#step-one-attach-label-to-the-node), nodes come pre-populated
with a standard set of labels. As of Kubernetes v1.4 these labels are

* `kubernetes.io/hostname`
* `failure-domain.beta.kubernetes.io/zone`
* `failure-domain.beta.kubernetes.io/region`
* `beta.kubernetes.io/instance-type`
* `beta.kubernetes.io/os`
* `beta.kubernetes.io/arch`
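
These built-in labels can be used anywhere user-attached labels can. For example, a hypothetical `nodeSelector` fragment pinning a pod to one zone (the zone value here is a placeholder; use whatever your nodes actually carry):

```yaml
# Hypothetical fragment: only schedule onto nodes in zone us-central1-a,
# using a built-in label instead of one you attached yourself.
nodeSelector:
  failure-domain.beta.kubernetes.io/zone: us-central1-a
```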

## Alpha features: affinity and anti-affinity

`nodeSelector` provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity
feature, currently in alpha, greatly expands the types of constraints you can express. The key enhancements are

1. the language is more expressive (not just "AND of exact match")
1. you can indicate that the rule is "soft"/"preference" rather than a hard requirement, so if the scheduler
   can't satisfy it, the pod will still be scheduled
1. you can constrain against labels on other pods running on the node (or other topological domain),
   rather than against labels on the node itself, which allows rules about which pods can and cannot be co-located

The affinity feature consists of two types of affinity, "node affinity" and "inter-pod affinity/anti-affinity."
Node affinity is like the existing `nodeSelector` (but with the first two benefits listed above),
while inter-pod affinity/anti-affinity constrains against pod labels rather than node labels, as
described in the third item above, in addition to having the first two properties listed above.

`nodeSelector` continues to work as usual, but will eventually be deprecated, as node affinity can express
everything that `nodeSelector` can express.

### Node affinity (alpha feature)

Node affinity was introduced in Kubernetes 1.2.
Node affinity is conceptually similar to `nodeSelector` -- it allows you to constrain which nodes your
pod is eligible to schedule on, based on labels on the node.

There are currently two types of node affinity, called `requiredDuringSchedulingIgnoredDuringExecution` and
`preferredDuringSchedulingIgnoredDuringExecution`. You can think of them as "hard" and "soft" respectively,
in the sense that the former specifies rules that *must* be met for a pod to schedule onto a node (just like
`nodeSelector` but using a more expressive syntax), while the latter specifies *preferences* that the scheduler
will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar
to how `nodeSelector` works, if labels on a node change at runtime such that the affinity rules on a pod are no longer
met, the pod will still continue to run on the node. In the future we plan to offer
`requiredDuringSchedulingRequiredDuringExecution`, which will be just like `requiredDuringSchedulingIgnoredDuringExecution`
except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.

Thus an example of `requiredDuringSchedulingIgnoredDuringExecution` would be "only run the pod on nodes with Intel CPUs"
and an example of `preferredDuringSchedulingIgnoredDuringExecution` would be "try to run this set of pods in availability
zone XYZ, but if it's not possible, then allow some to run elsewhere".

Node affinity is currently expressed using an annotation on Pod. Once the feature goes to GA it will use a field of Pod.

Here's an example of a pod that uses node affinity:

{% include code.html language="yaml" file="pod-with-node-affinity.yaml" ghlink="/docs/user-guide/node-selection/pod-with-node-affinity.yaml" %}

This node affinity rule says the pod can only be placed on a node with a label whose key is
`kubernetes.io/e2e-az-name` and whose value is either `e2e-az1` or `e2e-az2`. In addition,
among nodes that meet those criteria, nodes with a label whose key is `another-annotation-key` and whose
value is `another-annotation-value` should be preferred.
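
Since the included file isn't shown inline, here is a sketch of how such a rule might look in the alpha annotation form, where the `scheduler.alpha.kubernetes.io/affinity` annotation carries the affinity spec as JSON (the pod name, image, and weight are placeholders; treat this as illustrative rather than a copy of the included file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
  annotations:
    # Alpha form: the affinity spec is embedded as JSON in this annotation.
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "kubernetes.io/e2e-az-name",
                    "operator": "In",
                    "values": ["e2e-az1", "e2e-az2"]
                  }
                ]
              }
            ]
          },
          "preferredDuringSchedulingIgnoredDuringExecution": [
            {
              "weight": 10,
              "preference": {
                "matchExpressions": [
                  {
                    "key": "another-annotation-key",
                    "operator": "In",
                    "values": ["another-annotation-value"]
                  }
                ]
              }
            }
          ]
        }
      }
spec:
  containers:
  - name: with-node-affinity
    image: nginx
```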

You can see the operator `In` being used in the example. The new node affinity syntax supports the following operators: `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, `Lt`.
There is no explicit "node anti-affinity" concept, but `NotIn` and `DoesNotExist` give that behavior.
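
For instance, a hypothetical `matchExpressions` entry that keeps a pod out of two zones (node "anti-affinity" via `NotIn`):

```yaml
# Hypothetical fragment: schedule anywhere EXCEPT nodes whose
# kubernetes.io/e2e-az-name label is e2e-az1 or e2e-az2.
- key: kubernetes.io/e2e-az-name
  operator: NotIn
  values:
  - e2e-az1
  - e2e-az2
```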

If you specify both `nodeSelector` and `nodeAffinity`, *both* must be satisfied for the pod
to be scheduled onto a candidate node.

For more information on node affinity, see the design doc
[here](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/nodeaffinity.md).

### Inter-pod affinity and anti-affinity (alpha feature)

Inter-pod affinity and anti-affinity were introduced in Kubernetes 1.4.
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to schedule on *based on
labels on pods that are already running on the node* rather than based on labels on nodes. The rules are of the form "this pod should (or, in the case of
anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y." Y is expressed
as a LabelSelector with an associated list of namespaces (or "all" namespaces); unlike node labels, pod labels are
implicitly namespaced (because pods are namespaced), so a label selector over pod labels must specify which
namespaces the selector should apply to. Conceptually X is a topology domain
like node, rack, cloud provider zone, cloud provider region, etc. You express it using a `topologyKey`, which is the
key of the node label that the system uses to denote such a topology domain; see the label keys listed above
in the section "Interlude: built-in node labels."

As with node affinity, there are currently two types of pod affinity and anti-affinity, called `requiredDuringSchedulingIgnoredDuringExecution` and
`preferredDuringSchedulingIgnoredDuringExecution`, which denote "hard" vs. "soft" requirements;
see the description in the node affinity section earlier.
An example of `requiredDuringSchedulingIgnoredDuringExecution` affinity would be "co-locate the pods of service A and service B
in the same zone, since they communicate a lot with each other,"
and an example of `preferredDuringSchedulingIgnoredDuringExecution` anti-affinity would be "spread the pods from this service across zones"
(a hard requirement wouldn't make sense, since you probably have more pods than zones).

Inter-pod affinity and anti-affinity are currently expressed using an annotation on Pod. Once the feature goes to GA it will use a field.

Here's an example of a pod that uses pod affinity:

{% include code.html language="yaml" file="pod-with-pod-affinity.yaml" ghlink="/docs/user-guide/node-selection/pod-with-pod-affinity.yaml" %}

The affinity annotation on this pod defines one pod affinity rule and one pod anti-affinity rule. Both
must be satisfied for the pod to schedule onto a node. The
pod affinity rule says that the pod can schedule onto a node only if that node is in the same zone
as at least one already-running pod that has a label with key "security" and value "S1". (More precisely, the pod is eligible to run
on node N if node N has a label with key `failure-domain.beta.kubernetes.io/zone` and some value V
such that there is at least one node in the cluster with key `failure-domain.beta.kubernetes.io/zone` and
value V that is running a pod that has a label with key "security" and value "S1".) The pod anti-affinity
rule says that the pod cannot schedule onto a node if that node is already running a pod with label
having key "security" and value "S2". (If the `topologyKey` were `failure-domain.beta.kubernetes.io/zone` then
it would mean that the pod cannot schedule onto a node if that node is in the same zone as a pod with
label having key "security" and value "S2".) See the [design doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/podaffinity.md)
for many more examples of pod affinity and anti-affinity, both the `requiredDuringSchedulingIgnoredDuringExecution`
flavor and the `preferredDuringSchedulingIgnoredDuringExecution` flavor.
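
As with node affinity above, the included file isn't shown inline. A sketch of how the rules just described might look in the alpha annotation form (the pod name and image are placeholders; treat this as illustrative rather than a copy of the included file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
  annotations:
    # Alpha form: pod affinity/anti-affinity embedded as JSON in this annotation.
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "podAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": [
            {
              "labelSelector": {
                "matchExpressions": [
                  {"key": "security", "operator": "In", "values": ["S1"]}
                ]
              },
              "topologyKey": "failure-domain.beta.kubernetes.io/zone"
            }
          ]
        },
        "podAntiAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": [
            {
              "labelSelector": {
                "matchExpressions": [
                  {"key": "security", "operator": "In", "values": ["S2"]}
                ]
              },
              "topologyKey": "kubernetes.io/hostname"
            }
          ]
        }
      }
spec:
  containers:
  - name: with-pod-affinity
    image: nginx
```

Note the `topologyKey` choices: the affinity rule uses the zone label (same zone as an S1 pod), while the anti-affinity rule uses the hostname label (not on the same node as an S2 pod), matching the description above.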

The legal operators for pod affinity and anti-affinity are `In`, `NotIn`, `Exists`, and `DoesNotExist`; unlike node affinity, `Gt` and `Lt` are not supported, since the rules use standard label selectors.

In principle, the `topologyKey` can be any legal label key. However,
for performance reasons, only a limited set of topology keys are allowed;
they are specified in the `--failure-domains` command-line argument to the scheduler. By default the allowed topology keys are

* `kubernetes.io/hostname`
* `failure-domain.beta.kubernetes.io/zone`
* `failure-domain.beta.kubernetes.io/region`

In addition to `labelSelector` and `topologyKey`, you can optionally specify a list `namespaces`
of namespaces that the `labelSelector` should match against (this goes at the same level of the definition as `labelSelector` and `topologyKey`), as shown in the sketch below.
If omitted, it defaults to the namespace of the pod where the affinity/anti-affinity definition appears.
If defined but empty, it means "all namespaces."
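
A hypothetical fragment of a single affinity term showing where `namespaces` sits (the namespace names are placeholders):

```yaml
# Hypothetical affinity term: match pods labeled security=S1,
# but only consider pods in the "web" and "api" namespaces.
labelSelector:
  matchLabels:
    security: S1
topologyKey: failure-domain.beta.kubernetes.io/zone
namespaces:
- web
- api
```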

All `matchExpressions` associated with `requiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
must be satisfied for the pod to schedule onto a node.

For more information on inter-pod affinity/anti-affinity, see the design doc
[here](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/podaffinity.md).