Improve taint and toleration documentation
parent a17a3b9cb1
commit 82ca38f8df
@@ -58,6 +58,7 @@ toc:
 - docs/concepts/configuration/overview.md
 - docs/concepts/configuration/manage-compute-resources-container.md
 - docs/concepts/configuration/assign-pod-node.md
+- docs/concepts/configuration/taint-and-toleration.md
 - docs/concepts/configuration/secret.md
 - docs/concepts/configuration/organize-cluster-access-kubeconfig.md
@@ -7868,7 +7868,7 @@ Appears In <a href="#pod-v1-core">Pod</a> <a href="#podtemplatespec-v1-core">Pod
 </tr>
 <tr>
 <td>tolerations <br /> <em><a href="#toleration-v1-core">Toleration</a> array</em></td>
-<td>If specified, the pod's tolerations.</td>
+<td>If specified, the pod's tolerations. More info: <a href="https://kubernetes.io/docs/concepts/configuration/taint-and-toleration">https://kubernetes.io/docs/concepts/configuration/taint-and-toleration</a></td>
 </tr>
 <tr>
 <td>volumes <br /> <em><a href="#volume-v1-core">Volume</a> array</em></td>
@@ -7964,7 +7964,7 @@ Appears In:
 </tr>
 <tr>
 <td>tolerations <br /> <em><a href="#toleration-v1-core">Toleration</a> array</em></td>
-<td>If specified, the pod's tolerations.</td>
+<td>If specified, the pod's tolerations. More info: <a href="https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/">https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/</a></td>
 </tr>
 <tr>
 <td>volumes <br /> <em><a href="#volume-v1-core">Volume</a> array</em> <br /> <strong>patch type</strong>: <em>merge</em> <br /> <strong>patch merge key</strong>: <em>name</em></td>
@@ -171,7 +171,7 @@ Starting in Kubernetes 1.6, the NodeController is also responsible for evicting
 pods that are running on nodes with `NoExecute` taints, when the pods do not tolerate
 the taints. Additionally, as an alpha feature that is disabled by default, the
 NodeController is responsible for adding taints corresponding to node problems like
-node unreachable or not ready. See [this documentation](/docs/concepts/configuration/assign-pod-node/#taints-and-tolerations-beta-feature)
+node unreachable or not ready. See [this documentation](/docs/concepts/configuration/taint-and-toleration)
 for details about `NoExecute` taints and the alpha feature.

 ### Self-Registration of Nodes
@@ -118,4 +118,5 @@ spec:
 **Note**: a pod with the _unsafe_ sysctls specified above will fail to launch on
 any node which has not enabled those two _unsafe_ sysctls explicitly. As with
 _node-level_ sysctls it is recommended to use [_taints and toleration_
-feature](/docs/user-guide/kubectl/v1.6/#taint) or [labels on nodes](/docs/concepts/configuration/assign-pod-node/) to schedule those pods onto the right nodes.
+feature](/docs/user-guide/kubectl/v1.6/#taint) or [taints on nodes](/docs/concepts/configuration/taint-and-toleration/)
+to schedule those pods onto the right nodes.
@@ -298,231 +298,5 @@ Highly Available database statefulset has one master and three replicas, one may
 For more information on inter-pod affinity/anti-affinity, see the design doc
 [here](https://git.k8s.io/community/contributors/design-proposals/podaffinity.md).

-## Taints and tolerations (beta feature)
+You may want to check [Taints](/docs/concepts/configuration/taint-and-toleration/)
+as well, which allow a *node* to *repel* a set of pods.
@@ -0,0 +1,257 @@
---
approvers:
- davidopp
- kevin-wangzefeng
- bsalamat
title: Taints and Tolerations
---

Node affinity, described [here](/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature),
is a property of *pods* that *attracts* them to a set of nodes (either as a
preference or a hard requirement). Taints are the opposite -- they allow a
*node* to *repel* a set of pods.

Taints and tolerations work together to ensure that pods are not scheduled
onto inappropriate nodes. One or more taints are applied to a node; this
marks that the node should not accept any pods that do not tolerate the taints.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule
onto nodes with matching taints.

## Concepts

You add a taint to a node using [kubectl taint](/docs/user-guide/kubectl/v1.7/#taint).
For example,

```shell
kubectl taint nodes node1 key=value:NoSchedule
```

places a taint on node `node1`. The taint has key `key`, value `value`, and taint effect `NoSchedule`.
This means that no pod will be able to schedule onto `node1` unless it has a matching toleration.
You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match" the
taint created by the `kubectl taint` line above, and thus a pod with either toleration would be able
to schedule onto `node1`:

```yaml
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```

```yaml
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
```
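
If you later want to undo a taint, `kubectl taint` removes a taint when you append a trailing `-` to the taint reference, for example:

```shell
# Remove the taint added above from node1 (note the trailing "-")
kubectl taint nodes node1 key:NoSchedule-
```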

A toleration "matches" a taint if the keys are the same and the effects are the same, and:

* the `operator` is `Exists` (in which case no `value` should be specified), or
* the `operator` is `Equal` and the `value`s are equal

`Operator` defaults to `Equal` if not specified.

**NOTE:** There are two special cases:

* An empty `key` with operator `Exists` matches all keys, values and effects, which means this
  will tolerate everything.

```yaml
tolerations:
- operator: "Exists"
```

* An empty `effect` matches all effects with key `key`.

```yaml
tolerations:
- key: "key"
  operator: "Exists"
```

The above example used `effect` of `NoSchedule`. Alternatively, you can use `effect` of `PreferNoSchedule`.
This is a "preference" or "soft" version of `NoSchedule` -- the system will *try* to avoid placing a
pod that does not tolerate the taint on the node, but it is not required. The third kind of `effect` is
`NoExecute`, described later.

You can put multiple taints on the same node and multiple tolerations on the same pod.
The way Kubernetes processes multiple taints and tolerations is like a filter: start
with all of a node's taints, then ignore the ones for which the pod has a matching toleration; the
remaining un-ignored taints have the indicated effects on the pod. In particular,

* if there is at least one un-ignored taint with effect `NoSchedule` then Kubernetes will not schedule
  the pod onto that node
* if there is no un-ignored taint with effect `NoSchedule` but there is at least one un-ignored taint with
  effect `PreferNoSchedule` then Kubernetes will *try* to not schedule the pod onto the node
* if there is at least one un-ignored taint with effect `NoExecute` then the pod will be evicted from
  the node (if it is already running on the node), and will not be
  scheduled onto the node (if it is not yet running on the node).

For example, imagine you taint a node like this

```shell
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
```

And a pod has two tolerations:

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
```

In this case, the pod will not be able to schedule onto the node, because there is no
toleration matching the third taint. But it will be able to continue running if it is
already running on the node when the taint is added, because the third taint is the only
one of the three that is not tolerated by the pod.
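
To check your work when reasoning through the filter behavior above, you can list the taints currently stored in a node's spec (the node name matches the example):

```shell
# List the taints on node1 as stored in the node spec
kubectl get node node1 -o jsonpath='{.spec.taints}'
```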

Normally, if a taint with effect `NoExecute` is added to a node, then any pods that do
not tolerate the taint will be evicted immediately, and any pods that do tolerate the
taint will never be evicted. However, a toleration with `NoExecute` effect can specify
an optional `tolerationSeconds` field that dictates how long the pod will stay bound
to the node after the taint is added. For example,

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
```

means that if this pod is running and a matching taint is added to the node, then
the pod will stay bound to the node for 3600 seconds, and then be evicted. If the
taint is removed before that time, the pod will not be evicted.
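
In a complete manifest, `tolerations` sits directly under the pod's `spec`. This minimal sketch (the pod name and image are placeholders, not part of the example above) shows the toleration in context:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo      # hypothetical name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
```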

## Example Use Cases

Taints and tolerations are a flexible way to steer pods *away* from nodes or evict
pods that shouldn't be running. A few of the use cases are:

* **Dedicated Nodes**: If you want to dedicate a set of nodes for exclusive use by
  a particular set of users, you can add a taint to those nodes (say,
  `kubectl taint nodes nodename dedicated=groupName:NoSchedule`) and then add a corresponding
  toleration to their pods (this would be done most easily by writing a custom
  [admission controller](/docs/admin/admission-controllers/)).
  The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as
  well as any other nodes in the cluster. If you want to dedicate the nodes to them *and*
  ensure they *only* use the dedicated nodes, then you should additionally add a label similar
  to the taint to the same set of nodes (e.g. `dedicated=groupName`), and the admission
  controller should additionally add a node affinity to require that the pods can only schedule
  onto nodes labeled with `dedicated=groupName`.
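
The dedicated-nodes recipe above boils down to a pair of commands per node (the node name and group name here are placeholders):

```shell
# Repel pods that lack the matching toleration...
kubectl taint nodes node1 dedicated=groupName:NoSchedule
# ...and label the node so that an admission-controller-injected
# node affinity can require pods to land only on this group.
kubectl label nodes node1 dedicated=groupName
```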
|
||||||
|
|
||||||
|
* **Nodes with Special Hardware**: In a cluster where a small subset of nodes have specialized
|
||||||
|
hardware (for example GPUs), it is desirable to keep pods that don't need the specialized
|
||||||
|
hardware off of those nodes, thus leaving room for later-arriving pods that do need the
|
||||||
|
specialized hardware. This can be done by tainting the nodes that have the specialized
|
||||||
|
hardware (e.g. `kubectl taint nodes nodename special=true:NoSchedule` or
|
||||||
|
`kubectl taint nodes nodename special=true:PreferNoSchedule`) and adding a corresponding
|
||||||
|
toleration to pods that use the special hardware. As in the dedicated nodes use case,
|
||||||
|
it is probably easiest to apply the tolerations using a custom
|
||||||
|
[admission controller](/docs/admin/admission-controllers/)).
|
||||||
|
For example, the admission controller could use
|
||||||
|
some characteristic(s) of the pod to determine that the pod should be allowed to use
|
||||||
|
the special nodes and hence the admission controller should add the toleration.
|
||||||
|
To ensure that the pods that need
|
||||||
|
the special hardware *only* schedule onto the nodes that have the special hardware, you will need some
|
||||||
|
additional mechanism, e.g. you could represent the special resource using
|
||||||
|
[opaque integer resources](/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
|
||||||
|
and request it as a resource in the PodSpec, or you could label the nodes that have
|
||||||
|
the special hardware and use node affinity on the pods that need the hardware.
|
||||||
|
|
||||||
|
* **Taint based Evictions (alpha feature)**: A per-pod-configurable eviction behavior
|
||||||
|
when there are node problems, which is described in the next section.
|
||||||
|
|
||||||
|
## Taint based Evictions
|
||||||
|
|
||||||
|
Earlier we mentioned the `NoExecute` taint effect, which affects pods that are already
|
||||||
|
running on the node as follows
|
||||||
|
|
||||||
|
* pods that do not tolerate the taint are evicted immediately
|
||||||
|
* pods that tolerate the taint without specifying `tolerationSeconds` in
|
||||||
|
their toleration specification remain bound forever
|
||||||
|
* pods that tolerate the taint with a specified `tolerationSeconds` remain
|
||||||
|
bound for the specified amount of time
|
||||||
|
|
||||||
|
The above behavior is a beta feature. In addition, Kubernetes 1.6 has alpha
|
||||||
|
support for representing node problems. In other words, the node controller
|
||||||
|
automatically taints a node when certain condition is true. The builtin taints
|
||||||
|
currently include:
|
||||||
|
|
||||||
|
* `node.alpha.kubernetes.io/notReady`: Node is not ready. This corresponds to
|
||||||
|
the NodeCondition `Ready` being "`False`".
|
||||||
|
* `node.alpha.kubernetes.io/unreachable`: Node is unreachable from the node
|
||||||
|
controller. This corresponds to the NodeCondition `Ready` being "`Unknown`".
|
||||||
|
* `node.kubernetes.io/outOfDisk`: Node becomes out of disk.
|
||||||
|
* `node.kubernetes.io/memoryPressure`: Node has memory pressure.
|
||||||
|
* `node.kubernetes.io/diskPressure`: Node has disk pressure.
|
||||||
|
* `node.kubernetes.io/networkUnavailable`: Node's network is unavailable.
|
||||||
|
* `node.cloudprovider.kubernetes.io/uninitialized`: When kubelet is started
|
||||||
|
with "external" cloud provider, it sets this taint on a node to mark it
|
||||||
|
as unusable. When a controller from the cloud-controller-manager initializes
|
||||||
|
this node, kubelet removes this taint.
|
||||||
|
|
||||||
|
When the `TaintBasedEvictions` alpha feature is enabled (you can do this by
|
||||||
|
including `TaintBasedEvictions=true` in `--feature-gates`, such as
|
||||||
|
`--feature-gates=FooBar=true,TaintBasedEvictions=true`), the taints are automatically
|
||||||
|
added by the NodeController (or kubelet) and the normal logic for evicting pods from nodes
|
||||||
|
based on the Ready NodeCondition is disabled.
|
||||||
|
(Note: To maintain the existing [rate limiting](/docs/concepts/architecture/nodes/)
|
||||||
|
behavior of pod evictions due to node problems, the system actually adds the taints
|
||||||
|
in a rate-limited way. This prevents massive pod evictions in scenarios such
|
||||||
|
as the master becoming partitioned from the nodes.)
|
||||||
|
This alpha feature, in combination with `tolerationSeconds`, allows a pod
|
||||||
|
to specify how long it should stay bound to a node that has one or both of these problems.
|
||||||
|
|
||||||
|
For example, an application with a lot of local state might want to stay
|
||||||
|
bound to node for a long time in the event of network partition, in the hope
|
||||||
|
that the partition will recover and thus the pod eviction can be avoided.
|
||||||
|
The toleration the pod would use in that case would look like
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
tolerations:
|
||||||
|
- key: "node.alpha.kubernetes.io/unreachable"
|
||||||
|
operator: "Exists"
|
||||||
|
effect: "NoExecute"
|
||||||
|
tolerationSeconds: 6000
|
||||||
|
```
|
||||||
|
|
||||||
|
Note that Kubernetes automatically adds a toleration for
|
||||||
|
`node.alpha.kubernetes.io/notReady` with `tolerationSeconds=300`
|
||||||
|
unless the pod configuration provided
|
||||||
|
by the user already has a toleration for `node.alpha.kubernetes.io/notReady`.
|
||||||
|
Likewise it adds a toleration for
|
||||||
|
`node.alpha.kubernetes.io/unreachable` with `tolerationSeconds=300`
|
||||||
|
unless the pod configuration provided
|
||||||
|
by the user already has a toleration for `node.alpha.kubernetes.io/unreachable`.
|
||||||
|
|
||||||
|
These automatically-added tolerations ensure that
|
||||||
|
the default pod behavior of remaining bound for 5 minutes after one of these
|
||||||
|
problems is detected is maintained.
|
||||||
|
The two default tolerations are added by the [DefaultTolerationSeconds
|
||||||
|
admission controller](https://git.k8s.io/kubernetes/plugin/pkg/admission/defaulttolerationseconds).
|
||||||
|
|
||||||
|
[DaemonSet](/docs/concepts/workloads/controllers/daemonset/) pods are created with
`NoExecute` tolerations for the following taints with no `tolerationSeconds`:

* `node.alpha.kubernetes.io/unreachable`
* `node.alpha.kubernetes.io/notReady`
* `node.kubernetes.io/memoryPressure`
* `node.kubernetes.io/diskPressure`
* `node.kubernetes.io/outOfDisk` (*only for critical pods*)

This ensures that DaemonSet pods are never evicted due to these problems,
which matches the behavior when this feature is disabled.
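For the first two of these taints, a sketch of the tolerations a DaemonSet pod carries, assuming the behavior described above (note the absence of `tolerationSeconds`):

```yaml
tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  # no tolerationSeconds: the taint is tolerated forever, so the pod is never evicted
- key: "node.alpha.kubernetes.io/notReady"
  operator: "Exists"
  effect: "NoExecute"
```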
@@ -4152,7 +4152,7 @@ When an object is created, the system will populate this list with the current s
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">tolerations</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">If specified, the pod’s tolerations.</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">If specified, the pod’s tolerations. More info: <a href="https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/">https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/</a></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">false</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><a href="#_v1_toleration">v1.Toleration</a> array</p></td>
<td class="tableblock halign-left valign-top"></td>
@@ -1167,7 +1167,7 @@ Appears In <a href="#pod-v1-core">Pod</a> <a href="#podtemplatespec-v1-core">Pod
</tr>
<tr>
<td>tolerations <br /> <em><a href="#toleration-v1-core">Toleration</a> array</em></td>
<td>If specified, the pod's tolerations.</td>
<td>If specified, the pod's tolerations. More info: <a href="https://kubernetes.io/docs/concepts/configuration/taint-and-toleration">https://kubernetes.io/docs/concepts/configuration/taint-and-toleration</a></td>
</tr>
<tr>
<td>volumes <br /> <em><a href="#volume-v1-core">Volume</a> array</em></td>
@@ -1261,7 +1261,7 @@ Appears In:
</tr>
<tr>
<td>tolerations <br /> <em><a href="#toleration-v1-core">Toleration</a> array</em></td>
<td>If specified, the pod's tolerations.</td>
<td>If specified, the pod's tolerations. More info: <a href="https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/">https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/</a></td>
</tr>
<tr>
<td>volumes <br /> <em><a href="#volume-v1-core">Volume</a> array</em> <br /> <strong>patch type</strong>: <em>merge</em> <br /> <strong>patch merge key</strong>: <em>name</em></td>
@@ -123,6 +123,7 @@ prevent cross talk, or advanced networking policy.

By default, there are no restrictions on which nodes may run a pod. Kubernetes offers a
[rich set of policies for controlling placement of pods onto nodes](/docs/concepts/configuration/assign-pod-node/)
and the [taint-based pod placement and eviction](/docs/concepts/configuration/taint-and-toleration)
that are available to end users. For many clusters, use of these policies to separate workloads
can be a convention that authors adopt or enforce via tooling.