655 lines
30 KiB
Markdown
655 lines
30 KiB
Markdown
---
|
|
reviewers:
|
|
- caesarxuchao
|
|
- dchen1107
|
|
title: Nodes
|
|
content_type: concept
|
|
weight: 10
|
|
---
|
|
|
|
<!-- overview -->
|
|
|
|
Kubernetes runs your workload by placing containers into Pods to run on _Nodes_.
|
|
A node may be a virtual or physical machine, depending on the cluster. Each node
|
|
is managed by the
|
|
{{< glossary_tooltip text="control plane" term_id="control-plane" >}}
|
|
and contains the services necessary to run
|
|
{{< glossary_tooltip text="Pods" term_id="pod" >}}.
|
|
|
|
Typically you have several nodes in a cluster; in a learning or resource-limited
|
|
environment, you might have only one node.
|
|
|
|
The [components](/docs/concepts/overview/components/#node-components) on a node include the
|
|
{{< glossary_tooltip text="kubelet" term_id="kubelet" >}}, a
|
|
{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}, and the
|
|
{{< glossary_tooltip text="kube-proxy" term_id="kube-proxy" >}}.
|
|
|
|
<!-- body -->
|
|
|
|
## Management
|
|
|
|
There are two main ways to have Nodes added to the {{< glossary_tooltip text="API server" term_id="kube-apiserver" >}}:
|
|
|
|
1. The kubelet on a node self-registers to the control plane
|
|
2. You (or another human user) manually add a Node object
|
|
|
|
After you create a Node {{< glossary_tooltip text="object" term_id="object" >}},
|
|
or the kubelet on a node self-registers, the control plane checks whether the new Node object is
|
|
valid. For example, if you try to create a Node from the following JSON manifest:
|
|
|
|
```json
|
|
{
|
|
"kind": "Node",
|
|
"apiVersion": "v1",
|
|
"metadata": {
|
|
"name": "10.240.79.157",
|
|
"labels": {
|
|
"name": "my-first-k8s-node"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
Kubernetes creates a Node object internally (the representation). Kubernetes checks
|
|
that a kubelet has registered to the API server that matches the `metadata.name`
|
|
field of the Node. If the node is healthy (i.e. all necessary services are running),
|
|
then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity
|
|
until it becomes healthy.
|
|
|
|
{{< note >}}
|
|
Kubernetes keeps the object for the invalid Node and continues checking to see whether
|
|
it becomes healthy.
|
|
|
|
You, or a {{< glossary_tooltip term_id="controller" text="controller">}}, must explicitly
|
|
delete the Node object to stop that health checking.
|
|
{{< /note >}}
|
|
|
|
The name of a Node object must be a valid
|
|
[DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).
|
|
|
|
### Node name uniqueness
|
|
|
|
The [name](/docs/concepts/overview/working-with-objects/names#names) identifies a Node. Two Nodes
|
|
cannot have the same name at the same time. Kubernetes also assumes that a resource with the same
|
|
name is the same object. In case of a Node, it is implicitly assumed that an instance using the
|
|
same name will have the same state (e.g. network settings, root disk contents)
|
|
and attributes like node labels. This may lead to
|
|
inconsistencies if an instance was modified without changing its name. If the Node needs to be
|
|
replaced or updated significantly, the existing Node object needs to be removed from API server
|
|
first and re-added after the update.
|
|
|
|
### Self-registration of Nodes
|
|
|
|
When the kubelet flag `--register-node` is true (the default), the kubelet will attempt to
|
|
register itself with the API server. This is the preferred pattern, used by most distros.
|
|
|
|
For self-registration, the kubelet is started with the following options:
|
|
|
|
- `--kubeconfig` - Path to credentials to authenticate itself to the API server.
|
|
- `--cloud-provider` - How to talk to a {{< glossary_tooltip text="cloud provider" term_id="cloud-provider" >}}
|
|
to read metadata about itself.
|
|
- `--register-node` - Automatically register with the API server.
|
|
- `--register-with-taints` - Register the node with the given list of
|
|
{{< glossary_tooltip text="taints" term_id="taint" >}} (comma separated `<key>=<value>:<effect>`).
|
|
|
|
No-op if `register-node` is false.
|
|
- `--node-ip` - IP address of the node.
|
|
- `--node-labels` - {{< glossary_tooltip text="Labels" term_id="label" >}} to add when registering the node
|
|
in the cluster (see label restrictions enforced by the
|
|
[NodeRestriction admission plugin](/docs/reference/access-authn-authz/admission-controllers/#noderestriction)).
|
|
- `--node-status-update-frequency` - Specifies how often kubelet posts its node status to the API server.
|
|
|
|
When the [Node authorization mode](/docs/reference/access-authn-authz/node/) and
|
|
[NodeRestriction admission plugin](/docs/reference/access-authn-authz/admission-controllers/#noderestriction)
|
|
are enabled, kubelets are only authorized to create/modify their own Node resource.
|
|
|
|
{{< note >}}
|
|
As mentioned in the [Node name uniqueness](#node-name-uniqueness) section,
|
|
when Node configuration needs to be updated, it is a good practice to re-register
|
|
the node with the API server. For example, if the kubelet being restarted with
|
|
the new set of `--node-labels`, but the same Node name is used, the change will
|
|
not take an effect, as labels are being set on the Node registration.
|
|
|
|
Pods already scheduled on the Node may misbehave or cause issues if the Node
|
|
configuration will be changed on kubelet restart. For example, already running
|
|
Pod may be tainted against the new labels assigned to the Node, while other
|
|
Pods, that are incompatible with that Pod will be scheduled based on this new
|
|
label. Node re-registration ensures all Pods will be drained and properly
|
|
re-scheduled.
|
|
{{< /note >}}
|
|
|
|
### Manual Node administration
|
|
|
|
You can create and modify Node objects using
|
|
{{< glossary_tooltip text="kubectl" term_id="kubectl" >}}.
|
|
|
|
When you want to create Node objects manually, set the kubelet flag `--register-node=false`.
|
|
|
|
You can modify Node objects regardless of the setting of `--register-node`.
|
|
For example, you can set labels on an existing Node or mark it unschedulable.
|
|
|
|
You can use labels on Nodes in conjunction with node selectors on Pods to control
|
|
scheduling. For example, you can constrain a Pod to only be eligible to run on
|
|
a subset of the available nodes.
|
|
|
|
Marking a node as unschedulable prevents the scheduler from placing new pods onto
|
|
that Node but does not affect existing Pods on the Node. This is useful as a
|
|
preparatory step before a node reboot or other maintenance.
|
|
|
|
To mark a Node unschedulable, run:
|
|
|
|
```shell
|
|
kubectl cordon $NODENAME
|
|
```
|
|
|
|
See [Safely Drain a Node](/docs/tasks/administer-cluster/safely-drain-node/)
|
|
for more details.
|
|
|
|
{{< note >}}
|
|
Pods that are part of a {{< glossary_tooltip term_id="daemonset" >}} tolerate
|
|
being run on an unschedulable Node. DaemonSets typically provide node-local services
|
|
that should run on the Node even if it is being drained of workload applications.
|
|
{{< /note >}}
|
|
|
|
## Node status
|
|
|
|
A Node's status contains the following information:
|
|
|
|
* [Addresses](#addresses)
|
|
* [Conditions](#condition)
|
|
* [Capacity and Allocatable](#capacity)
|
|
* [Info](#info)
|
|
|
|
You can use `kubectl` to view a Node's status and other details:
|
|
|
|
```shell
|
|
kubectl describe node <insert-node-name-here>
|
|
```
|
|
|
|
Each section of the output is described below.
|
|
|
|
### Addresses
|
|
|
|
The usage of these fields varies depending on your cloud provider or bare metal configuration.
|
|
|
|
* HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet
|
|
`--hostname-override` parameter.
|
|
* ExternalIP: Typically the IP address of the node that is externally routable (available from
|
|
outside the cluster).
|
|
* InternalIP: Typically the IP address of the node that is routable only within the cluster.
|
|
|
|
|
|
### Conditions {#condition}
|
|
|
|
The `conditions` field describes the status of all `Running` nodes. Examples of conditions include:
|
|
|
|
{{< table caption = "Node conditions, and a description of when each condition applies." >}}
|
|
| Node Condition | Description |
|
|
|----------------------|-------------|
|
|
| `Ready` | `True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last `node-monitor-grace-period` (default is 40 seconds) |
|
|
| `DiskPressure` | `True` if pressure exists on the disk size—that is, if the disk capacity is low; otherwise `False` |
|
|
| `MemoryPressure` | `True` if pressure exists on the node memory—that is, if the node memory is low; otherwise `False` |
|
|
| `PIDPressure` | `True` if pressure exists on the processes—that is, if there are too many processes on the node; otherwise `False` |
|
|
| `NetworkUnavailable` | `True` if the network for the node is not correctly configured, otherwise `False` |
|
|
{{< /table >}}
|
|
|
|
{{< note >}}
|
|
If you use command-line tools to print details of a cordoned Node, the Condition includes
|
|
`SchedulingDisabled`. `SchedulingDisabled` is not a Condition in the Kubernetes API; instead,
|
|
cordoned nodes are marked Unschedulable in their spec.
|
|
{{< /note >}}
|
|
|
|
In the Kubernetes API, a node's condition is represented as part of the `.status`
|
|
of the Node resource. For example, the following JSON structure describes a healthy node:
|
|
|
|
```json
|
|
"conditions": [
|
|
{
|
|
"type": "Ready",
|
|
"status": "True",
|
|
"reason": "KubeletReady",
|
|
"message": "kubelet is posting ready status",
|
|
"lastHeartbeatTime": "2019-06-05T18:38:35Z",
|
|
"lastTransitionTime": "2019-06-05T11:41:27Z"
|
|
}
|
|
]
|
|
```
|
|
|
|
If the `status` of the Ready condition remains `Unknown` or `False` for longer
|
|
than the `pod-eviction-timeout` (an argument passed to the
|
|
{{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager"
|
|
>}}), then the [node controller](#node-controller) triggers
|
|
{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
|
|
for all Pods assigned to that node. The default eviction timeout duration is
|
|
**five minutes**.
|
|
In some cases when the node is unreachable, the API server is unable to communicate
|
|
with the kubelet on the node. The decision to delete the pods cannot be communicated to
|
|
the kubelet until communication with the API server is re-established. In the meantime,
|
|
the pods that are scheduled for deletion may continue to run on the partitioned node.
|
|
|
|
The node controller does not force delete pods until it is confirmed that they have stopped
|
|
running in the cluster. You can see the pods that might be running on an unreachable node as
|
|
being in the `Terminating` or `Unknown` state. In cases where Kubernetes cannot deduce from the
|
|
underlying infrastructure if a node has permanently left a cluster, the cluster administrator
|
|
may need to delete the node object by hand. Deleting the node object from Kubernetes causes
|
|
all the Pod objects running on the node to be deleted from the API server and frees up their
|
|
names.
|
|
|
|
When problems occur on nodes, the Kubernetes control plane automatically creates
|
|
[taints](/docs/concepts/scheduling-eviction/taint-and-toleration/) that match the conditions
|
|
affecting the node.
|
|
The scheduler takes the Node's taints into consideration when assigning a Pod to a Node.
|
|
Pods can also have {{< glossary_tooltip text="tolerations" term_id="toleration" >}} that let
|
|
them run on a Node even though it has a specific taint.
|
|
|
|
See [Taint Nodes by Condition](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
|
|
for more details.
|
|
|
|
### Capacity and Allocatable {#capacity}
|
|
|
|
Describes the resources available on the node: CPU, memory, and the maximum
|
|
number of pods that can be scheduled onto the node.
|
|
|
|
The fields in the capacity block indicate the total amount of resources that a
|
|
Node has. The allocatable block indicates the amount of resources on a
|
|
Node that is available to be consumed by normal Pods.
|
|
|
|
You may read more about capacity and allocatable resources while learning how
|
|
to [reserve compute resources](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
|
|
on a Node.
|
|
|
|
### Info
|
|
|
|
Describes general information about the node, such as kernel version, Kubernetes
|
|
version (kubelet and kube-proxy version), container runtime details, and which
|
|
operating system the node uses.
|
|
The kubelet gathers this information from the node and publishes it into
|
|
the Kubernetes API.
|
|
|
|
## Heartbeats
|
|
|
|
Heartbeats, sent by Kubernetes nodes, help your cluster determine the
|
|
availability of each node, and to take action when failures are detected.
|
|
|
|
For nodes there are two forms of heartbeats:
|
|
|
|
* updates to the `.status` of a Node
|
|
* [Lease](/docs/reference/kubernetes-api/cluster-resources/lease-v1/) objects
|
|
within the `kube-node-lease`
|
|
{{< glossary_tooltip term_id="namespace" text="namespace">}}.
|
|
Each Node has an associated Lease object.
|
|
|
|
Compared to updates to `.status` of a Node, a Lease is a lightweight resource.
|
|
Using Leases for heartbeats reduces the performance impact of these updates
|
|
for large clusters.
|
|
|
|
The kubelet is responsible for creating and updating the `.status` of Nodes,
|
|
and for updating their related Leases.
|
|
|
|
- The kubelet updates the node's `.status` either when there is change in status
|
|
or if there has been no update for a configured interval. The default interval
|
|
for `.status` updates to Nodes is 5 minutes, which is much longer than the 40
|
|
second default timeout for unreachable nodes.
|
|
- The kubelet creates and then updates its Lease object every 10 seconds
|
|
(the default update interval). Lease updates occur independently from
|
|
updates to the Node's `.status`. If the Lease update fails, the kubelet retries,
|
|
using exponential backoff that starts at 200 milliseconds and capped at 7 seconds.
|
|
|
|
## Node controller
|
|
|
|
The node {{< glossary_tooltip text="controller" term_id="controller" >}} is a
|
|
Kubernetes control plane component that manages various aspects of nodes.
|
|
|
|
The node controller has multiple roles in a node's life. The first is assigning a
|
|
CIDR block to the node when it is registered (if CIDR assignment is turned on).
|
|
|
|
The second is keeping the node controller's internal list of nodes up to date with
|
|
the cloud provider's list of available machines. When running in a cloud
|
|
environment and whenever a node is unhealthy, the node controller asks the cloud
|
|
provider if the VM for that node is still available. If not, the node
|
|
controller deletes the node from its list of nodes.
|
|
|
|
The third is monitoring the nodes' health. The node controller is
|
|
responsible for:
|
|
|
|
- In the case that a node becomes unreachable, updating the `Ready` condition
|
|
in the Node's `.status` field. In this case the node controller sets the
|
|
`Ready` condition to `Unknown`.
|
|
- If a node remains unreachable: triggering
|
|
[API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/)
|
|
for all of the Pods on the unreachable node. By default, the node controller
|
|
waits 5 minutes between marking the node as `Unknown` and submitting
|
|
the first eviction request.
|
|
|
|
By default, the node controller checks the state of each node every 5 seconds.
|
|
This period can be configured using the `--node-monitor-period` flag on the
|
|
`kube-controller-manager` component.
|
|
|
|
### Rate limits on eviction
|
|
|
|
In most cases, the node controller limits the eviction rate to
|
|
`--node-eviction-rate` (default 0.1) per second, meaning it won't evict pods
|
|
from more than 1 node per 10 seconds.
|
|
|
|
The node eviction behavior changes when a node in a given availability zone
|
|
becomes unhealthy. The node controller checks what percentage of nodes in the zone
|
|
are unhealthy (the `Ready` condition is `Unknown` or `False`) at
|
|
the same time:
|
|
|
|
- If the fraction of unhealthy nodes is at least `--unhealthy-zone-threshold`
|
|
(default 0.55), then the eviction rate is reduced.
|
|
- If the cluster is small (i.e. has less than or equal to
|
|
`--large-cluster-size-threshold` nodes - default 50), then evictions are stopped.
|
|
- Otherwise, the eviction rate is reduced to `--secondary-node-eviction-rate`
|
|
(default 0.01) per second.
|
|
|
|
The reason these policies are implemented per availability zone is because one
|
|
availability zone might become partitioned from the control plane while the others remain
|
|
connected. If your cluster does not span multiple cloud provider availability zones,
|
|
then the eviction mechanism does not take per-zone unavailability into account.
|
|
|
|
A key reason for spreading your nodes across availability zones is so that the
|
|
workload can be shifted to healthy zones when one entire zone goes down.
|
|
Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at
|
|
the normal rate of `--node-eviction-rate`. The corner case is when all zones are
|
|
completely unhealthy (none of the nodes in the cluster are healthy). In such a
|
|
case, the node controller assumes that there is some problem with connectivity
|
|
between the control plane and the nodes, and doesn't perform any evictions.
|
|
(If there has been an outage and some nodes reappear, the node controller does
|
|
evict pods from the remaining nodes that are unhealthy or unreachable).
|
|
|
|
The node controller is also responsible for evicting pods running on nodes with
|
|
`NoExecute` taints, unless those pods tolerate that taint.
|
|
The node controller also adds {{< glossary_tooltip text="taints" term_id="taint" >}}
|
|
corresponding to node problems like node unreachable or not ready. This means
|
|
that the scheduler won't place Pods onto unhealthy nodes.
|
|
|
|
## Resource capacity tracking {#node-capacity}
|
|
|
|
Node objects track information about the Node's resource capacity: for example, the amount
|
|
of memory available and the number of CPUs.
|
|
Nodes that [self register](#self-registration-of-nodes) report their capacity during
|
|
registration. If you [manually](#manual-node-administration) add a Node, then
|
|
you need to set the node's capacity information when you add it.
|
|
|
|
The Kubernetes {{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} ensures that
|
|
there are enough resources for all the Pods on a Node. The scheduler checks that the sum
|
|
of the requests of containers on the node is no greater than the node's capacity.
|
|
That sum of requests includes all containers managed by the kubelet, but excludes any
|
|
containers started directly by the container runtime, and also excludes any
|
|
processes running outside of the kubelet's control.
|
|
|
|
{{< note >}}
|
|
If you want to explicitly reserve resources for non-Pod processes, see
|
|
[reserve resources for system daemons](/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved).
|
|
{{< /note >}}
|
|
|
|
## Node topology
|
|
|
|
{{< feature-state state="beta" for_k8s_version="v1.18" >}}
|
|
|
|
If you have enabled the `TopologyManager`
|
|
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), then
|
|
the kubelet can use topology hints when making resource assignment decisions.
|
|
See [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/)
|
|
for more information.
|
|
|
|
## Graceful node shutdown {#graceful-node-shutdown}
|
|
|
|
{{< feature-state state="beta" for_k8s_version="v1.21" >}}
|
|
|
|
The kubelet attempts to detect node system shutdown and terminates pods running on the node.
|
|
|
|
Kubelet ensures that pods follow the normal
|
|
[pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
|
|
during the node shutdown.
|
|
|
|
The Graceful node shutdown feature depends on systemd since it takes advantage of
|
|
[systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
|
|
delay the node shutdown with a given duration.
|
|
|
|
Graceful node shutdown is controlled with the `GracefulNodeShutdown`
|
|
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) which is
|
|
enabled by default in 1.21.
|
|
|
|
Note that by default, both configuration options described below,
|
|
`shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` are set to zero,
|
|
thus not activating the graceful node shutdown functionality.
|
|
To activate the feature, the two kubelet config settings should be configured appropriately and
|
|
set to non-zero values.
|
|
|
|
During a graceful shutdown, kubelet terminates pods in two phases:
|
|
|
|
1. Terminate regular pods running on the node.
|
|
2. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
|
|
running on the node.
|
|
|
|
Graceful node shutdown feature is configured with two
|
|
[`KubeletConfiguration`](/docs/tasks/administer-cluster/kubelet-config-file/) options:
|
|
|
|
* `shutdownGracePeriod`:
|
|
* Specifies the total duration that the node should delay the shutdown by. This is the total
|
|
grace period for pod termination for both regular and
|
|
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
|
|
* `shutdownGracePeriodCriticalPods`:
|
|
* Specifies the duration used to terminate
|
|
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
|
|
during a node shutdown. This value should be less than `shutdownGracePeriod`.
|
|
|
|
For example, if `shutdownGracePeriod=30s`, and
|
|
`shutdownGracePeriodCriticalPods=10s`, kubelet will delay the node shutdown by
|
|
30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved
|
|
for gracefully terminating normal pods, and the last 10 seconds would be
|
|
reserved for terminating [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
|
|
|
|
{{< note >}}
|
|
When pods were evicted during the graceful node shutdown, they are marked as shutdown.
|
|
Running `kubectl get pods` shows the status of the the evicted pods as `Terminated`.
|
|
And `kubectl describe pod` indicates that the pod was evicted because of node shutdown:
|
|
|
|
```
|
|
Reason: Terminated
|
|
Message: Pod was terminated in response to imminent node shutdown.
|
|
```
|
|
|
|
{{< /note >}}
|
|
|
|
## Non Graceful node shutdown {#non-graceful-node-shutdown}
|
|
|
|
{{< feature-state state="alpha" for_k8s_version="v1.24" >}}
|
|
|
|
A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
|
|
either because the command does not trigger the inhibitor locks mechanism used by
|
|
kubelet or because of a user error, i.e., the ShutdownGracePeriod and
|
|
ShutdownGracePeriodCriticalPods are not configured properly. Please refer to above
|
|
section [Graceful Node Shutdown](#graceful-node-shutdown) for more details.
|
|
|
|
When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods
|
|
that are part of a StatefulSet will be stuck in terminating status on
|
|
the shutdown node and cannot move to a new running node. This is because kubelet on
|
|
the shutdown node is not available to delete the pods so the StatefulSet cannot
|
|
create a new pod with the same name. If there are volumes used by the pods, the
|
|
VolumeAttachments will not be deleted from the original shutdown node so the volumes
|
|
used by these pods cannot be attached to a new running node. As a result, the
|
|
application running on the StatefulSet cannot function properly. If the original
|
|
shutdown node comes up, the pods will be deleted by kubelet and new pods will be
|
|
created on a different running node. If the original shutdown node does not come up,
|
|
these pods will be stuck in terminating status on the shutdown node forever.
|
|
|
|
To mitigate the above situation, a user can manually add the taint `node.kubernetes.io/out-of-service` with either `NoExecute`
|
|
or `NoSchedule` effect to a Node marking it out-of-service.
|
|
If the `NodeOutOfServiceVolumeDetach`[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
|
|
is enabled on `kube-controller-manager`, and a Node is marked out-of-service with this taint, the
|
|
pods on the node will be forcefully deleted if there are no matching tolerations on it and volume
|
|
detach operations for the pods terminating on the node will happen immediately. This allows the
|
|
Pods on the out-of-service node to recover quickly on a different node.
|
|
|
|
During a non-graceful shutdown, Pods are terminated in the two phases:
|
|
|
|
1. Force delete the Pods that do not have matching `out-of-service` tolerations.
|
|
2. Immediately perform detach volume operation for such pods.
|
|
|
|
{{< note >}}
|
|
- Before adding the taint `node.kubernetes.io/out-of-service` , it should be verified
|
|
that the node is already in shutdown or power off state (not in the middle of
|
|
restarting).
|
|
- The user is required to manually remove the out-of-service taint after the pods are
|
|
moved to a new node and the user has checked that the shutdown node has been
|
|
recovered since the user was the one who originally added the taint.
|
|
{{< /note >}}
|
|
|
|
### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}
|
|
|
|
{{< feature-state state="alpha" for_k8s_version="v1.23" >}}
|
|
|
|
To provide more flexibility during graceful node shutdown around the ordering
|
|
of pods during shutdown, graceful node shutdown honors the PriorityClass for
|
|
Pods, provided that you enabled this feature in your cluster. The feature
|
|
allows cluster administers to explicitly define the ordering of pods
|
|
during graceful node shutdown based on
|
|
[priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
|
|
|
|
The [Graceful Node Shutdown](#graceful-node-shutdown) feature, as described
|
|
above, shuts down pods in two phases, non-critical pods, followed by critical
|
|
pods. If additional flexibility is needed to explicitly define the ordering of
|
|
pods during shutdown in a more granular way, pod priority based graceful
|
|
shutdown can be used.
|
|
|
|
When graceful node shutdown honors pod priorities, this makes it possible to do
|
|
graceful node shutdown in multiple phases, each phase shutting down a
|
|
particular priority class of pods. The kubelet can be configured with the exact
|
|
phases and shutdown time per phase.
|
|
|
|
Assuming the following custom pod
|
|
[priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
|
|
in a cluster,
|
|
|
|
|Pod priority class name|Pod priority class value|
|
|
|-------------------------|------------------------|
|
|
|`custom-class-a` | 100000 |
|
|
|`custom-class-b` | 10000 |
|
|
|`custom-class-c` | 1000 |
|
|
|`regular/unset` | 0 |
|
|
|
|
Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration)
|
|
the settings for `shutdownGracePeriodByPodPriority` could look like:
|
|
|
|
|Pod priority class value|Shutdown period|
|
|
|------------------------|---------------|
|
|
| 100000 |10 seconds |
|
|
| 10000 |180 seconds |
|
|
| 1000 |120 seconds |
|
|
| 0 |60 seconds |
|
|
|
|
The corresponding kubelet config YAML configuration would be:
|
|
|
|
```yaml
|
|
shutdownGracePeriodByPodPriority:
|
|
- priority: 100000
|
|
shutdownGracePeriodSeconds: 10
|
|
- priority: 10000
|
|
shutdownGracePeriodSeconds: 180
|
|
- priority: 1000
|
|
shutdownGracePeriodSeconds: 120
|
|
- priority: 0
|
|
shutdownGracePeriodSeconds: 60
|
|
```
|
|
|
|
The above table implies that any pod with `priority` value >= 100000 will get
|
|
just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180
|
|
seconds to stop, any pod with value >= 1000 and < 10000 will get 120 seconds to stop.
|
|
Finally, all other pods will get 60 seconds to stop.
|
|
|
|
One doesn't have to specify values corresponding to all of the classes. For
|
|
example, you could instead use these settings:
|
|
|
|
|Pod priority class value|Shutdown period|
|
|
|------------------------|---------------|
|
|
| 100000 |300 seconds |
|
|
| 1000 |120 seconds |
|
|
| 0 |60 seconds |
|
|
|
|
|
|
In the above case, the pods with `custom-class-b` will go into the same bucket
|
|
as `custom-class-c` for shutdown.
|
|
|
|
If there are no pods in a particular range, then the kubelet does not wait
|
|
for pods in that priority range. Instead, the kubelet immediately skips to the
|
|
next priority class value range.
|
|
|
|
If this feature is enabled and no configuration is provided, then no ordering
|
|
action will be taken.
|
|
|
|
Using this feature requires enabling the `GracefulNodeShutdownBasedOnPodPriority`
|
|
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
|
|
, and setting `ShutdownGracePeriodByPodPriority` in the
|
|
[kubelet config](/docs/reference/config-api/kubelet-config.v1beta1/)
|
|
to the desired configuration containing the pod priority class values and
|
|
their respective shutdown periods.
|
|
|
|
{{< note >}}
|
|
The ability to take Pod priority into account during graceful node shutdown was introduced
|
|
as an Alpha feature in Kubernetes v1.23. In Kubernetes {{< skew currentVersion >}}
|
|
the feature is Beta and is enabled by default.
|
|
{{< /note >}}
|
|
|
|
Metrics `graceful_shutdown_start_time_seconds` and `graceful_shutdown_end_time_seconds`
|
|
are emitted under the kubelet subsystem to monitor node shutdowns.
|
|
|
|
## Swap memory management {#swap-memory}
|
|
|
|
{{< feature-state state="alpha" for_k8s_version="v1.22" >}}
|
|
|
|
Prior to Kubernetes 1.22, nodes did not support the use of swap memory, and a
|
|
kubelet would by default fail to start if swap was detected on a node. In 1.22
|
|
onwards, swap memory support can be enabled on a per-node basis.
|
|
|
|
To enable swap on a node, the `NodeSwap` feature gate must be enabled on
|
|
the kubelet, and the `--fail-swap-on` command line flag or `failSwapOn`
|
|
[configuration setting](/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration)
|
|
must be set to false.
|
|
|
|
{{< warning >}}
|
|
When the memory swap feature is turned on, Kubernetes data such as the content
|
|
of Secret objects that were written to tmpfs now could be swapped to disk.
|
|
{{< /warning >}}
|
|
|
|
A user can also optionally configure `memorySwap.swapBehavior` in order to
|
|
specify how a node will use swap memory. For example,
|
|
|
|
```yaml
|
|
memorySwap:
|
|
swapBehavior: LimitedSwap
|
|
```
|
|
|
|
The available configuration options for `swapBehavior` are:
|
|
|
|
- `LimitedSwap`: Kubernetes workloads are limited in how much swap they can
|
|
use. Workloads on the node not managed by Kubernetes can still swap.
|
|
- `UnlimitedSwap`: Kubernetes workloads can use as much swap memory as they
|
|
request, up to the system limit.
|
|
|
|
If configuration for `memorySwap` is not specified and the feature gate is
|
|
enabled, by default the kubelet will apply the same behaviour as the
|
|
`LimitedSwap` setting.
|
|
|
|
The behaviour of the `LimitedSwap` setting depends if the node is running with
|
|
v1 or v2 of control groups (also known as "cgroups"):
|
|
|
|
- **cgroupsv1:** Kubernetes workloads can use any combination of memory and
|
|
swap, up to the pod's memory limit, if set.
|
|
- **cgroupsv2:** Kubernetes workloads cannot use swap memory.
|
|
|
|
For more information, and to assist with testing and provide feedback, please
|
|
see [KEP-2400](https://github.com/kubernetes/enhancements/issues/2400) and its
|
|
[design proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md).
|
|
|
|
## {{% heading "whatsnext" %}}
|
|
|
|
* Learn about the [components](/docs/concepts/overview/components/#node-components) that make up a node.
|
|
* Read the [API definition for Node](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#node-v1-core).
|
|
* Read the [Node](https://git.k8s.io/design-proposals-archive/architecture/architecture.md#the-kubernetes-node)
|
|
section of the architecture design document.
|
|
* Read about [taints and tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/).
|
|
|