[zh-cn] sync scheduling-eviction/node-pressure-eviction.md

Signed-off-by: xin.li <xin.li@daocloud.io>
xin.li 2023-09-10 14:09:00 +08:00
parent 18568296df
commit 2651394378
1 changed files with 176 additions and 116 deletions


@ -18,8 +18,8 @@ When one or more of these resources reach specific consumption levels, the
kubelet can proactively fail one or more pods on the node to reclaim resources
and prevent starvation.
During a node-pressure eviction, the kubelet sets the `PodPhase` for the
selected pods to `Failed`. This terminates the pods.
During a node-pressure eviction, the kubelet sets the [phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) for the
selected pods to `Failed`. This terminates the Pods.
Node-pressure eviction is not the same as
[API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).
@ -29,49 +29,87 @@ Node-pressure eviction is not the same as
When one or more of these resources reach specific consumption levels, the
kubelet can proactively fail one or more Pods on the node to reclaim resources and prevent starvation.
During a node-pressure eviction, the kubelet sets the `PodPhase` for the selected Pods to `Failed`. This terminates the Pods.
During a node-pressure eviction, the kubelet sets the [phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)
for the selected Pods to `Failed`. This terminates the Pods.
Node-pressure eviction is not the same as [API-initiated eviction](/zh-cn/docs/concepts/scheduling-eviction/api-eviction/).
<!--
The kubelet does not respect your configured `PodDisruptionBudget` or the pod's
The kubelet does not respect your configured {{<glossary_tooltip term_id="pod-disruption-budget" text="PodDisruptionBudget">}}
or the pod's
`terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds),
the kubelet respects your configured `eviction-max-pod-grace-period`. If you use
[hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination.
[hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s` grace period (immediate shutdown) for termination.
-->
The kubelet does not respect your configured {{<glossary_tooltip term_id="pod-disruption-budget" text="PodDisruptionBudget">}}
or the pod's `terminationGracePeriodSeconds`.
If you use [soft eviction thresholds](#soft-eviction-thresholds), the kubelet respects your configured
`eviction-max-pod-grace-period`.
If you use [hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s`
grace period (immediate shutdown) for termination.
<!--
## Self healing behavior
The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
before it terminates end-user pods. For example, it removes unused container
images when disk resources are starved.
-->
## Self healing behavior
The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources) before it terminates end-user Pods.
For example, it removes unused container images when disk resources are starved.
<!--
If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
replaces failed pods, the control plane or `kube-controller-manager` creates new
pods in place of the evicted pods.
-->
The kubelet does not respect your configured `PodDisruptionBudget` or the pod's `terminationGracePeriodSeconds`.
If you use [soft eviction thresholds](#soft-eviction-thresholds), the kubelet respects your configured
`eviction-max-pod-grace-period`.
If you use [hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination.
If the Pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}} resource
(such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that replaces failed Pods,
the control plane or `kube-controller-manager` creates new Pods in place of the evicted Pods.
{{<note>}}
<!--
The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
before it terminates end-user pods. For example, it removes unused container
images when disk resources are starved.
<!--
### Self healing for static pods
-->
The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources) before it terminates end-user Pods.
For example, it removes unused container images when disk resources are starved.
{{</note>}}
### Self healing for static pods
<!--
If you are running a [static pod](/docs/concepts/workloads/pods/#static-pods)
on a node that is under resource pressure, the kubelet may evict that static
Pod. The kubelet then tries to create a replacement, because static Pods always
represent an intent to run a Pod on that node.
-->
If you are running a static Pod on a node that is under resource pressure, the kubelet may evict that static Pod.
Because static Pods always represent an intent to run a Pod on that node, the kubelet then tries to create a replacement.
<!--
The kubelet takes the _priority_ of the static pod into account when creating
a replacement. If the static pod manifest specifies a low priority, and there
are higher-priority Pods defined within the cluster's control plane, and the
node is under resource pressure, the kubelet may not be able to make room for
that static pod. The kubelet continues to attempt to run all static pods even
when there is resource pressure on a node.
-->
The kubelet takes the priority of the static Pod into account when creating a replacement.
If the static Pod manifest specifies a low priority, and there are higher-priority Pods defined
within the cluster's control plane, and the node is under resource pressure,
the kubelet may not be able to make room for that static Pod.
The kubelet continues to attempt to run all static Pods even when there is resource pressure on a node.
<!--
## Eviction signals and thresholds
The kubelet uses various parameters to make eviction decisions, like the following:
- Eviction signals
- Eviction thresholds
- Monitoring intervals
-->
## Eviction signals and thresholds {#eviction-signals-and-thresholds}
The kubelet uses various parameters to make eviction decisions, like the following:
- Eviction signals
@ -86,7 +124,7 @@ point in time. Kubelet uses eviction signals to make eviction decisions by
comparing the signals to eviction thresholds, which are the minimum amount of
the resource that should be available on the node.
Kubelet uses the following eviction signals:
On Linux, the kubelet uses the following eviction signals:
-->
### Eviction signals {#eviction-signals}
@ -94,7 +132,7 @@ Kubelet uses the following eviction signals:
The kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds,
which are the minimum amount of the resource that should be available on the node.
The kubelet uses the following eviction signals:
On Linux, the kubelet uses the following eviction signals:
| Eviction signal      | Description |
|----------------------|---------------------------------------------------------------------------------------|
@ -106,12 +144,12 @@ The kubelet uses the following eviction signals:
| `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` |
<!--
In this table, the `Description` column shows how kubelet gets the value of the
In this table, the **Description** column shows how kubelet gets the value of the
signal. Each signal supports either a percentage or a literal value. Kubelet
calculates the percentage value relative to the total capacity associated with
the signal.
-->
In this table, the `Description` column shows how the kubelet gets the value of the signal. Each signal supports either a percentage or a literal value.
In this table, the **Description** column shows how the kubelet gets the value of the signal. Each signal supports either a percentage or a literal value.
The kubelet calculates the percentage value relative to the total capacity associated with the signal.
<!--
@ -122,7 +160,7 @@ feature, out of resource decisions
are made local to the end user Pod part of the cgroup hierarchy as well as the
root node. This [script](/examples/admin/resource/memory-available.sh)
reproduces the same set of steps that the kubelet performs to calculate
`memory.available`. The kubelet excludes inactive_file (i.e. # of bytes of
`memory.available`. The kubelet excludes inactive_file (the number of bytes of
file-backed memory on inactive LRU list) from its calculation as it assumes that
memory is reclaimable under pressure.
-->
@ -132,27 +170,28 @@ memory is reclaimable under pressure.
feature, out-of-resource decisions are made local to the end-user Pod part of the cgroup hierarchy as well as the cgroup root node.
This [script](/zh-cn/examples/admin/resource/memory-available.sh)
reproduces the same set of steps that the kubelet performs to calculate `memory.available`.
The kubelet excludes inactive_file (i.e. the number of bytes of file-backed memory on the inactive LRU list) from its calculation,
The kubelet excludes inactive_file (the number of bytes of file-backed memory on the inactive LRU list) from its calculation,
as it assumes that memory is reclaimable under pressure.
<!--
The kubelet supports the following filesystem partitions:
The kubelet recognizes two specific filesystem identifiers:
1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir,
log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir
volumes not backed by memory, log storage, and more.
For example, `nodefs` contains `/var/lib/kubelet/`.
1. `imagefs`: An optional filesystem that container runtimes use to store container
images and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet
Kubelet auto-discovers these filesystems and ignores other node local filesystems. Kubelet
does not support other configurations.
-->
kubelet 支持以下文件系统分区
kubelet 可识别以下两个特定的文件系统标识符
1. `nodefs`节点的主要文件系统用于本地磁盘卷、emptyDir、日志存储等。
1. `nodefs`:节点的主要文件系统,用于本地磁盘卷、不受内存支持的 emptyDir、日志存储等。
例如,`nodefs` 包含 `/var/lib/kubelet/`
1. `imagefs`:可选文件系统,供容器运行时存储容器镜像和容器可写层。
kubelet 会自动发现这些文件系统并忽略其他文件系统。kubelet 不支持其他配置。
kubelet 会自动发现这些文件系统并忽略节点本地的其它文件系统。kubelet 不支持其他配置。
<!--
Some kubelet garbage collection features are deprecated in favor of eviction:
@ -179,7 +218,8 @@ Some kubelet garbage collection features are deprecated in favor of eviction:
### Eviction thresholds
You can specify custom eviction thresholds for the kubelet to use when it makes
eviction decisions.
eviction decisions. You can configure [soft](#soft-eviction-thresholds) and
[hard](#hard-eviction-thresholds) eviction thresholds.
Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
@ -193,6 +233,7 @@ Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where
### Eviction thresholds {#eviction-thresholds}
You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions.
You can configure [soft](#soft-eviction-thresholds) and [hard](#hard-eviction-thresholds) eviction thresholds.
Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
@ -204,15 +245,15 @@ Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where
You can use literal values or percentages (`%`).
<!--
For example, if a node has `10Gi` of total memory and you want trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as
either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
For example, if a node has 10GiB of total memory and you want to trigger eviction if
the available memory falls below 1GiB, you can define the eviction threshold as
either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
You can configure soft and hard eviction thresholds.
-->
For example, if a node has 10Gi of total memory and you want to trigger eviction
if the available memory falls below 1Gi, you can define the eviction threshold as
either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
For example, if a node has 10GiB of total memory and you want to trigger eviction
if the available memory falls below 1GiB, you can define the eviction threshold as
either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
You can configure soft and hard eviction thresholds.
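As an illustration only, the threshold from this example could be written into a
KubeletConfiguration file in either of the two forms; this is a sketch using the
`evictionHard` field of the kubelet configuration API, and the values are examples rather
than recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  # Literal form for the 10GiB node described above ...
  memory.available: "1Gi"
  # ... or, equivalently, the percentage form (use one form or the other, not both):
  # memory.available: "10%"
```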
@ -221,8 +262,8 @@ You can configure soft and hard eviction thresholds.
A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. The kubelet does not evict pods until the
grace period is exceeded. The kubelet returns an error on startup if there is no
specified grace period.
grace period is exceeded. The kubelet returns an error on startup if you do
not specify a grace period.
-->
#### Soft eviction thresholds {#soft-eviction-thresholds}
@ -305,25 +346,26 @@ should provide all the thresholds respectively.
To provide custom values, you should set all of the thresholds respectively.
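To make the pairing between a soft threshold and its required grace period concrete,
here is a minimal KubeletConfiguration sketch; the threshold, grace period, and maximum
Pod grace period values are illustrative only:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"      # soft eviction threshold
evictionSoftGracePeriod:
  memory.available: "2m"         # how long the threshold must be exceeded before eviction
evictionMaxPodGracePeriod: 60    # maximum grace period (in seconds) used when evicting Pods
```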
<!--
### Eviction monitoring interval
## Eviction monitoring interval
The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`
The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`,
which defaults to `10s`.
-->
### Eviction monitoring interval
## Eviction monitoring interval {#eviction-monitoring-interval}
The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`, which defaults to `10s`.
<!--
### Node conditions {#node-conditions}
## Node conditions {#node-conditions}
The kubelet reports node conditions to reflect that the node is under pressure
because hard or soft eviction threshold is met, independent of configured grace
periods.
The kubelet reports [node conditions](/docs/concepts/architecture/nodes/#condition)
to reflect that the node is under pressure because hard or soft eviction
threshold is met, independent of configured grace periods.
-->
### Node conditions {#node-conditions}
## Node conditions {#node-conditions}
The kubelet reports node conditions to reflect that the node is under pressure because a hard or soft eviction threshold is met, independent of configured grace periods.
The kubelet reports [node conditions](/zh-cn/docs/concepts/architecture/nodes/#condition) to reflect that the node is under pressure
because a hard or soft eviction threshold is met, independent of configured grace periods.
<!--
The kubelet maps eviction signals to node conditions as follows:
@ -334,6 +376,9 @@ The kubelet maps eviction signals to node conditions as follows:
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
| `PIDPressure` | `pid.available` | Available processes identifiers on the (Linux) node has fallen below an eviction threshold |
The control plane also [maps](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
these node conditions to taints.
The kubelet updates the node conditions based on the configured
`--node-status-update-frequency`, which defaults to `10s`.
-->
@ -345,10 +390,12 @@ The kubelet maps eviction signals to node conditions as follows:
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
| `PIDPressure` | `pid.available` | Available process identifiers on the (Linux) node has fallen below an eviction threshold |
The control plane also [maps](/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition) these node conditions to taints.
The kubelet updates the node conditions based on the configured `--node-status-update-frequency`, which defaults to `10s`.
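Because these conditions are mapped to taints, a Pod that must remain schedulable onto nodes
reporting such pressure can tolerate the corresponding taint. A minimal sketch of a Pod spec
fragment, assuming the standard `node.kubernetes.io/memory-pressure` taint key:

```yaml
# Pod spec fragment: tolerate the taint added for the MemoryPressure node condition.
tolerations:
- key: "node.kubernetes.io/memory-pressure"
  operator: "Exists"
  effect: "NoSchedule"
```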
<!--
#### Node condition oscillation
### Node condition oscillation
In some cases, nodes oscillate above and below soft eviction thresholds without
holding for the defined grace periods. This causes the reported node condition
@ -358,7 +405,7 @@ To protect against oscillation, you can use the `eviction-pressure-transition-pe
flag, which controls how long the kubelet must wait before transitioning a node
condition to a different state. The transition period has a default value of `5m`.
-->
#### Node condition oscillation
### Node condition oscillation {#node-condition-oscillation}
In some cases, nodes oscillate above and below soft eviction thresholds without holding for the defined grace periods.
This causes the reported node condition to constantly switch between `true` and `false`, leading to poor eviction decisions.
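The transition period mentioned above can also be set through the kubelet configuration file;
a sketch, assuming the `evictionPressureTransitionPeriod` field (the value shown is the default):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionPressureTransitionPeriod: "5m"
```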
@ -456,14 +503,14 @@ As a result, kubelet ranks and evicts pods in the following order:
{{<note>}}
<!--
The kubelet does not use the pod's QoS class to determine the eviction order.
The kubelet does not use the pod's [QoS class](/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order.
You can use the QoS class to estimate the most likely pod eviction order when
reclaiming resources like memory. QoS does not apply to EphemeralStorage requests,
reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests,
so the above scenario will not apply if the node is, for example, under `DiskPressure`.
-->
The kubelet does not use the Pod's QoS class to determine the eviction order.
The kubelet does not use the Pod's [QoS class](/zh-cn/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order.
You can use the QoS class to estimate the most likely Pod eviction order when reclaiming resources like memory.
QoS does not apply to ephemeral storage (EphemeralStorage) requests,
QoS classification does not apply to ephemeral storage (EphemeralStorage) requests,
so the above scenario will not apply if the node is, for example, under `DiskPressure`.
{{</note>}}
@ -487,15 +534,25 @@ will choose to evict pods of lowest Priority first.
In this case, it will choose to evict Pods of lowest priority first.
<!--
When the kubelet evicts pods in response to `inode` or `PID` starvation, it uses
the Priority to determine the eviction order, because `inodes` and `PIDs` have no
If you are running a [static pod](/docs/concepts/workloads/pods/#static-pods)
and want to avoid having it evicted under resource pressure, set the
`priority` field for that Pod directly. Static pods do not support the
`priorityClassName` field.
-->
If you are running a [static Pod](/zh-cn/docs/concepts/workloads/pods/#static-pods)
and want to avoid having it evicted under resource pressure, set the `priority` field for that Pod directly.
Static Pods do not support the `priorityClassName` field.
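A hypothetical sketch of a static Pod manifest that sets `priority` directly, as described above;
the name, image, and priority value are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-static-app        # placeholder name
spec:
  priority: 1000000                 # set directly; static Pods do not support priorityClassName
  containers:
  - name: app
    image: example.com/app:1.0      # placeholder image
```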
<!--
When the kubelet evicts pods in response to inode or process ID starvation, it uses
the Pods' relative priority to determine the eviction order, because inodes and PIDs have no
requests.
The kubelet sorts pods differently based on whether the node has a dedicated
`imagefs` filesystem:
-->
When the kubelet evicts Pods in response to inode or PID starvation,
it uses the priority to determine the eviction order, because inodes and PIDs have no requests.
When the kubelet evicts Pods in response to inode or process ID starvation,
it uses the Pods' relative priority to determine the eviction order, because inodes and PIDs have no corresponding request field.
The kubelet sorts Pods differently depending on whether the node has a dedicated `imagefs` filesystem:
@ -569,53 +626,55 @@ evictionMinimumReclaim:
<!--
In this example, if the `nodefs.available` signal meets the eviction threshold,
the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
and then continues to reclaim the minimum amount of `500Mi` it until the signal
reaches `1.5Gi`.
the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.
Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available`
signal reaches `102Gi`.
Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available`
value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount
of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.
The default `eviction-minimum-reclaim` is `0` for all resources.
-->
In this example, if the `nodefs.available` signal meets the eviction threshold,
the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
and then continues to reclaim the minimum amount of `500Mi` until the signal reaches `1.5Gi`.
the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
and then continues to reclaim the minimum amount of 500MiB until the available nodefs storage value reaches 1.5GiB.
Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available` signal reaches `102Gi`.
Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available` value reaches `102Gi`,
representing 102 GiB of available container image storage. If the amount of storage that the kubelet
could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.
The default `eviction-minimum-reclaim` is `0` for all resources.
<!--
### Node out of memory behavior
## Node out of memory behavior
If the node experiences an out of memory (OOM) event prior to the kubelet
If the node experiences an _out of memory_ (OOM) event prior to the kubelet
being able to reclaim memory, the node depends on the [oom_killer](https://lwn.net/Articles/391222/)
to respond.
The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.
-->
### Node out of memory behavior
## Node out of memory behavior {#node-out-of-memory-behavior}
If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory,
If the node experiences an _out of memory_ (OOM) event prior to the kubelet being able to reclaim memory,
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.
The kubelet sets an `oom_score_adj` value for each container based on the Pod's quality of service (QoS).
| Quality of Service | oom_score_adj |
|--------------------|-----------------------------------------------------------------------------------|
| `Guaranteed` | -997 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
| Quality of Service | `oom_score_adj` |
|--------------------|---------------------------------------------------------------------------------------|
| `Guaranteed` | -997 |
| `BestEffort` | 1000 |
| `Burstable` | **min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)** |
{{<note>}}
<!--
The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have
The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have
`system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
-->
The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have
`system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have
`system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
{{</note>}}
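As a hypothetical worked example of the `Burstable` formula above: a container that requests 4GiB of
memory on a node with 32GiB of capacity would get
`oom_score_adj = min(max(2, 1000 - (1000 * 4Gi) / 32Gi), 999) = min(max(2, 875), 999) = 875`.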
<!--
@ -627,8 +686,8 @@ for each container. It then kills the container with the highest score.
This means that containers in low QoS pods that consume a large amount of memory
relative to their scheduling requests are killed first.
Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it
based on its `RestartPolicy`.
Unlike pod eviction, if a container is OOM killed, the kubelet can restart it
based on its `restartPolicy`.
-->
If the kubelet can't reclaim memory before the node experiences an OOM event,
the `oom_killer` calculates an `oom_score` based on the percentage of memory it's using on the node,
@ -638,19 +697,19 @@ based on its `RestartPolicy`.
This means that containers in low QoS Pods that consume a large amount of memory relative to their scheduling requests are killed first.
Unlike Pod eviction, if a container is OOM killed,
the `kubelet` can restart it based on its `RestartPolicy`.
the `kubelet` can restart it based on its `restartPolicy`.
<!--
### Best practices {#node-pressure-eviction-good-practices}
## Good practices {#node-pressure-eviction-good-practices}
The following sections describe best practices for eviction configuration.
The following sections describe good practices for eviction configuration.
-->
### Best practices {#node-pressure-eviction-good-practices}
## Good practices {#node-pressure-eviction-good-practices}
The following sections describe best practices for eviction configuration.
The following sections describe good practices for eviction configuration.
<!--
#### Schedulable resources and eviction policies
### Schedulable resources and eviction policies
When you configure the kubelet with an eviction policy, you should make sure that
the scheduler will not schedule pods if they will trigger eviction because they
@ -664,13 +723,13 @@ immediately induce memory pressure.
<!--
Consider the following scenario:
- Node memory capacity: `10Gi`
- Node memory capacity: 10GiB
- Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
- Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
-->
Consider the following scenario:
* Node memory capacity: `10Gi`
* Node memory capacity: 10GiB
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
* Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
@ -679,64 +738,65 @@ For this to work, the kubelet is launched as follows:
-->
For this to work, the kubelet is launched as follows:
```
```none
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```
<!--
In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory
In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory
for the system, which is `10% of the total memory + the eviction threshold amount`.
The node can reach the eviction threshold if a pod is using more than its request,
or if the system is using more than `1Gi` of memory, which makes the `memory.available`
signal fall below `500Mi` and triggers the threshold.
or if the system is using more than 1GiB of memory, which makes the `memory.available`
signal fall below 500MiB and triggers the threshold.
-->
In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory for the system,
In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory for the system,
which is `10% of the total memory + the eviction threshold amount`.
The node can reach the eviction threshold if a Pod is using more than its request, or if the system is using more than `1Gi` of memory,
which makes the `memory.available` signal fall below `500Mi` and triggers the threshold.
which makes the `memory.available` signal fall below 500MiB and triggers the threshold.
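The same settings can be expressed in a KubeletConfiguration file instead of command-line flags;
a sketch of the equivalent configuration using the `evictionHard` and `systemReserved` fields:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
systemReserved:
  memory: "1.5Gi"
```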
<!--
#### DaemonSet
### DaemonSets and node-pressure eviction {#daemonset}
Pod Priority is a major factor in making eviction decisions. If you do not want
the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high
enough `priorityClass` in the pod spec. You can also use a lower `priorityClass`
or the default to only allow `DaemonSet` pods to run when there are enough
resources.
Pod priority is a major factor in making eviction decisions. If you do not want
the kubelet to evict pods that belong to a DaemonSet, give those pods a high
enough priority by specifying a suitable `priorityClassName` in the pod spec.
You can also use a lower priority, or the default, to only allow pods from that
DaemonSet to run when there are enough resources.
-->
### DaemonSet
### DaemonSets and node-pressure eviction {#daemonset}
Pod priority is a major factor in making eviction decisions.
If you do not want the kubelet to evict Pods that belong to a `DaemonSet`,
give those Pods a high enough `priorityClass` in the Pod spec.
You can also use a lower `priorityClass` or the default to only allow
`DaemonSet` Pods to run when there are enough resources.
If you do not want the kubelet to evict Pods that belong to a DaemonSet,
give those Pods a high enough priority by specifying a suitable `priorityClassName` in the Pod spec.
You can also use a lower priority, or the default, to only allow Pods from that
DaemonSet to run when there are enough resources.
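A minimal sketch of how a DaemonSet's Pod template can set `priorityClassName`; the DaemonSet name,
image, and the `high-priority` class name are placeholders for whatever exists in your cluster:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                      # placeholder name
spec:
  selector:
    matchLabels:
      name: node-agent
  template:
    metadata:
      labels:
        name: node-agent
    spec:
      priorityClassName: high-priority  # placeholder PriorityClass with a suitably high value
      containers:
      - name: agent
        image: example.com/agent:1.0    # placeholder image
```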
<!--
### Known issues
## Known issues
The following sections describe known issues related to out of resource handling.
-->
### Known issues
## Known issues {#known-issues}
The following sections describe known issues related to out of resource handling.
<!--
#### kubelet may not observe memory pressure right away
### kubelet may not observe memory pressure right away
By default, the kubelet polls `cAdvisor` to collect memory usage stats at a
By default, the kubelet polls cAdvisor to collect memory usage stats at a
regular interval. If memory usage increases within that window rapidly, the
kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller`
kubelet may not observe `MemoryPressure` fast enough, and the OOM killer
will still be invoked.
-->
#### kubelet may not observe memory pressure right away
By default, the kubelet polls `cAdvisor` to collect memory usage stats at a regular interval.
By default, the kubelet polls cAdvisor to collect memory usage stats at a regular interval.
If memory usage increases rapidly within that polling window, the kubelet may not observe `MemoryPressure` fast enough,
and the `OOMKiller` will still be invoked.
and the OOM killer will still be invoked.
<!--
You can use the `--kernel-memcg-notification` flag to enable the `memcg`
@ -754,10 +814,10 @@ and `--system-reserved` flags to allocate memory for the system.
a viable workaround is to use the `--kube-reserved` and `--system-reserved` flags to allocate memory for the system.
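A sketch of that workaround in a KubeletConfiguration file, using the `kubeReserved` and
`systemReserved` fields; the amounts are placeholders and should be sized for your own system daemons:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "500Mi"   # placeholder reservation for Kubernetes system daemons
systemReserved:
  memory: "1Gi"     # placeholder reservation for OS system daemons
```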
<!--
#### active_file memory is not considered as available memory
### active_file memory is not considered as available memory
On Linux, the kernel tracks the number of bytes of file-backed memory on active
LRU list as the `active_file` statistic. The kubelet treats `active_file` memory
least recently used (LRU) list as the `active_file` statistic. The kubelet treats `active_file` memory
areas as not reclaimable. For workloads that make intensive use of block-backed
local storage, including ephemeral local storage, kernel-level caches of file
and block data means that many recently accessed cache pages are likely to be
@ -765,9 +825,9 @@ counted as `active_file`. If enough of these kernel block buffers are on the
active LRU list, the kubelet is liable to observe this as high resource use and
taint the node as experiencing memory pressure - triggering pod eviction.
-->
#### active_file memory is not considered as available memory
### active_file memory is not considered as available memory
On Linux, the kernel tracks the number of bytes of file-backed memory on the active LRU list as the `active_file` statistic.
On Linux, the kernel tracks the number of bytes of file-backed memory on the active least recently used (LRU) list as the `active_file` statistic.
The kubelet treats `active_file` memory areas as not reclaimable.
For workloads that make intensive use of block-backed local storage, including ephemeral local storage,
kernel-level caches of file and block data mean that many recently accessed cache pages are likely to be counted as `active_file`.
@ -792,7 +852,7 @@ to estimate or measure an optimal memory limit value for that container.
- Learn about [API-initiated Eviction](/docs/concepts/scheduling-eviction/api-eviction/)
- Learn about [Pod Priority and Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
- Learn about [PodDisruptionBudgets](/docs/tasks/run-application/configure-pdb/)
- Learn about [Quality of Service](/docs/tasks/configure-pod-container/quality-service-pod/) (QoS)
- Learn about [Quality of Service](/docs/tasks/configure-pod-container/quality-service-pod/) (QoS)
- Check out the [Eviction API](/docs/reference/generated/kubernetes-api/{{<param "version">}}/#create-eviction-pod-v1-core)
-->
* Learn about [API-initiated eviction](/zh-cn/docs/concepts/scheduling-eviction/api-eviction/)