Merge pull request #23478 from gm7y8/eviction_policy

Move eviction-policy from tasks to concepts
This commit is contained in:
Kubernetes Prow Robot 2020-09-07 15:51:42 -07:00 committed by GitHub
commit 942fca9b29
5 changed files with 74 additions and 78 deletions

View File

@ -0,0 +1,23 @@
---
title: Eviction Policy
content_template: templates/concept
weight: 60
---
<!-- overview -->
This page provides an overview of the Kubernetes eviction policy.
<!-- body -->
## Eviction Policy
The {{< glossary_tooltip text="Kubelet" term_id="kubelet" >}} can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.
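As a quick, hedged illustration of that behavior, you could list Pods that have reached the `Failed` phase (which includes Pods the kubelet has evicted):
```bash
# Show Pods in the Failed phase across all namespaces; evicted Pods appear here
# until they are cleaned up or replaced by their controller.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```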
## {{% heading "whatsnext" %}}
- Read [Configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) to learn more about eviction signals, thresholds, and handling.

View File

@ -33,7 +33,7 @@ kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container in
a pod has different resource requirements, and every pod also has
different requirements. Therefore, existing nodes need to be filtered
according to the specific scheduling requirements.
@ -77,12 +77,9 @@ one of these at random.
There are two supported ways to configure the filtering and scoring behavior
of the scheduler:
1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring.
1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles.
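As a rough sketch of the second option, the following writes a hypothetical `KubeSchedulerConfiguration` with an extra profile and points kube-scheduler at it (the file path, `apiVersion`, and profile name are assumptions; check the configuration reference for your release):
```bash
# Define two profiles: the default one, and one with all Score plugins disabled.
cat <<'EOF' > /etc/kubernetes/my-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler   # Pods select this via spec.schedulerName
    plugins:
      score:
        disabled:
          - name: '*'                      # turn off every Score plugin in this profile
EOF

# Start (or restart) kube-scheduler against the configuration file.
kube-scheduler --config=/etc/kubernetes/my-scheduler-config.yaml
```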
## {{% heading "whatsnext" %}}

View File

@ -3,7 +3,7 @@ reviewers:
- bsalamat
title: Scheduler Performance Tuning
content_type: concept
weight: 80
---
<!-- overview -->
@ -48,10 +48,13 @@ To change the value, edit the kube-scheduler configuration file (this is likely
to be `/etc/kubernetes/config/kube-scheduler.yaml`), then restart the scheduler.
After you have made this change, you can run
```bash
kubectl get componentstatuses
```
to verify that the kube-scheduler component is healthy. The output is similar to:
```
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
```

View File

@ -3,7 +3,7 @@ reviewers:
- ahg-g
title: Scheduling Framework
content_type: concept
weight: 70
---
<!-- overview -->

View File

@ -18,28 +18,19 @@ nodes become unstable.
<!-- body -->
## Eviction Policy
The `kubelet` can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.
### Eviction Signals
The `kubelet` supports eviction decisions based on the signals described in the following
table. The value of each signal is described in the Description column, which is based on
the `kubelet` summary API.
| Eviction Signal | Description |
|----------------------|---------------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
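To see the raw statistics behind these signals, you could query the `kubelet` summary API through the API server proxy; a sketch (assumes `jq` is available and that the field paths match the Summary API of your release):
```bash
# Replace <node-name>. These fields feed the memory.available, nodefs.* and
# imagefs.* calculations shown in the table above.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '{
  memoryWorkingSetBytes: .node.memory.workingSetBytes,
  nodefsAvailableBytes:  .node.fs.availableBytes,
  nodefsInodesFree:      .node.fs.inodesFree,
  imagefsAvailableBytes: .node.runtime.imageFs.availableBytes,
  imagefsInodesFree:     .node.runtime.imageFs.inodesFree
}'
```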
Each of the above signals supports either a literal or a percentage-based value.
The percentage-based value is calculated relative to the total capacity
@ -65,7 +56,7 @@ memory is reclaimable under pressure.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using
cAdvisor. `kubelet` does not care about any other filesystems. Any other types
of configurations are not currently supported by the kubelet. For example, it is
_not OK_ to store volumes and logs in a dedicated `filesystem`.
In future releases, the `kubelet` will deprecate the existing [garbage
collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
@ -83,9 +74,7 @@ where:
* `eviction-signal` is an eviction signal token as defined in the previous table.
* `operator` is the desired relational operator, such as `<` (less than).
* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.
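Put together, a threshold reads as `<eviction-signal><operator><quantity>`, and several thresholds can be passed as a comma-separated list. A minimal sketch (the values are illustrative only):
```bash
# Quote the value so the shell does not treat '<' as a redirection.
kubelet --eviction-hard='memory.available<500Mi,nodefs.available<10%'
```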
For example, if a node has `10Gi` of total memory and you want to trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as
@ -108,12 +97,9 @@ termination.
To configure soft eviction thresholds, the following flags are supported:
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a corresponding grace period would trigger a Pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
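A minimal sketch of combining these three flags on the `kubelet` command line (the specific values are illustrative assumptions):
```bash
# Evict only if memory.available stays below 1.5Gi for a full 1m30s,
# and cap graceful termination of evicted Pods at 60 seconds.
kubelet --eviction-soft='memory.available<1.5Gi' \
        --eviction-soft-grace-period='memory.available=1m30s' \
        --eviction-max-pod-grace-period=60
```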
#### Hard Eviction Thresholds
@ -124,8 +110,7 @@ with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met would trigger a Pod eviction.
The `kubelet` has the following default hard eviction threshold:
@ -150,10 +135,10 @@ reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` continues to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
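One way to observe these conditions as they change is to print a node's conditions with `kubectl`; a sketch (replace `<node-name>`):
```bash
# Print each condition type (MemoryPressure, DiskPressure, Ready, ...) with its status.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```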
@ -168,8 +153,7 @@ as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.
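A minimal sketch of setting this flag (the five-minute value is an illustrative assumption):
```bash
# Keep a pressure condition set until the threshold has not been met for 5 minutes.
kubelet --eviction-pressure-transition-period=5m0s
```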
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
@ -207,17 +191,8 @@ then by [Priority](/docs/concepts/configuration/pod-priority-preemption/), and t
As a result, `kubelet` ranks and evicts Pods in the following order:
* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last. `Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.
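Because this ordering hinges on a Pod's quality of service class, it can be useful to check which class a Pod was assigned; a sketch:
```bash
# Prints Guaranteed, Burstable or BestEffort for the named Pod.
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
```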
If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
@ -228,6 +203,7 @@ that consumes the largest amount of disk and kills those first.
#### With `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.
@ -235,13 +211,13 @@ If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable
#### Without `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
### Minimum eviction reclaim
In certain scenarios, eviction of Pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition, eviction of resources like `disk` is time consuming.
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
@ -268,10 +244,10 @@ The node reports a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.
| Node Condition | Scheduler Behavior |
| ------------------| ----------------------------------------------------|
| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
| `DiskPressure` | No new Pods are scheduled to the node. |
## Node OOM Behavior
@ -280,11 +256,11 @@ the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respon
The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the Pod.
| Quality of Service | oom_score_adj |
|--------------------|-----------------------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
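As a worked illustration of the `Burstable` row (the request and capacity figures are assumptions):
```bash
# oom_score_adj = min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
memoryRequestBytes=$((4 * 1024 * 1024 * 1024))           # Pod requests 4Gi of memory
machineMemoryCapacityBytes=$((16 * 1024 * 1024 * 1024))  # node has 16Gi of memory
adj=$((1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes))
[ "$adj" -lt 2 ] && adj=2       # max(2, ...)
[ "$adj" -gt 999 ] && adj=999   # min(..., 999)
echo "oom_score_adj=$adj"       # prints oom_score_adj=750
```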
If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory it's using on the node, and then adds the `oom_score_adj` to get an
@ -325,10 +301,7 @@ and trigger eviction assuming those Pods use less than their configured request.
### DaemonSet
As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
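For example, a hypothetical high-priority class that a DaemonSet's Pod template could reference (the name and value here are assumptions, not recommendations):
```bash
# Create a high PriorityClass, then reference it from the DaemonSet's Pod template.
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high          # hypothetical name
value: 1000000                  # larger value means higher priority
globalDefault: false
description: "Keeps DaemonSet Pods from being evicted ahead of lower-priority Pods."
EOF
# In the DaemonSet manifest, set:
#   spec.template.spec.priorityClassName: daemonset-high
```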
## Deprecation of existing feature flags to reclaim disk
@ -338,15 +311,15 @@ there are sufficient resources, specify a lower or default priorityClass.
As disk based eviction matures, the following `kubelet` flags are marked for deprecation
in favor of the simpler configuration supported around eviction.
| Existing Flag | New Flag |
| ------------------------------------------ | ----------------------------------------|
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
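As a rough sketch of the direction of this migration (the flag values and the mapping shown are assumptions, not exact equivalences):
```bash
# Express disk housekeeping through eviction flags instead of the deprecated
# --low-diskspace-threshold-mb / --outofdisk-transition-frequency pair.
kubelet --eviction-hard='nodefs.available<1Gi' \
        --eviction-minimum-reclaim='nodefs.available=500Mi' \
        --eviction-pressure-transition-period=5m0s
```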
## Known issues