Merge pull request #23478 from gm7y8/eviction_policy

Move eviction-policy from tasks to concepts
This commit is contained in:
Kubernetes Prow Robot 2020-09-07 15:51:42 -07:00 committed by GitHub
commit 942fca9b29
5 changed files with 74 additions and 78 deletions

View File

@ -0,0 +1,23 @@
---
title: Eviction Policy
content_template: templates/concept
weight: 60
---
<!-- overview -->
This page provides an overview of the Kubernetes eviction policy.
<!-- body -->
## Eviction Policy
The {{< glossary_tooltip text="Kubelet" term_id="kubelet" >}} can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.
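As a quick, hedged illustration of that behavior, you could list Pods that have reached the `Failed` phase (which includes Pods the kubelet has evicted):
```bash
# Show Pods in the Failed phase across all namespaces; evicted Pods appear here
# until they are cleaned up or replaced by their controller.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```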
## {{% heading "whatsnext" %}}
- Read [Configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) to learn more about eviction signals, thresholds, and handling.

View File

@ -33,7 +33,7 @@ kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container in
a pod has different resource requirements, and every pod also has
different requirements. Therefore, existing nodes need to be filtered
according to the specific scheduling requirements.
@ -77,12 +77,9 @@ one of these at random.
There are two supported ways to configure the filtering and scoring behavior
of the scheduler:
1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring.
1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles.
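As a rough sketch of the second option, the following writes a hypothetical `KubeSchedulerConfiguration` with an extra profile and points kube-scheduler at it (the file path, `apiVersion`, and profile name are assumptions; check the configuration reference for your release):
```bash
# Define two profiles: the default one, and one with all Score plugins disabled.
cat <<'EOF' > /etc/kubernetes/my-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler   # Pods select this via spec.schedulerName
    plugins:
      score:
        disabled:
          - name: '*'                      # turn off every Score plugin in this profile
EOF

# Start (or restart) kube-scheduler against the configuration file.
kube-scheduler --config=/etc/kubernetes/my-scheduler-config.yaml
```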
## {{% heading "whatsnext" %}}

View File

@ -3,7 +3,7 @@ reviewers:
- bsalamat
title: Scheduler Performance Tuning
content_type: concept
weight: 80
---
<!-- overview -->
@ -48,10 +48,13 @@ To change the value, edit the kube-scheduler configuration file (this is likely
to be `/etc/kubernetes/config/kube-scheduler.yaml`), then restart the scheduler.
After you have made this change, you can run
```bash
kubectl get componentstatuses
```
to verify that the kube-scheduler component is healthy. The output is similar to:
```
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
```

View File

@ -3,7 +3,7 @@ reviewers:
- ahg-g
title: Scheduling Framework
content_type: concept
weight: 70
---
<!-- overview -->

View File

@ -18,28 +18,19 @@ nodes become unstable.
<!-- body -->
## Eviction Policy
The `kubelet` can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.
### Eviction Signals
The `kubelet` supports eviction decisions based on the signals described in the following
table. The value of each signal is described in the Description column, which is based on
the `kubelet` summary API.
| Eviction Signal | Description |
|----------------------|---------------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
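To see the raw statistics behind these signals, you could query the `kubelet` summary API through the API server proxy; a sketch (assumes `jq` is available and that the field paths match the Summary API of your release):
```bash
# Replace <node-name>. These fields feed the memory.available, nodefs.* and
# imagefs.* calculations shown in the table above.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '{
  memoryWorkingSetBytes: .node.memory.workingSetBytes,
  nodefsAvailableBytes:  .node.fs.availableBytes,
  nodefsInodesFree:      .node.fs.inodesFree,
  imagefsAvailableBytes: .node.runtime.imageFs.availableBytes,
  imagefsInodesFree:     .node.runtime.imageFs.inodesFree
}'
```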
Each of the above signals supports either a literal or a percentage-based value.
The percentage-based value is calculated relative to the total capacity
@ -65,7 +56,7 @@ memory is reclaimable under pressure.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using
cAdvisor. `kubelet` does not care about any other filesystems. Any other types
of configurations are not currently supported by the kubelet. For example, it is
_not OK_ to store volumes and logs in a dedicated `filesystem`.
In future releases, the `kubelet` will deprecate the existing [garbage
collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
@ -83,9 +74,7 @@ where:
* `eviction-signal` is an eviction signal token as defined in the previous table.
* `operator` is the desired relational operator, such as `<` (less than).
* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.
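Put together, a threshold reads as `<eviction-signal><operator><quantity>`, and several thresholds can be passed as a comma-separated list. A minimal sketch (the values are illustrative only):
```bash
# Quote the value so the shell does not treat '<' as a redirection.
kubelet --eviction-hard='memory.available<500Mi,nodefs.available<10%'
```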
For example, if a node has `10Gi` of total memory and you want to trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as
@ -108,12 +97,9 @@ termination.
To configure soft eviction thresholds, the following flags are supported:
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a corresponding grace period would trigger a Pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
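A minimal sketch of combining these three flags on the `kubelet` command line (the specific values are illustrative assumptions):
```bash
# Evict only if memory.available stays below 1.5Gi for a full 1m30s,
# and cap graceful termination of evicted Pods at 60 seconds.
kubelet --eviction-soft='memory.available<1.5Gi' \
        --eviction-soft-grace-period='memory.available=1m30s' \
        --eviction-max-pod-grace-period=60
```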
#### Hard Eviction Thresholds
@ -124,8 +110,7 @@ with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met would trigger a Pod eviction.
The `kubelet` has the following default hard eviction threshold:
@ -150,10 +135,10 @@ reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` continues to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
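One way to observe these conditions as they change is to print a node's conditions with `kubectl`; a sketch (replace `<node-name>`):
```bash
# Print each condition type (MemoryPressure, DiskPressure, Ready, ...) with its status.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```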
@ -168,8 +153,7 @@ as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.
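A minimal sketch of setting this flag (the five-minute value is an illustrative assumption):
```bash
# Keep a pressure condition set until the threshold has not been met for 5 minutes.
kubelet --eviction-pressure-transition-period=5m0s
```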
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
@ -207,17 +191,8 @@ then by [Priority](/docs/concepts/configuration/pod-priority-preemption/), and t
As a result, `kubelet` ranks and evicts Pods in the following order:
* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last. `Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.
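Because this ordering hinges on a Pod's quality of service class, it can be useful to check which class a Pod was assigned; a sketch:
```bash
# Prints Guaranteed, Burstable or BestEffort for the named Pod.
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
```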
If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
@ -228,6 +203,7 @@ that consumes the largest amount of disk and kills those first.
#### With `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.
@ -235,13 +211,13 @@ If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable
#### Without `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
### Minimum eviction reclaim
In certain scenarios, eviction of Pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition, eviction of resources like `disk` is time consuming.
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
@ -268,10 +244,10 @@ The node reports a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.
| Node Condition | Scheduler Behavior |
| ------------------| ----------------------------------------------------|
| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
| `DiskPressure` | No new Pods are scheduled to the node. |
## Node OOM Behavior
@ -280,11 +256,11 @@ the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respon
The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the Pod.
| Quality of Service | oom_score_adj |
|--------------------|-----------------------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
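As a worked illustration of the `Burstable` row (the request and capacity figures are assumptions):
```bash
# oom_score_adj = min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
memoryRequestBytes=$((4 * 1024 * 1024 * 1024))           # Pod requests 4Gi of memory
machineMemoryCapacityBytes=$((16 * 1024 * 1024 * 1024))  # node has 16Gi of memory
adj=$((1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes))
[ "$adj" -lt 2 ] && adj=2       # max(2, ...)
[ "$adj" -gt 999 ] && adj=999   # min(..., 999)
echo "oom_score_adj=$adj"       # prints oom_score_adj=750
```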
If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory it's using on the node, and then adds the `oom_score_adj` to get an
@ -325,10 +301,7 @@ and trigger eviction assuming those Pods use less than their configured request.
### DaemonSet
As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
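For example, a hypothetical high-priority class that a DaemonSet's Pod template could reference (the name and value here are assumptions, not recommendations):
```bash
# Create a high PriorityClass, then reference it from the DaemonSet's Pod template.
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high          # hypothetical name
value: 1000000                  # larger value means higher priority
globalDefault: false
description: "Keeps DaemonSet Pods from being evicted ahead of lower-priority Pods."
EOF
# In the DaemonSet manifest, set:
#   spec.template.spec.priorityClassName: daemonset-high
```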
## Deprecation of existing feature flags to reclaim disk
@ -338,15 +311,15 @@ there are sufficient resources, specify a lower or default priorityClass.
As disk based eviction matures, the following `kubelet` flags are marked for deprecation
in favor of the simpler configuration supported around eviction.
| Existing Flag | New Flag |
| ------------------------------------------ | ----------------------------------------|
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
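As a rough sketch of the direction of this migration (the flag values and the mapping shown are assumptions, not exact equivalences):
```bash
# Express disk housekeeping through eviction flags instead of the deprecated
# --low-diskspace-threshold-mb / --outofdisk-transition-frequency pair.
kubelet --eviction-hard='nodefs.available<1Gi' \
        --eviction-minimum-reclaim='nodefs.available=500Mi' \
        --eviction-pressure-transition-period=5m0s
```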
## Known issues