From 0bd5266e5a3eef3104f0cf972cb719466c79eee6 Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Wed, 8 Feb 2017 15:39:49 -0800 Subject: [PATCH 01/10] Phase 2 of node allocatable Signed-off-by: Vishnu Kannan --- .../design-proposals/kubelet-eviction.md | 16 ++ .../design-proposals/node-allocatable.md | 239 ++++++++++++++---- .../design-proposals/node-allocatable.png | Bin 17673 -> 0 bytes 3 files changed, 199 insertions(+), 56 deletions(-) delete mode 100644 contributors/design-proposals/node-allocatable.png diff --git a/contributors/design-proposals/kubelet-eviction.md b/contributors/design-proposals/kubelet-eviction.md index 233956b82..010415597 100644 --- a/contributors/design-proposals/kubelet-eviction.md +++ b/contributors/design-proposals/kubelet-eviction.md @@ -425,6 +425,22 @@ from placing new best effort pods on the node since they will be rejected by the On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from placing **any** new pods on the node since they will be rejected by the `kubelet` in admission. +## Enforcing Node Allocatable + +To enforce [Node Allocatable](./node-allocatable.md), Kubelet primarily uses cgroups. +However `storage` cannot be enforced using cgroups. + +Once Kubelet supports `storage` as an `Allocatable` resource, Kubelet will perform evictions whenever the total storage usage by pods exceed node allocatable. + +The trigger threshold for storage evictions will not be user configurable for the purposes of `Allocatable`. +Kubelet will evict pods once the `storage` usage is greater than or equal to `98%` of `Allocatable`. +Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby brining down usage to `93%` of `Allocatable`. +These thresholds apply for both `storage` `capacity` and `inodes`. + +*Note that these values are subject to change based on feedback from production.* + +If a pod cannot tolerate evictions, then ensure that a request is set and it will not exceed `requests`. + ## Best Practices ### DaemonSet diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 12f244eb3..c66e46443 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -1,40 +1,24 @@ # Node Allocatable Resources -**Issue:** https://github.com/kubernetes/kubernetes/issues/13984 +### Authors: timstclair@, vishh@ ## Overview -Currently Node.Status has Capacity, but no concept of node Allocatable. We need additional -parameters to serve several purposes: +Kubernetes nodes typically run many OS system daemons in addition to kubernetes daemons like kubelet, runtime, etc. and user pods. +Kubernetes assumes that all the compute resources available, referred to as `Capacity`, in a node are available for user pods. +In reality, system daemons use non-trivial amoutn of resources and their availability is critical for the stability of the system. +To address this issue, this proposal introduces the concept of `Allocatable` which identifies the amount of compute resources available to user pods. +Specifically, the kubelet will provide a few knobs to reserve resources for OS system daemons and kubernetes daemons. -1. Kubernetes metrics provides "/docker-daemon", "/kubelet", - "/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage - patterns and detecting regressions. Eventually we want to cap system component usage to a certain - limit / request. 
However this is not currently feasible due to a variety of reasons including: - 1. Docker still uses tons of computing resources (See - [#16943](https://github.com/kubernetes/kubernetes/issues/16943)) - 2. We have not yet defined the minimal system requirements, so we cannot control Kubernetes - nodes or know about arbitrary daemons, which can make the system resources - unmanageable. Even with a resource cap we cannot do a full resource management on the - node, but with the proposed parameters we can mitigate really bad resource over commits - 3. Usage scales with the number of pods running on the node -2. For external schedulers (such as mesos, hadoop, etc.) integration, they might want to partition - compute resources on a given node, limiting how much Kubelet can use. We should provide a - mechanism by which they can query kubelet, and reserve some resources for their own purpose. +By explicitly reserving compute resources, the intention is to avoid overcommiting the node and not have system daemons compete with user pods. +The resources available to system daemons and user pods will be capped based on user specified reservations. -### Scope of proposal - -This proposal deals with resource reporting through the [`Allocatable` field](#allocatable) for more -reliable scheduling, and minimizing resource over commitment. This proposal *does not* cover -resource usage enforcement (e.g. limiting kubernetes component usage), pod eviction (e.g. when -reservation grows), or running multiple Kubelets on a single node. +If `Allocatable` is available, the scheduler use that instead of `Capacity`, thereby not overcommiting the node. ## Design ### Definitions -![image](node-allocatable.png) - 1. **Node Capacity** - Already provided as [`NodeStatus.Capacity`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus), this is total capacity read from the node instance, and assumed to be constant. @@ -89,12 +73,7 @@ The flag will be specified as a serialized `ResourceList`, with resources define --kube-reserved=cpu=500m,memory=5Mi ``` -Initially we will only support CPU and memory, but will eventually support more resources. See -[#16889](https://github.com/kubernetes/kubernetes/pull/16889) for disk accounting. - -If KubeReserved is not set it defaults to a sane value (TBD) calculated from machine capacity. If it -is explicitly set to 0 (along with `SystemReserved`), then `Allocatable == Capacity`, and the system -behavior is equivalent to the 1.1 behavior with scheduling based on Capacity. +Initially we will only support CPU and memory, but will eventually support more resources like [local storage](#phase-3) and io proportional weights to improve node reliability. #### System-Reserved @@ -102,48 +81,196 @@ In the initial implementation, `SystemReserved` will be functionally equivalent [`KubeReserved`](#kube-reserved), but with a different semantic meaning. While KubeReserved designates resources set aside for kubernetes components, SystemReserved designates resources set aside for non-kubernetes components (currently this is reported as all the processes lumped -together in the `/system` raw container). +together in the `/system` raw container on non-systemd nodes). -## Issues +## Recommended Cgroups Setup + +Following is the recommended cgroup configuration for Kubernetes nodes. +All OS system daemons are expected to be placed under a top level `SystemReserved` cgroup. 
+`Kubelet` and `Container Runtime` are expected to be placed under `KubeReserved` cgroup. +The reason for recommending placing the `Container Runtime` under `KubeReserved` is as follows: + +1. A container runtime on Kubernetes nodes is not expected to be used outside of the Kubelet. +1. It's resource consumption is tied to the number of pods running on a node. + +Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individally track their usage. + +/ (Cgroup Root) +. ++..systemreserved or system.slice (`SystemReserved` enforced here *optionally* by kubelet) +. . .tasks(sshd,udev,etc) +. +. ++..kubereserved or kube.slice (`KubeReserved` enforced here *optionally* by kubelet) +. . +. +..kubelet +. . .tasks(kubelet) +. . +. +..runtime +. .tasks(docker-engine, containerd) +. +. ++..kubepods or kubepods.slice (Node Allocatable enforced here by Kubelet) +. . +. +..PodGuaranteed +. . . +. . +..Container1 +. . . .tasks(container processes) +. . . +. . +..PodOverhead +. . . .tasks(per-pod processes) +. . ... +. . +. +..Burstable +. . . +. . +..PodBurstable +. . . . +. . . +..Container1 +. . . . .tasks(container processes) +. . . +..Container2 +. . . . .tasks(container processes) +. . . . +. . . ... +. . . +. . ... +. . +. . +. +..Besteffort +. . . +. . +..PodBesteffort +. . . . +. . . +..Container1 +. . . . .tasks(container processes) +. . . +..Container2 +. . . . .tasks(container processes) +. . . . +. . . ... +. . . +. . ... + +`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically. + +`kubepods` cgroups will be created by kubelet automatically if it is not already there. If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. +By default, Kubelet will `mkdir` `/kubepods` cgroup directly via cgroupfs. + +#### Containerizing Kubelet + +If Kubelet is managed using a container runtime, have the runtime create cgroups for kubelet under `kubereserved`. + +### Metrics + +Kubelet identifies it's own cgroup and exposes it's usage metrics via the Summary metrics API (/stats/summary) +With docker runtime, kubelet identifies docker runtime's cgroups too and exposes metrics for it via the Summary metrics API. +To provide a complete overview of a node, Kubelet will expose metrics from cgroups enforcing `SystemReserved`, `KubeReserved` & `Allocatable` too. + +## Relationship with Kubelet Evictions + +To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. +Together, evictions and node allocatable help improve node stability. + +As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds. + +From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods are not expected to exceed `Allocatable`. +Memory and CPU limits are enforced using cgroups, but there exists no easy means to enforce storage limits though. Enforcing storage limits using Linux Quota is not possible since it's not hierarchical. +Once storage is supported as a resource for `Allocatable`, Kubelet has to perform evictions based on `Allocatable` in addition to `Capacity`. +More details on [evictions here](./kubelet-eviction.md#enforce-node-allocatable). 
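To make the relationship above concrete, the following is a minimal sketch of an `Allocatable`-based eviction check for a resource that cannot be capped with cgroups (storage). It is illustrative only: the type and function names are hypothetical, and the real kubelet works with `resource.Quantity` values rather than raw byte counts.

```go
package main

import "fmt"

// nodeResources is a hypothetical view of a node; real Kubernetes resource
// lists use resource.Quantity rather than raw byte counts.
type nodeResources struct {
	capacity    int64 // total storage on the node, in bytes
	allocatable int64 // capacity minus kube/system reservations, in bytes
}

// shouldEvictForStorage returns true once the summed storage usage of all
// pods reaches the storage Allocatable. Storage cannot be bounded with
// cgroups, so crossing this line has to be handled by evicting pods.
func shouldEvictForStorage(node nodeResources, podUsage []int64) bool {
	var total int64
	for _, u := range podUsage {
		total += u
	}
	return total >= node.allocatable
}

func main() {
	node := nodeResources{capacity: 100 << 30, allocatable: 90 << 30}
	used := []int64{50 << 30, 45 << 30} // two pods using 50Gi and 45Gi
	fmt.Println(shouldEvictForStorage(node, used)) // true: 95Gi >= 90Gi
}
```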
+ +Note that even if `KubeReserved` & `SystemReserved` is enforced, kernel memory will still not be restricted and so kubelet will have to continue performing evictions based on overall node usage. + +## Implementation Phases + +### Phase 1 - Introduce Allocatable to the system without enforcement + +**Status**: Implemented v1.2 + +In this phase, Kubelet will support specifying `KubeReserved` & `SystemReserved` resource reservations via kubelet flags. +The defaults for these flags will be `""`, meaning zero cpu or memory reservations. +Kubelet will compute `Allocatable` and update `Node.Status` to include it. +The scheduler will use `Allocatable` instead of `Capacity` if it is available. + +### Phase 2 - Enforce Allocatable on Pods + +**Status**: Targetted for v1.6 + +In this phase, Kubelet will automatically create a top level cgroup to enforce Node Allocatable on all user pods. +Kubelet will support specifying the top level cgroups for `KubeReserved` and `SystemReserved` and support *optionally* placing resource restrictions on these top level cgroups. +Users are expected to specify `KubeReserved` and `SystemReserved` based on their deployment requirements. + +Resource requirements for Kubelet and the runtime is typically proportional to the number of pods running on a node. +Once a user identified the maximum pod density for each of their nodes, they will be able to compute `KubeReserved` using [this performance dashboard](http://node-perf-dash.k8s.io/#/builds). +[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard has to be interpreted. +Note that this dashboard provides usage metrics for docker runtime only as of now. + +New flags introduced in this phase are as follows: + +1. `--enforce-node-allocatable=[pods][,][kube-reserved],[system-reserved]` + * This flag will default to `pods` in v1.6. + * Nodes have to be drained prior to upgrading to v1.6. + * kubelet's behavior if a node has not been drained prior to upgrades is as follows + * If a pod has a `RestartPolicy=Never`, then mark the pod as `Failed` and terminate its workload. + * All other pods that are not parented by new Allocatable top level cgroup will be restarted. + * Users intending to turn off this feature can set this flag to `""`. + * `--kube-reserved` and `--system-reserved` flags are expected to be set by users for this flag to have any meaningful effect on node stability. + * By including `kube-reserved` or `system-reserved` in this flag's value, and by specifying the following two flags, Kubelet will attempt to enforce the reservations specified via `--kube-reserved` & `system-reserved` respectively. + +2. `--kube-reserved-cgroup=` + * This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation. + +3. `--system-reserved-cgroup=` + * This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation. + +#### Rollout details + +This phase is expected to improve Kubernetes node stability. However it requires users to specify non-default values for `--kube-reserved` & `--system-reserved` flags though. + +The rollout of this phase has been long due and hence we are attempting to include it in v1.6 + +Since `KubeReserved` and `SystemReserved` continue to have `""` as defaults, the node's `Allocatable` does not change automatically. 
+Since this phase requires node drains (or pod restarts/terminations), it is considered disruptive to users. + +To rollback this phase, set `--enforce-node-allocatable` flag to `""`. + +This phase might cause the following symptoms to occur if `--kube-reserved` and/or `--system-reserved` flags are also specified. + +1. OOM kills of containers and/or evictions of pods. This can happen primarily to `Burstable` and `BestEffort` pods since they can no longer use up all the resource available on the node. + +##### Proposed Timeline + +02/14/2017 - Discuss the rollout plan in sig-node meeting +02/15/2017 - Flip the switch to enable pod level cgroups by default +enable existing experimental behavior by default +02/21/2017 - Assess impacts based on enablement +02/27/2017 - Kubernetes Feature complete (i.e. code freeze) +03/01/2017 - Send an announcement to kubernetes-dev@ about this rollout along with rollback options and potential issues. Recommend users to set kube and system reserved. +03/22/2017 - Kubernetes 1.6 release + +### Phase 3 - Metrics & support for Storage + +*Status*: Targetted for v1.7 + +In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via Summary metrics API. +`Storage` will also be introduced as a reservable resource in this phase. +Support for evictions based on Allocatable will be introduced in this phase. + +## Known Issues ### Kubernetes reservation is smaller than kubernetes component usage **Solution**: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved resources and hope for the best. If the node usage is less than Allocatable, there will be some room -for overflow and the node should continue to function. If the node has been scheduled to capacity +for overflow and the node should continue to function. If the node has been scheduled to `allocatable` (worst-case scenario) it may enter an unstable state, which is the current behavior in this situation. +A recommended alternative is to enforce KubeReserved once Kubelet supports it (Phase 2). In the [future](#future-work) we may set a parent cgroup for kubernetes components, with limits set according to `KubeReserved`. -### Version discrepancy - -**API server / scheduler is not allocatable-resources aware:** If the Kubelet rejects a Pod but the - scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop - scheduling a Pod onto the node only to have Kubelet repeatedly reject it. To avoid this situation, - we will do a 2-stage rollout of `Allocatable`. In stage 1 (targeted for 1.2), `Allocatable` will - be reported by the Kubelet and the scheduler will be updated to use it, but Kubelet will continue - to do admission checks based on `Capacity` (same as today). In stage 2 of the rollout (targeted - for 1.3 or later), the Kubelet will start doing admission checks based on `Allocatable`. - -**API server expects `Allocatable` but does not receive it:** If the kubelet is older and does not - provide `Allocatable` in the `NodeStatus`, then `Allocatable` will be - [defaulted](../../pkg/api/v1/defaults.go) to - `Capacity` (which will yield today's behavior of scheduling based on capacity). - ### 3rd party schedulers The community should be notified that an update to schedulers is recommended, but if a scheduler is not updated it falls under the above case of "scheduler is not allocatable-resources aware". -## Future work - -1. Convert kubelet flags to Config API - Prerequisite to (2). 
See - [#12245](https://github.com/kubernetes/kubernetes/issues/12245). -2. Set cgroup limits according KubeReserved - as described in the [overview](#overview) -3. Report kernel usage to be considered with scheduling decisions. - diff --git a/contributors/design-proposals/node-allocatable.png b/contributors/design-proposals/node-allocatable.png deleted file mode 100644 index d6f5383e7adf8f417362c63791d64beb27cee5e9..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 17673 zcmb`v2UJsA+b$Y)D+nqgMNoy$jN7sG%je zqlkbI1tHXgYAAtF6CfnH0r&Tv|GW3U_l$eaS!3YHT5GPk=6u(jZ+V{gHP+Nv?=;&5 zHUI!{8lGXGLoPcW}&u%JBv0G;n3o!ge7V=FmwmKRNg?+SA=k7uuLT{_G9 z$5FFqQv!b+eRgp2*_ru|j;_w}f}>_;hgS}MGJi-j8y@~R%h~=QH!MVPH1f!C1Cpk@ zi6PlDSSdq&b9|Kl(5i;F9C19icw z#U~A0ledfpMx4_VdRE;X2UxWf08a_^B8z4NIHma7B^M2Yy&=8g1o8l0VMcW5PRC{UI;gq6noq9-whQf={5&*#3o8d_^`^i9MPRxhG^{eBM#I3dX z7+QC){RB0jWV4>pUo-7Akkd&=yu)MC<$D2M*8L3s@PY9Skj)J z+;}aFO~;^*AS67a#>wnk>5m9;kH3#^8J<)ZLdePjXT7Zr`fWDx9H7^XA64IDVb&92 zO_P3?cO;2&AT5};9j~kO2aTAEd7o+wT8+p^bNA5J{51_!idTq~htE@L(164_PMM?Z z8IO9aY5BcDD~4lz)zKX~YZV|@v8xCeU50bphb%}7zRpW2aP(O0Lv7|wq4V4 z{R1$SCzA5fkezCHo`0qJn!Gx`7Fy>v-Cxd%&2vk97V-K)r(Bm2DI)h|oA_dhF8uvW z*!o>zFn&)!LC!w)XO$>G;oOsH3?e&TwzWE10 zl28QIEZczZ_!^38;Y-o=V|AO1ttO9rT#rtH&RwNgc(hwpe=9|ERg|d5t}yTEA6d3g zim*24Y^~R6Q<#C6ylC%QHfA`047C{a>b8jzUtF+SlB?#pkE-WNzj&Q7KZVfdxaw}J zuVMWe=vGBxp|wPv1^`&wbVllA2jQWoUYeZtL#^w)Pi!j`&oMVDQ4zAfn)qN)Q9ADVB~3gQeNUDqlL5q^GuapWrslLO-ou>*fUi_SIuffuQ} zk{fgk&>jods?d~0Rh$K`1f$B?SMI!~JFQ@VE;3&40{eqKXlF#dwmHd-YI&Gnz4sHY ze}U8CpTCo9wePD!qH@_Id_R)HzsQj+f`Gm8c=t#l$HgSn?HD;|mjI%B-FPdw273gs zKZq8xp~@H{KowWDoQIBw5?#rCa_uIulICC%ha_=GOO|LKjq_gImc5Z%VuRU;Lr}9D z;z>N(U3zT!^E|>X_rYCf$*l~OpX*dTI$#CLd|;%&Nq+!8SHcQd+e4?u!FL>n$NDkn z*-0y~S^D=sMb`aZN0L)wXoEAAsb1>SMQIbmEbo3ASl?E)+6`FbJS}uC0okBpsiA_p zTIL|@vG#5UTWGFO3ZHGEBt81Zr=<&cir@^EeU+fp{fpqaN9u$HcG|ouFeMM+Jf!P# zUQH{Hv5u#VS1#j$0%J=H9As7 z>x}S4>ZM4TGRa05sV)oN>l8Fj+0I^c%OxB;5R%i-pVK{x;4XM5UD_%NCHdUYFv5CT zaO&du1&?LVZ+Vl=DQfPvh;!@hFW?pNLOQX15m`HtgL1OEJQmPj8`z9D9CW=aJESpiTofUKR=tAC{>Z8ns&+ZUrnX4|+o9M8T z^DSA&DU@;n@aL_&j$fnXY!P!o7a{SSh+gNHY8!+PkbX*@8jG}Tz_aL_MwEor)CB4V-Ibzo_pFn|*QQ1G|dkvBnbvo5Y+$i-+8sGiZ zHhwHey$-240-%re+<|=bt_at{Xubr6s9ZqhEtSsY#@YOd#vv3xiJ0NffOy&|!Jclh>LnD^Xw%L`soB4+DlxLNkviq9Np z+M@rjvqih28l25dcUIStH*wR7(kz@h}-^EmQ@~!oG+N zb^mjxK}-%*{=46)*@w3juK`S*j$ZNt0AfzB{fN<&Ulg+^i$hYb?;h!B%PbUzU2t__ z-g5oeA$6=qO*3NDKYy(<>eut`y{}mHyEjhU8o701{~|QU!YWjEW#M_Gdp#@))Ou$G zzqP+g82{R&+tjkxqBt=s+jW4{gfGS@|Jts1 zj1QR~4y|{M-SFNC?Scule`dR7CE|suIhz`t-0&8Dy;0_tEN%K%d z&G)~FBI8%$rOAJe0`D%#Xx;6?6SF1@u1_ zsq4X#{djGDG6vn2Bf%yrE_W&n9*4f)wWxCxzg=;4x$WFPE)m)i)`z-A;0?ac`eU_<0E{~+XTmV%G~k+R7@}fL=n;;sO?d5=GJ90j z+-BDK$9>A2H_%aC=v*6xT0IA{<1?CmjiHg`y_*|9^#uSL_3iei)JJ+7_~*Q@f_&SY z>U@TwwpL)k%Y7EqgJ=5ShT@w9Qq5dw1)bP`T=!F+ywp9rCo9o|VNo`YizDWI;!6hL zr?D^CGjFP_1tu)p1-ZG5kNb*ax$iItJ~X}ZtA+ANx?XU6W&%H!R7RT|;ljUkb~^mxvij!Q@czO5ukt`r9Ab0}=Kc?2ocK}T=~sOyrYUoFYD^$M)l2_PRXx+-M*v_ zvAw`@@=47r4kFEDg)9&~NyhJdH_FiV57`yTOj1^z={ck=7vv-WSJ+oZM#dQ_Ph$DEZ1_R-AB5f1C_||@A-OyG>YR!dR&iIr3wgIM*xLZg(`OkXJjOazUF zD5ypOQ)ax*WsC%{M}4QRr8kzgPF{?mDWZ*KjH&%v6zN~Z!0uIAGbKx&a~hoRXdqX{ z7yonjzB^OvZihfJbTh4ySNn&RQ|23Q=OIY0*b{&nQ|1tFIV^e9{Mm_q-0nIVeuY@3 zV?Y5d2dHrwb{-$rjdh(7_&ZDjA;ptF3x-Zs)g`i14OYP0_ZOq2qbVwnF?0$cQVcNM~hr zoI3%@XOwxM*$HpAvR~rdZ|myS?uiR=v;EtZrfNJuiu+1^I;FVs^*2A%6R#q8aUNYoNb&r{*sBXxjiu`uyq**Ayw52RZ^ zU;6DYOk1|qcb^;tY7cQEb8g(zW`9PB*jZWdWJr9a&Dh3P>^J}FnI@Sed+Rly9OP@2|EELgOEoqK&((Pj*MGqc>ej7}R{YsL)msCu)R@;-eNlunZ zf72}$(pIf6f0BZVTsv?1Z8oCv0W0QqjOO%gO%Nx!bWGxIx;fvUm;2e2_K^ 
z76u%0p`p>=491cIuP%=9D>`QWzV;v03kUotQ_BBc^};`e4GhT-uwk(k{XMBg?P(>&`MD_e>Qq&FYM}I@H5iZ*1U18a;kgT}u$cXWxpbZ*O2j=t=Fs01 zuCU4h$NO&)JY{+ZuL`|E9o%XSIiN9^B_!XoOg}SpmfBME-H+pdc%>VFZNweze%iyV zdBW{+25MpXo~l2`DN_>>e=i?lN^NZk?X=PKVU(amk6b-oJ%3<95BR7}vh(U?ua-v5 z>T4>LSnUOk`7E^h3=i~OxK|k?d8>PzsNcFtV4+t&c-FPT0m=RPtdIvpb&6hP`}{_` z87NdSJlkq*`AwWYNzjKP7dmQ^X5Wf0UxWoCS9`E#hQm4fcBY}*t9Apm%s3^;(Jdu8 zwC{X3d9Zs=5Nbm~h2stNGyb z!C4MoO64GZVP$^yOWbz}8}6ueB_@ZuMq;)1QH`!QohiLt@|fc#X}A8XJA!w%)rE)j z#%5BjMnW;Js_&$5^cBNTZkMMkzR3dRZ4Ex5A*4MCH zBYH<}CBEtDAjBuK($81R8P@-4PAJE18DkZg+_s@QuqC^B_+HTc-Ed-K+hlY=9R zHB(uGl{o+Gg}t68UU$f0ix}z}?M+?Q`H0Ohb*~TTGSEi256I=|Wtf`O8T7zpgs47( z&hAyy8gDsJzbtWy59>PWmR}_ZZrpw{05Og{<3{n8AHC7E)S<3sHtl}&DV&)RL#|_I zWXIV@>UF!@@J=)ojpz@*jhF;oCdCS&eNtsB9Vvu9=RKB~+co7;MK5iGsH z8(gn5vMEiPYbw1wM{@uZsD$ynGNo)(&R3=2{T`WXp3-0Zh3}3is3~Mr;Y|j98jh@U z-U+8!0Dqm+;$bdq#tXPsi>`qB{UO<+5f5o^NLTzrM26%Uo?OIm(#onhE$x*%HqS<` z?t?4Uw((_co(vIVJnLG=xO0V=2NHMd8D|*NeRgjq%F~Hb=<3pi&(Bqj4_6iy&CnFW2e$S~A5jk{(XzkeDd%K#18t zCYf$KfNdOZKPp>5LXy3-W*a8TL%n6JO)oIZM{n4!?q}p)B8K)6x|OwDdH#w*=bD99 zN2L)vI)Co~aiV2=y-#D%^7p{f?;qO7=}JqH&i1@up%cO(6)5`4Rq&@S`HK%4mx7Ws z$u&Dp_ZH`0cnF$Y=Ko~w7oOLu#kB1oSvZ5`?@pfc-A)uZ$hT0Oy7+L4U3gQ*`Sd}` z2xSueQ1aeEEJbo)rdBYr;V*O2RE)1*bFN!`jqN-2BwI?s{>S^z5kt%-07fy~om*|? zlxV*~O2S0LtV`9Y&}>__^=Gk^$EObb@CbBuS^5Q13p&0|=wY&G%v*R~Ph^^gy#=3o zJKi!a5&XgMdGUwyT(0hb7)@C!+jCW-rp3slE3*6nYPt?n5W~y;BCS0*{DSR~_7B(I z>5T4W^Hn>W^K~y;PD6$`NfGn&ZDaUGM{|Xmb@@h2xEOT3YIe_}QR@t|?&Qe>Oc&0E zud&?Dzhm~J@GKjN!?}V_v=zp))YZLv<7}`VFkCiqyA1uO&~pR7k#;S8weWSToac#5 zxvX}>+)N{i;UW8sI?mFbyLJXzuhA$tzt z?oOHh(0N6vuggx+bDmBXd`4{Gkn?WI+eK)N3iU9X zEM=A@+=kN1biC?$ZGlpEh6oPJNlkro^~xj*jNOb}p3CU+;HGRTAJ;BIO~PlV`oYy- zn(U3R#^S+l$tK6!uc|70y{I?&KQ;FdfH<8IN8OAbwFDHgw7sVD@y29MdAh@Ffmo~J z3Tb0(StW^7T{#0^X>d@_yq>D_z^DEDDASSP_#8XNlEo~_BFSa$o&%kcCR%r|`$kq4 zV-hh(B0a(V<&ez3byE_`+CB4sb+u5Uf=Z-}zrc%TXt>b!Sugyc*oqrP-f3L_GHrW~ z?(EhzP6ZK8=5tR5$*UEyL{5;V_T5U9lD|iq} zs~%Iw3pR_Vo|8Fs8v5=rkg`4_!`ik=ZC!}_Vq+s9l#N@r>G|lEQs9KH-W9@E)Yp$* zLgO&QT^^V9tYW$cAAQZoVqDR{w=IRpB-Rw$saqG*KFRcxaie(X!B|E3o zg8MVA6(D~~6K}47I~fdbkjlguEmuFHNargpM$>1pP_q`1JuKC`ClY(c(lX|%i_Eeb z=WrzxqOEa zma7Y^69pi9kP?pG-xX@=Uh+N)9WVEjZ!aO>WXU3aJ8IqGJV@;^s6x8GDl+?PfsOFf zJV!fT+Uj;09MtXMm0f(0XvWs?zIZO8_4K4!472$ zcM?lxldyORf|00|46w@uWGHap0s@(8T;#6x7pj+1)#{HVzLj)=wQMdNO7`+~ZT5@* zb|(L;KmG55+5gNfU!$CiaO#{>mXLmEF+R|X!nc?AC-wuFan2>8h)$+SgWvlHsKwax zY9^y^0uEC1&XV7u`}xiI;D)05XbO|M@ui)yD5 zmVik%gH;PwC2sPz6h|;y-^1=YCf;}OX!YlI%nO?#Bl3g#O#SM4AG?oUnCPFQJg)X4uHpbmVAN^8id^p$My z+5VGXt8^hIl|4d7)_tN;K2 From f1d5e24d93bea2ddcf8533f70b8261affd0ad746 Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Fri, 10 Feb 2017 10:53:25 -0800 Subject: [PATCH 02/10] clarify node allocatable in the presence of eviction thresholds Signed-off-by: Vishnu Kannan --- .../design-proposals/node-allocatable.md | 49 ++++++++++++------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index c66e46443..5abf09afe 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -6,7 +6,7 @@ Kubernetes nodes typically run many OS system daemons in addition to kubernetes daemons like kubelet, runtime, etc. and user pods. Kubernetes assumes that all the compute resources available, referred to as `Capacity`, in a node are available for user pods. -In reality, system daemons use non-trivial amoutn of resources and their availability is critical for the stability of the system. 
+In reality, system daemons use non-trivial amount of resources and their availability is critical for the stability of the system. To address this issue, this proposal introduces the concept of `Allocatable` which identifies the amount of compute resources available to user pods. Specifically, the kubelet will provide a few knobs to reserve resources for OS system daemons and kubernetes daemons. @@ -50,7 +50,7 @@ type NodeStatus struct { Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be: ``` - [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] + [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] - [Hard-Eviction-Threshold] ``` The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet @@ -83,6 +83,33 @@ designates resources set aside for kubernetes components, SystemReserved designa aside for non-kubernetes components (currently this is reported as all the processes lumped together in the `/system` raw container on non-systemd nodes). +## Kubelet Evictions Tresholds + +To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. +Together, evictions and node allocatable help improve node stability. + +As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds. +More deails in [this doc](./kubelet-eviction.md#enforce-node-allocatable) + +From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot to exceed `Allocatable`. +Memory and CPU limits are enforced using cgroups, but there exists no easy means to enforce storage limits though. +Enforcing storage limits using Linux Quota is not possible since it's not hierarchical. +Once storage is supported as a resource for `Allocatable`, Kubelet has to perform evictions based on `Allocatable` in addition to `Capacity`. + +Note that eviction limits are enforced on pods only and system daemons are free to use any amount of resources unless their reservations are enforced. + +Here is an example to illustrate Node Allocatable for memory: + +Node Capacity is `32Gi`, kube-reserved is `2Gi`, system-reserved is `1Gi`, eviction-hard is set to `<100Mi` + +For this node, the effective Node Allocatable is `28.9Gi` only; i.e. if kube and system components use up all their reservation, the memory available for pods is only `28.9Gi` and kubelet will evict pods once overall usage of pods crosses that threshold. + +If we enforce Node Allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi` in which case evictions will not be performed unless kernel memory consumption is above `100Mi`. + +In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Tresholds`. + +However, the scheduler is not expected to use more than `28.9Gi` and so `Node Allocatable` on Node Status will be `28.9Gi`. + ## Recommended Cgroups Setup Following is the recommended cgroup configuration for Kubernetes nodes. @@ -163,20 +190,6 @@ Kubelet identifies it's own cgroup and exposes it's usage metrics via the Summar With docker runtime, kubelet identifies docker runtime's cgroups too and exposes metrics for it via the Summary metrics API. 
To provide a complete overview of a node, Kubelet will expose metrics from cgroups enforcing `SystemReserved`, `KubeReserved` & `Allocatable` too. -## Relationship with Kubelet Evictions - -To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. -Together, evictions and node allocatable help improve node stability. - -As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds. - -From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods are not expected to exceed `Allocatable`. -Memory and CPU limits are enforced using cgroups, but there exists no easy means to enforce storage limits though. Enforcing storage limits using Linux Quota is not possible since it's not hierarchical. -Once storage is supported as a resource for `Allocatable`, Kubelet has to perform evictions based on `Allocatable` in addition to `Capacity`. -More details on [evictions here](./kubelet-eviction.md#enforce-node-allocatable). - -Note that even if `KubeReserved` & `SystemReserved` is enforced, kernel memory will still not be restricted and so kubelet will have to continue performing evictions based on overall node usage. - ## Implementation Phases ### Phase 1 - Introduce Allocatable to the system without enforcement @@ -190,7 +203,7 @@ The scheduler will use `Allocatable` instead of `Capacity` if it is available. ### Phase 2 - Enforce Allocatable on Pods -**Status**: Targetted for v1.6 +**Status**: Targeted for v1.6 In this phase, Kubelet will automatically create a top level cgroup to enforce Node Allocatable on all user pods. Kubelet will support specifying the top level cgroups for `KubeReserved` and `SystemReserved` and support *optionally* placing resource restrictions on these top level cgroups. @@ -246,7 +259,7 @@ enable existing experimental behavior by default ### Phase 3 - Metrics & support for Storage -*Status*: Targetted for v1.7 +*Status*: Targeted for v1.7 In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via Summary metrics API. `Storage` will also be introduced as a reservable resource in this phase. From e16e4826395f05d13a18934857670e99b4dd0f36 Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Fri, 10 Feb 2017 14:57:16 -0800 Subject: [PATCH 03/10] add evictions based on node allocatable cgroup usage Signed-off-by: Vishnu Kannan --- contributors/design-proposals/node-allocatable.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 5abf09afe..803d8d2f0 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -110,6 +110,11 @@ In order to support evictions and avoid memcg OOM kills for pods, we will set th However, the scheduler is not expected to use more than `28.9Gi` and so `Node Allocatable` on Node Status will be `28.9Gi`. +If kube and system components do not use up all their reservation, with the above example, pods will face memcg OOM kills from the node allocatable cgroup before kubelet evictions kick in. +To better enforce QoS under this situation, Kubelet will apply the hard eviction thresholds on the node allocatable cgroup as well, if node allocatable is enforced. +The resulting behavior will be the same for user pods. 
+With the above example, Kubelet will evict pods whenever pods consume more than `28.9Gi` which will be `<100Mi` from `29Gi` which will be the memory limits on the Node Allocatable cgroup. + ## Recommended Cgroups Setup Following is the recommended cgroup configuration for Kubernetes nodes. From 37b5373b0ac38b70d4563e0bc9ae92093f55f81b Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Fri, 10 Feb 2017 17:12:46 -0800 Subject: [PATCH 04/10] adjust storage allocatable enforcement to run at 100% of allocatable Signed-off-by: Vishnu Kannan --- contributors/design-proposals/kubelet-eviction.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/contributors/design-proposals/kubelet-eviction.md b/contributors/design-proposals/kubelet-eviction.md index 010415597..15aa82809 100644 --- a/contributors/design-proposals/kubelet-eviction.md +++ b/contributors/design-proposals/kubelet-eviction.md @@ -433,13 +433,13 @@ However `storage` cannot be enforced using cgroups. Once Kubelet supports `storage` as an `Allocatable` resource, Kubelet will perform evictions whenever the total storage usage by pods exceed node allocatable. The trigger threshold for storage evictions will not be user configurable for the purposes of `Allocatable`. -Kubelet will evict pods once the `storage` usage is greater than or equal to `98%` of `Allocatable`. -Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby brining down usage to `93%` of `Allocatable`. -These thresholds apply for both `storage` `capacity` and `inodes`. +Kubelet will evict pods once the `storage` usage is greater than or equal to `Allocatable`. +Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby brining down usage to `95%` of `Allocatable`. +These thresholds apply for both storage `capacity` and `inodes`. *Note that these values are subject to change based on feedback from production.* -If a pod cannot tolerate evictions, then ensure that a request is set and it will not exceed `requests`. +If a pod cannot tolerate evictions, then ensure that requests is set and it will not exceed `requests`. ## Best Practices From bc9c4c1f0efa57cd536b9eb6754e618ed5d2cbb4 Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Fri, 10 Feb 2017 17:13:52 -0800 Subject: [PATCH 05/10] fix nits Signed-off-by: Vishnu Kannan --- contributors/design-proposals/node-allocatable.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 803d8d2f0..4203afd34 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -88,7 +88,8 @@ together in the `/system` raw container on non-systemd nodes). To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. Together, evictions and node allocatable help improve node stability. -As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds. +As of v1.5, evictions are based on overall node usage relative to `Capacity`. +Kubelet evicts pods based on QoS and user configured eviction thresholds. More deails in [this doc](./kubelet-eviction.md#enforce-node-allocatable) From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot to exceed `Allocatable`. 
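The arithmetic described in the example above can be summarized in a short sketch. The figures are the ones used in this proposal (32Gi capacity, 2Gi kube-reserved, 1Gi system-reserved, 100Mi hard eviction threshold); the code itself is illustrative and not kubelet source.

```go
package main

import "fmt"

func main() {
	// All values in Mi, matching the worked example in the proposal.
	capacity := int64(32 * 1024)      // 32Gi
	kubeReserved := int64(2 * 1024)   // --kube-reserved
	systemReserved := int64(1 * 1024) // --system-reserved
	evictionHard := int64(100)        // hard eviction threshold

	// What the scheduler sees, and the point at which kubelet starts evicting pods.
	allocatable := capacity - kubeReserved - systemReserved - evictionHard

	// Memory limit placed on the top level pods cgroup, so that the kubelet can
	// evict before the kernel memcg OOM killer fires.
	podsCgroupLimit := allocatable + evictionHard

	fmt.Printf("allocatable       = %dMi (~%.1fGi)\n", allocatable, float64(allocatable)/1024)
	fmt.Printf("pods cgroup limit = %dMi (%.0fGi)\n", podsCgroupLimit, float64(podsCgroupLimit)/1024)
}
```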
From 94ce3d9a8239ad29b0f0d96b917bfed6414f8911 Mon Sep 17 00:00:00 2001 From: Vish Kannan Date: Mon, 13 Feb 2017 15:26:20 -0800 Subject: [PATCH 06/10] Fix formatting --- contributors/design-proposals/node-allocatable.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 4203afd34..2790e628f 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -127,6 +127,7 @@ The reason for recommending placing the `Container Runtime` under `KubeReserved` 1. It's resource consumption is tied to the number of pods running on a node. Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individally track their usage. +```text / (Cgroup Root) . @@ -181,6 +182,8 @@ Note that the hierarchy below recommends having dedicated cgroups for kubelet an . . . . . ... +``` + `systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically. `kubepods` cgroups will be created by kubelet automatically if it is not already there. If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. From 6a72d51c4c3f00a38754e69a68f133ac0d2435e5 Mon Sep 17 00:00:00 2001 From: Vish Kannan Date: Mon, 13 Feb 2017 15:28:58 -0800 Subject: [PATCH 07/10] Rename kubesystem to podruntime --- contributors/design-proposals/node-allocatable.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 2790e628f..872700c31 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -135,7 +135,7 @@ Note that the hierarchy below recommends having dedicated cgroups for kubelet an . . .tasks(sshd,udev,etc) . . -+..kubereserved or kube.slice (`KubeReserved` enforced here *optionally* by kubelet) ++..podruntime or podruntime.slice (`KubeReserved` enforced here *optionally* by kubelet) . . . +..kubelet . . .tasks(kubelet) @@ -184,7 +184,7 @@ Note that the hierarchy below recommends having dedicated cgroups for kubelet an ``` -`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically. +`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. `kubepods` cgroups will be created by kubelet automatically if it is not already there. If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. By default, Kubelet will `mkdir` `/kubepods` cgroup directly via cgroupfs. 
From 04b8f209ac3bf43c7308c3e68055302877c30924 Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Tue, 14 Feb 2017 13:03:38 -0800 Subject: [PATCH 08/10] control kubepods cgroup using cgroups-per-qos flag Signed-off-by: Vishnu kannan --- .../design-proposals/node-allocatable.md | 51 ++++++++++++++----- 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 872700c31..199c16264 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -116,6 +116,24 @@ To better enforce QoS under this situation, Kubelet will apply the hard eviction The resulting behavior will be the same for user pods. With the above example, Kubelet will evict pods whenever pods consume more than `28.9Gi` which will be `<100Mi` from `29Gi` which will be the memory limits on the Node Allocatable cgroup. +## General guidelines + +System daemons are expected to be treated similar to `Guaranteed` pods. +System daemons can burst within their bounding cgroups and this behavior needs to be managed as part of kubernetes deployment. +For example, Kubelet can have its own cgroup and share `KubeReserved` resources with the Container Runtime. +However, Kubelet cannot burst and use up all available Node resources if `KubeReserved` is enforced. + +Users are adviced to be extra careful while enforcing `SystemReserved` reservation since it can lead to critical services being CPU starved or OOM killed on the nodes. +The recommendation is to enforce `SystemReserved` only if a user has profiled their nodes exhaustively to come up with precise estimates. + +To begin with enforce `Allocatable` on `pods` only. +Once adequate monitoring and alerting is in place to track kube daemons, attempt to enforce `KubeReserved` based on heuristics. +More on this in [Phase 2](#phase-2-enforce-allocatable-on-pods). + +The resource requirements of kube system daemons will grow over time as more and more features are added. +Over time, the project will attempt to bring down utilization, but that is not a priority as of now. +So expect a drop in `Allocatable` capacity over time. + ## Recommended Cgroups Setup Following is the recommended cgroup configuration for Kubernetes nodes. @@ -186,7 +204,9 @@ Note that the hierarchy below recommends having dedicated cgroups for kubelet an `systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. -`kubepods` cgroups will be created by kubelet automatically if it is not already there. If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. +`kubepods` cgroups will be created by kubelet automatically if it is not already there. +Creation of `kubepods` cgroup is tied to QoS Cgroup support which is controlled by `--cgroups-per-qos` flag. +If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. By default, Kubelet will `mkdir` `/kubepods` cgroup directly via cgroupfs. #### Containerizing Kubelet @@ -214,8 +234,11 @@ The scheduler will use `Allocatable` instead of `Capacity` if it is available. **Status**: Targeted for v1.6 -In this phase, Kubelet will automatically create a top level cgroup to enforce Node Allocatable on all user pods. 
+In this phase, Kubelet will automatically create a top level cgroup to enforce Node Allocatable across all user pods. +The creation of this cgroup is controlled by `--cgroups-per-qos` flag. + Kubelet will support specifying the top level cgroups for `KubeReserved` and `SystemReserved` and support *optionally* placing resource restrictions on these top level cgroups. + Users are expected to specify `KubeReserved` and `SystemReserved` based on their deployment requirements. Resource requirements for Kubelet and the runtime is typically proportional to the number of pods running on a node. @@ -225,15 +248,16 @@ Note that this dashboard provides usage metrics for docker runtime only as of no New flags introduced in this phase are as follows: -1. `--enforce-node-allocatable=[pods][,][kube-reserved],[system-reserved]` - * This flag will default to `pods` in v1.6. - * Nodes have to be drained prior to upgrading to v1.6. - * kubelet's behavior if a node has not been drained prior to upgrades is as follows - * If a pod has a `RestartPolicy=Never`, then mark the pod as `Failed` and terminate its workload. - * All other pods that are not parented by new Allocatable top level cgroup will be restarted. - * Users intending to turn off this feature can set this flag to `""`. - * `--kube-reserved` and `--system-reserved` flags are expected to be set by users for this flag to have any meaningful effect on node stability. - * By including `kube-reserved` or `system-reserved` in this flag's value, and by specifying the following two flags, Kubelet will attempt to enforce the reservations specified via `--kube-reserved` & `system-reserved` respectively. +1. `--enforce-node-allocatable=[pods][,][kube-reserved][,][system-reserved]` + + * This flag will default to `pods` in v1.6. + * This flag will be a `no-op` unless `--kube-reserved` and/or `--system-reserved` has been specified. + * If `--cgroups-per-qos=false`, then this flag has to be set to `""`. Otherwise its an error and kubelet will fail. + * It is recommended to drain and restart nodes prior to upgrading to v1.6. This is necessary for `--cgroups-per-qos` feature anyways which is expected to be turned on by default in `v1.6`. + * Users intending to turn off this feature can set this flag to `""`. + * Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified. + * Specifying `system-reserved` value in this flag is invalid if `--system-reserved-cgroup` flag is not specified. + * By including `kube-reserved` or `system-reserved` in this flag's value, and by specifying the following two flags, Kubelet will attempt to enforce the reservations specified via `--kube-reserved` & `system-reserved` respectively. 2. `--kube-reserved-cgroup=` * This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation. @@ -243,7 +267,8 @@ New flags introduced in this phase are as follows: #### Rollout details -This phase is expected to improve Kubernetes node stability. However it requires users to specify non-default values for `--kube-reserved` & `--system-reserved` flags though. +This phase is expected to improve Kubernetes node stability. +However it requires users to specify non-default values for `--kube-reserved` & `--system-reserved` flags though. 
The rollout of this phase has been long due and hence we are attempting to include it in v1.6 @@ -258,6 +283,7 @@ This phase might cause the following symptoms to occur if `--kube-reserved` and/ ##### Proposed Timeline +```text 02/14/2017 - Discuss the rollout plan in sig-node meeting 02/15/2017 - Flip the switch to enable pod level cgroups by default enable existing experimental behavior by default @@ -265,6 +291,7 @@ enable existing experimental behavior by default 02/27/2017 - Kubernetes Feature complete (i.e. code freeze) 03/01/2017 - Send an announcement to kubernetes-dev@ about this rollout along with rollback options and potential issues. Recommend users to set kube and system reserved. 03/22/2017 - Kubernetes 1.6 release +``` ### Phase 3 - Metrics & support for Storage From a871e58555621ce7dc1be9f56083b6a5e23ac06a Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Fri, 17 Feb 2017 13:36:56 -0800 Subject: [PATCH 09/10] fix typos in node-allocatable proposal Signed-off-by: Vishnu kannan --- .../design-proposals/kubelet-eviction.md | 7 ----- .../design-proposals/node-allocatable.md | 26 ++++++++++++------- 2 files changed, 17 insertions(+), 16 deletions(-) diff --git a/contributors/design-proposals/kubelet-eviction.md b/contributors/design-proposals/kubelet-eviction.md index 15aa82809..68b39ec13 100644 --- a/contributors/design-proposals/kubelet-eviction.md +++ b/contributors/design-proposals/kubelet-eviction.md @@ -432,13 +432,6 @@ However `storage` cannot be enforced using cgroups. Once Kubelet supports `storage` as an `Allocatable` resource, Kubelet will perform evictions whenever the total storage usage by pods exceed node allocatable. -The trigger threshold for storage evictions will not be user configurable for the purposes of `Allocatable`. -Kubelet will evict pods once the `storage` usage is greater than or equal to `Allocatable`. -Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby brining down usage to `95%` of `Allocatable`. -These thresholds apply for both storage `capacity` and `inodes`. - -*Note that these values are subject to change based on feedback from production.* - If a pod cannot tolerate evictions, then ensure that requests is set and it will not exceed `requests`. ## Best Practices diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 199c16264..276e68b7a 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -13,7 +13,7 @@ Specifically, the kubelet will provide a few knobs to reserve resources for OS s By explicitly reserving compute resources, the intention is to avoid overcommiting the node and not have system daemons compete with user pods. The resources available to system daemons and user pods will be capped based on user specified reservations. -If `Allocatable` is available, the scheduler use that instead of `Capacity`, thereby not overcommiting the node. +If `Allocatable` is available, the scheduler will use that instead of `Capacity`, thereby not overcommiting the node. ## Design @@ -83,7 +83,7 @@ designates resources set aside for kubernetes components, SystemReserved designa aside for non-kubernetes components (currently this is reported as all the processes lumped together in the `/system` raw container on non-systemd nodes). 
-## Kubelet Evictions Tresholds +## Kubelet Evictions Thresholds To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. Together, evictions and node allocatable help improve node stability. @@ -92,7 +92,7 @@ As of v1.5, evictions are based on overall node usage relative to `Capacity`. Kubelet evicts pods based on QoS and user configured eviction thresholds. More deails in [this doc](./kubelet-eviction.md#enforce-node-allocatable) -From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot to exceed `Allocatable`. +From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot exceed `Allocatable`. Memory and CPU limits are enforced using cgroups, but there exists no easy means to enforce storage limits though. Enforcing storage limits using Linux Quota is not possible since it's not hierarchical. Once storage is supported as a resource for `Allocatable`, Kubelet has to perform evictions based on `Allocatable` in addition to `Capacity`. @@ -107,7 +107,7 @@ For this node, the effective Node Allocatable is `28.9Gi` only; i.e. if kube and If we enforce Node Allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi` in which case evictions will not be performed unless kernel memory consumption is above `100Mi`. -In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Tresholds`. +In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Thresholds`. However, the scheduler is not expected to use more than `28.9Gi` and so `Node Allocatable` on Node Status will be `28.9Gi`. @@ -123,7 +123,7 @@ System daemons can burst within their bounding cgroups and this behavior needs t For example, Kubelet can have its own cgroup and share `KubeReserved` resources with the Container Runtime. However, Kubelet cannot burst and use up all available Node resources if `KubeReserved` is enforced. -Users are adviced to be extra careful while enforcing `SystemReserved` reservation since it can lead to critical services being CPU starved or OOM killed on the nodes. +Users are advised to be extra careful while enforcing `SystemReserved` reservation since it can lead to critical services being CPU starved or OOM killed on the nodes. The recommendation is to enforce `SystemReserved` only if a user has profiled their nodes exhaustively to come up with precise estimates. To begin with enforce `Allocatable` on `pods` only. @@ -134,6 +134,11 @@ The resource requirements of kube system daemons will grow over time as more and Over time, the project will attempt to bring down utilization, but that is not a priority as of now. So expect a drop in `Allocatable` capacity over time. +`Systemd-logind` places ssh sessions under `/user.slice`. +Its usage will not be accounted for in the nodes. +Take into account resource reservation for `/user.slice` while configuring `SystemReserved`. +Ideally `/user.slice` should reside under `SystemReserved` top level cgroup. + ## Recommended Cgroups Setup Following is the recommended cgroup configuration for Kubernetes nodes. @@ -144,16 +149,16 @@ The reason for recommending placing the `Container Runtime` under `KubeReserved` 1. A container runtime on Kubernetes nodes is not expected to be used outside of the Kubelet. 1. 
It's resource consumption is tied to the number of pods running on a node. -Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individally track their usage. +Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individually track their usage. ```text / (Cgroup Root) . -+..systemreserved or system.slice (`SystemReserved` enforced here *optionally* by kubelet) ++..systemreserved or system.slice (Specified via `--system-reserved-cgroup`; `SystemReserved` enforced here *optionally* by kubelet) . . .tasks(sshd,udev,etc) . . -+..podruntime or podruntime.slice (`KubeReserved` enforced here *optionally* by kubelet) ++..podruntime or podruntime.slice (Specified via `--kube-reserved-cgroup`; `KubeReserved` enforced here *optionally* by kubelet) . . . +..kubelet . . .tasks(kubelet) @@ -202,7 +207,8 @@ Note that the hierarchy below recommends having dedicated cgroups for kubelet an ``` -`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. +`systemreserved` & `kubereserved` cgroups are expected to be created by users. +If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. `kubepods` cgroups will be created by kubelet automatically if it is not already there. Creation of `kubepods` cgroup is tied to QoS Cgroup support which is controlled by `--cgroups-per-qos` flag. @@ -261,9 +267,11 @@ New flags introduced in this phase are as follows: 2. `--kube-reserved-cgroup=` * This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation. + * Example: `/kube.slice`. Note that absolute paths are required and systemd naming scheme isn't supported. 3. `--system-reserved-cgroup=` * This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation. + * Example: `/system.slice`. Note that absolute paths are required and systemd naming scheme isn't supported. #### Rollout details From 91368a6d9936b4d345b8b8a9fb3545edbabe569a Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Mon, 20 Feb 2017 14:34:44 -0800 Subject: [PATCH 10/10] adding a flag to opt-out of node allocatable calculation changes Signed-off-by: Vishnu Kannan --- .../design-proposals/node-allocatable.md | 20 ++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md index 276e68b7a..3dec408a9 100644 --- a/contributors/design-proposals/node-allocatable.md +++ b/contributors/design-proposals/node-allocatable.md @@ -252,6 +252,8 @@ Once a user identified the maximum pod density for each of their nodes, they wil [This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard has to be interpreted. Note that this dashboard provides usage metrics for docker runtime only as of now. +Support for evictions based on Allocatable will be introduced in this phase. + New flags introduced in this phase are as follows: 1. 
`--enforce-node-allocatable=[pods][,][kube-reserved][,][system-reserved]`
@@ -273,29 +275,34 @@ New flags introduced in this phase are as follows:
* This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation.
* Example: `/system.slice`. Note that absolute paths are required and systemd naming scheme isn't supported.
+4. `--experimental-node-allocatable-ignore-eviction-threshold`
+  * This flag is provided as an `opt-out` option to avoid including Hard eviction thresholds in Node Allocatable, which can impact existing clusters.
+  * The default value is `false`.
+
#### Rollout details
This phase is expected to improve Kubernetes node stability.
However it requires users to specify non-default values for `--kube-reserved` & `--system-reserved` flags though.
-The rollout of this phase has been long due and hence we are attempting to include it in v1.6
+The rollout of this phase is long overdue and hence we are attempting to include it in v1.6.
Since `KubeReserved` and `SystemReserved` continue to have `""` as defaults, the node's `Allocatable` does not change automatically.
Since this phase requires node drains (or pod restarts/terminations), it is considered disruptive to users.
-To rollback this phase, set `--enforce-node-allocatable` flag to `""`.
+To roll back this phase, set the `--enforce-node-allocatable` flag to `""` and `--experimental-node-allocatable-ignore-eviction-threshold` to `true`.
+The former disables Node Allocatable enforcement on all pods and the latter avoids including hard eviction thresholds in Node Allocatable.
-This phase might cause the following symptoms to occur if `--kube-reserved` and/or `--system-reserved` flags are also specified.
+This rollout in v1.6 might cause the following symptoms:
-1. OOM kills of containers and/or evictions of pods. This can happen primarily to `Burstable` and `BestEffort` pods since they can no longer use up all the resource available on the node.
+1. If `--kube-reserved` and/or `--system-reserved` flags are also specified, OOM kills of containers and/or evictions of pods. This can happen primarily to `Burstable` and `BestEffort` pods since they can no longer use up all the resources available on the node.
+1. Total allocatable capacity in the cluster reduces, resulting in pods staying `Pending`, because Hard Eviction Thresholds are included in Node Allocatable.
##### Proposed Timeline
```text
02/14/2017 - Discuss the rollout plan in sig-node meeting
02/15/2017 - Flip the switch to enable pod level cgroups by default
-enable existing experimental behavior by default
-02/21/2017 - Assess impacts based on enablement
+02/21/2017 - Merge phase 2 implementation
02/27/2017 - Kubernetes Feature complete (i.e. code freeze)
03/01/2017 - Send an announcement to kubernetes-dev@ about this rollout along with rollback options and potential issues. Recommend users to set kube and system reserved.
03/22/2017 - Kubernetes 1.6 release
```
@@ -307,7 +314,6 @@ enable existing experimental behavior by default
In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via Summary metrics API.
`Storage` will also be introduced as a reservable resource in this phase.
-Support for evictions based on Allocatable will be introduced in this phase.
## Known Issues
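To make the Allocatable arithmetic described in these patches concrete, here is a minimal, illustrative sketch. It is not kubelet code: the function names are invented for the example, while the figures (32Gi capacity, 2Gi kube-reserved, 1Gi system-reserved, 100Mi hard eviction threshold, yielding roughly 28.9Gi) and the `--experimental-node-allocatable-ignore-eviction-threshold` opt-out behavior follow the proposal text above.

```go
package main

import "fmt"

const (
	Mi = int64(1024 * 1024)
	Gi = 1024 * Mi
)

// allocatable sketches the Phase 2 calculation: Capacity minus the kube and
// system reservations, and minus the hard eviction threshold unless the
// opt-out flag (--experimental-node-allocatable-ignore-eviction-threshold)
// is set to true.
func allocatable(capacity, kubeReserved, systemReserved, hardEviction int64, ignoreEvictionThreshold bool) int64 {
	a := capacity - kubeReserved - systemReserved
	if !ignoreEvictionThreshold {
		a -= hardEviction
	}
	return a
}

// podsCgroupLimit sketches the limit applied to the top-level pods cgroup:
// Node Allocatable plus the hard eviction threshold, so that the kubelet can
// evict pods before the memcg OOM killer fires.
func podsCgroupLimit(alloc, hardEviction int64) int64 {
	return alloc + hardEviction
}

func main() {
	capacity := 32 * Gi
	kubeReserved := 2 * Gi
	systemReserved := 1 * Gi
	hardEviction := 100 * Mi

	alloc := allocatable(capacity, kubeReserved, systemReserved, hardEviction, false)
	fmt.Printf("Allocatable reported to the scheduler: %.1fGi\n", float64(alloc)/float64(Gi)) // ~28.9Gi
	fmt.Printf("Top-level pods cgroup limit:           %.1fGi\n",
		float64(podsCgroupLimit(alloc, hardEviction))/float64(Gi)) // 29.0Gi

	// With the opt-out flag set, the hard eviction threshold is not deducted,
	// so Allocatable stays at 29.0Gi, matching the pre-Phase 2 calculation.
	optOut := allocatable(capacity, kubeReserved, systemReserved, hardEviction, true)
	fmt.Printf("Allocatable with opt-out flag:         %.1fGi\n", float64(optOut)/float64(Gi)) // 29.0Gi
}
```

Setting the pods cgroup to `Allocatable` plus the hard eviction thresholds, rather than to `Allocatable` itself, is what gives the kubelet room to evict pods before the memcg limit triggers an OOM kill.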