From 51317535a6679efa60437cc54823fba03a16cd2d Mon Sep 17 00:00:00 2001
From: David Ashpole
Date: Thu, 30 Nov 2017 11:45:58 -0800
Subject: [PATCH] Add proposal for improving memcg notifications.

---
 .../design-proposals/node/kubelet-eviction.md | 48 ++++++++++++++++---
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md
index c777e7c75..2250c5c69 100644
--- a/contributors/design-proposals/node/kubelet-eviction.md
+++ b/contributors/design-proposals/node/kubelet-eviction.md
@@ -191,6 +191,48 @@ signal. If that signal is observed as being satisfied for longer than the
 specified period, the `kubelet` will initiate eviction to attempt to reclaim
 the resource that has met its eviction threshold.
 
+### Memory CGroup Notifications
+
+When the `kubelet` is started with `--experimental-kernel-memcg-notification=true`,
+it will use cgroup event notifications on the `memory.usage_in_bytes` file to trigger the eviction manager.
+With the addition of on-demand metrics, this permits the `kubelet` to trigger the eviction manager,
+collect metrics, and respond with evictions much more quickly than relying on the sync loop alone.
+
+However, a current issue is that the cgroup notification triggers based on `memory.usage_in_bytes`,
+while the eviction manager determines memory pressure based on the working set, which is
+`memory.usage_in_bytes - memory.total_inactive_file`.
+For example:
+```
+capacity = 1Gi (1024Mi)
+--eviction-hard=memory.available<250Mi
+assume memory.total_inactive_file=10Mi
+```
+When the cgroup event is triggered, `memory.usage_in_bytes = 774Mi` (capacity minus the eviction threshold).
+The eviction manager then observes:
+`working_set = memory.usage_in_bytes - memory.total_inactive_file = 764Mi`
+Signal: `memory.available = capacity - working_set = 260Mi`
+Since 260Mi is above the 250Mi threshold, no memory pressure is reported even though the notification
+fired. This will occur as long as `memory.total_inactive_file` is non-zero.
+
+### Proposed solutions
+
+1. Arm the cgroup event at `threshold*fraction` (see the first sketch at the end of this section).
+For example, if `--eviction-hard=memory.available<200Mi`, set the cgroup event at `100Mi`.
+This way, when the eviction manager is triggered, it will likely observe memory pressure.
+This is not guaranteed to always work, but it should prevent OOMs in most cases.
+
+2. Use usage instead of working set to determine memory pressure.
+This would mean that the eviction manager and the cgroup notifications use the same metric,
+and thus the response is ideal: the eviction manager is triggered exactly when memory pressure occurs.
+However, the eviction manager may often evict unnecessarily if there are large quantities of memory
+that the kernel has not yet reclaimed.
+
+3. Increase the sync loop frequency after the threshold is crossed (see the second sketch below).
+For example, the eviction manager could start collecting observations every second instead of every
+10 seconds after the threshold is crossed. This means that even though the cgroup event and the eviction
+manager are not completely in sync, the notification helps the eviction manager respond faster than
+it otherwise would. After a short period, it would resume the standard sync loop interval.
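+
+The following two sketches are illustrative only; they are not the kubelet's implementation, and the
+cgroup path, capacity, threshold, fraction, window, and interval values in them are assumed example
+values. The first is a minimal sketch of how a memcg threshold notification is registered under
+cgroup v1 (an eventfd armed through `cgroup.event_control` against `memory.usage_in_bytes`), with the
+usage threshold placed at `capacity - threshold*fraction` as described in proposed solution 1:
+
+```go
+// Illustrative sketch (not kubelet code): register a cgroup v1 memcg threshold
+// notification on memory.usage_in_bytes and block until the kernel fires it.
+package main
+
+import (
+    "encoding/binary"
+    "fmt"
+    "os"
+    "path/filepath"
+
+    "golang.org/x/sys/unix"
+)
+
+const (
+    memoryCgroupRoot = "/sys/fs/cgroup/memory" // assumed cgroup v1 mount point
+    capacity         = 1024 << 20              // 1Gi node capacity (example)
+    evictionHard     = 250 << 20               // --eviction-hard=memory.available<250Mi (example)
+    fraction         = 0.5                     // proposed solution 1: arm the event past the threshold
+)
+
+func main() {
+    // Solution 1: fire when usage exceeds capacity - threshold*fraction rather than
+    // capacity - threshold, so the working set is likely past the threshold as well.
+    usageThreshold := uint64(capacity) - uint64(float64(evictionHard)*fraction)
+
+    // eventfd that the kernel signals when memory.usage_in_bytes crosses the threshold.
+    efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
+    if err != nil {
+        panic(err)
+    }
+
+    usage, err := os.Open(filepath.Join(memoryCgroupRoot, "memory.usage_in_bytes"))
+    if err != nil {
+        panic(err)
+    }
+    defer usage.Close()
+
+    // cgroup v1 registration format: "<eventfd> <fd of memory.usage_in_bytes> <threshold bytes>".
+    config := fmt.Sprintf("%d %d %d", efd, usage.Fd(), usageThreshold)
+    if err := os.WriteFile(filepath.Join(memoryCgroupRoot, "cgroup.event_control"), []byte(config), 0o700); err != nil {
+        panic(err)
+    }
+
+    // Block until the event fires; this is where the eviction manager would be
+    // woken to collect on-demand stats and evict if a threshold is met.
+    buf := make([]byte, 8)
+    if _, err := unix.Read(efd, buf); err != nil {
+        panic(err)
+    }
+    fmt.Printf("usage crossed %d bytes (event count %d); run eviction manager\n",
+        usageThreshold, binary.LittleEndian.Uint64(buf))
+}
+```
+
+The second is a minimal sketch of proposed solution 3: after a notification fires, check eviction
+thresholds on a short interval for a brief window, then fall back to the normal housekeeping interval.
+
+```go
+// Illustrative sketch (not kubelet code): proposed solution 3. After a memcg
+// notification, poll eviction thresholds every second for a short window, then
+// fall back to the normal 10s housekeeping interval. Values are examples.
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+func main() {
+    notifications := make(chan struct{}, 1)
+    notifications <- struct{}{} // simulate a memcg threshold event firing
+
+    const (
+        normalInterval = 10 * time.Second
+        fastInterval   = 1 * time.Second
+        fastWindow     = 10 // number of fast checks before returning to normal
+    )
+
+    fastChecksLeft := 0
+    for i := 0; i < 3; i++ { // a few iterations, for illustration only
+        interval := normalInterval
+        if fastChecksLeft > 0 {
+            interval = fastInterval
+            fastChecksLeft--
+        }
+        select {
+        case <-notifications:
+            // Notification: check thresholds immediately and poll quickly for a while.
+            fastChecksLeft = fastWindow
+            fmt.Println("memcg notification received; checking eviction thresholds")
+        case <-time.After(interval):
+            fmt.Println("periodic check of eviction thresholds")
+        }
+    }
+}
+```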
+
 ### Disk
 
 Let's assume the operator started the `kubelet` with the following:
@@ -457,9 +499,3 @@ In general, it should be strongly recommended that `DaemonSet` not create
 `BestEffort` pods to avoid being identified as a candidate pod for eviction.
 Instead `DaemonSet` should ideally include Guaranteed pods only.
-## Known issues
-
-### kubelet may evict more pods than needed
-
-The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding
-the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.