mirror of https://github.com/kubernetes/kops.git
## Node Resource Handling In Kubernetes

An aspect of Kubernetes clusters that is often overlooked is the resources
non-pod components require to run, such as:

* Operating system components, e.g. `sshd`, `udev`, etc.
* Kubernetes system components, e.g. `kubelet`, the container runtime (such as
  Docker), the `node problem detector`, `journald`, etc.

As you manage your cluster, it's important to be cognisant of these
components: if your critical non-pod components don't have enough resources,
you can end up with a very unstable cluster.

### Understanding Node Resources

Each node in a cluster has resources available to it, and pods scheduled to run
on the node may or may not have resource requests or limits set on them.
Kubernetes schedules pods on nodes whose resources satisfy the pods' specified
requirements. Broadly, pods are [bin-packed][4] onto the nodes in a best-effort
attempt to utilize as much of the available resources with as few nodes as
possible.

```
      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------
```

Node resources can be categorised into four groups (as shown above):

* `kube-reserved` – reserves resources for Kubernetes system daemons.
* `system-reserved` – reserves resources for operating system components.
* `eviction-threshold` – specifies the point at which the kubelet starts
  evicting pods, i.e. when available node resources drop below this value.
* `allocatable` – the remaining node resources available for scheduling of pods
  once `kube-reserved`, `system-reserved` and `eviction-threshold` resources
  have been accounted for.

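As a sketch of how these reservations are configured, the kubelet exposes a
flag for each category (the values below are purely illustrative, not
recommendations):

```
--kube-reserved=cpu=100m,memory=256Mi
--system-reserved=cpu=100m,memory=256Mi
--eviction-hard=memory.available<100Mi
```

With these set, `Allocatable` is computed as `Capacity` minus `kube-reserved`,
`system-reserved` and the hard eviction threshold.
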
For example, on a machine with 30.5 GB of memory and 4 vCPUs, with only an
eviction threshold set as `--eviction-hard=memory.available<100Mi`, we'd get
the following `Capacity` and `Allocatable` resources:

```
$ kubectl describe node/ip-xx-xx-xx-xxx.internal
...
Capacity:
 cpu:     4
 memory:  31402412Ki
 ...
Allocatable:
 cpu:     4
 memory:  31300012Ki
 ...
```

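As a quick sanity check on the numbers above, the gap between `Capacity` and
`Allocatable` is exactly the 100Mi hard eviction threshold (a sketch in shell
arithmetic):

```shell
# Allocatable = Capacity - eviction threshold (no kube/system reservations set).
capacity_ki=31402412
eviction_ki=$((100 * 1024))                 # 100Mi expressed in Ki
allocatable_ki=$((capacity_ki - eviction_ki))
echo "${allocatable_ki}Ki"                  # prints 31300012Ki
```
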
### So, What Could Possibly Go Wrong?

The scheduler ensures that, for each resource type, the sum of the scheduled
resource requests does not exceed the node's allocatable resources. But suppose
you have a couple of applications deployed in your cluster that constantly use
far more resources than their resource requests specify (bursting above
requests but below limits under load). You can end up with a node whose pods
are collectively attempting to use more resources than are available on the
node!

This is particularly an issue with non-compressible resources like memory. For
example, in the aforementioned case, with an eviction threshold of only
`memory.available<100Mi` and no `kube-reserved` or `system-reserved`
reservations set, it is possible for a node to OOM before the `kubelet` is
able to reclaim memory (because it may not observe memory pressure right away,
since it polls `cAdvisor` to collect memory usage stats at a regular interval).

All the while, keep in mind that without `kube-reserved` or `system-reserved`
reservations set (which is the case in most clusters, e.g. [GKE][5],
[Kops][6]), the scheduler doesn't account for the resources that non-pod
components require to function properly, because `Capacity` and `Allocatable`
resources are more or less equal.

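The overcommit scenario can be sketched with illustrative numbers: a
hypothetical node with 8Gi of allocatable memory and four identical pods that
each request 1.5Gi but are allowed to burst to 3Gi:

```shell
allocatable_mi=8192    # hypothetical node allocatable memory, in Mi
pods=4
request_mi=1536        # per-pod request: the scheduler sums these
limit_mi=3072          # per-pod limit: ignored at scheduling time
echo "requested: $((pods * request_mi))Mi of ${allocatable_mi}Mi"   # 6144Mi, fits
echo "burst ceiling: $((pods * limit_mi))Mi"                        # 12288Mi, exceeds the node
```

The scheduler happily places all four pods, yet if they burst toward their
limits simultaneously the node is oversubscribed by 4Gi of memory.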
### Where Do We Go From Here?

It's difficult to give a one-size-fits-all answer to node resource allocation.
The behaviour of your cluster depends on the resource requirements of the apps
running on the cluster, the pod density and the cluster size. But there is a
[node performance dashboard][7] that exposes `cpu` and `memory` usage profiles
of the `kubelet` and the `docker` engine at multiple levels of pod density,
which may serve as a guide for what values would be appropriate for your
cluster.

That said, it seems fitting to recommend the following:

1. Always set requests with some breathing room – do not set requests to match
   your application's resource profile during idle time too closely.
2. Always set limits – so that your application doesn't hog all the memory on a
   node during a spike.
3. Don't set your limits for incompressible resources too high – at the end of
   the day, the Kubernetes scheduler schedules based on resource requests that
   match what's available on the node. During a spike, your pod will try to use
   resources beyond what it's guaranteed to have access to. As explained
   before, this can be an issue if many of your pods all burst at the same
   time.
4. Increase eviction thresholds if they are too low – while extreme utilization
   sounds ideal, running too close to the edge means the system may not have
   enough time to reclaim resources via evictions if resource usage rises
   rapidly within that window.
5. Reserve resources for system components, i.e. `kube-reserved` and
   `system-reserved`, once you've been able to profile your nodes.

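Putting recommendations 1–3 together, a container's `resources` stanza might
look like the following (the values are illustrative and should come from
profiling your own workload):

```
resources:
  requests:
    cpu: 250m        # idle-time profile plus some breathing room
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 768Mi    # capped so a memory spike can't take down the node
```
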
**Further Reading:**

 * [Configure Out Of Resource Handling][2]
 * [Reserve Compute Resources for System Daemons][1]
 * [Managing Compute Resources for Containers][3]
 * [Visualize Kubelet Performance with Node Dashboard][8]

[1]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
[2]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
[3]: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
[4]: https://en.wikipedia.org/wiki/Bin_packing_problem
[5]: https://cloud.google.com/container-engine/
[6]: https://github.com/kubernetes/kops
[7]: http://node-perf-dash.k8s.io/#/builds
[8]: http://kubernetes.io/blog/2016/11/visualize-kubelet-performance-with-node-dashboard.html