mirror of https://github.com/kubernetes/kops.git
132 lines
5.8 KiB
Markdown
132 lines
5.8 KiB
Markdown
## Node Resource Handling In Kubernetes
|
||
|
||
An aspect of Kubernetes clusters that is often overlooked is the resources non-
|
||
pod components require to run, such as:
|
||
|
||
* Operating system components i.e. `sshd`, `udev` etc.
|
||
* Kubernetes system components i.e. `kubelet`, `container runtime` (e.g.
|
||
containerd), `node problem detector`, `journald` etc.
|
||
|
||
As you manage your cluster, it's important that you are cognisant of these
|
||
components because if your critical non-pod components don't have enough
|
||
resources, you might end up with a very unstable cluster.
|
||
|
||
### Understanding Node Resources
|
||
|
||
Each node in a cluster has resources available to it and pods scheduled to run
|
||
on the node may or may not have resource requests or limits set on them.
|
||
Kubernetes schedules pods on nodes that have resources that satisfy the pod's
|
||
specified requirements. Broadly, pods are [bin-packed][4] onto the nodes in a
|
||
best effort attempt to utilize as much of the resources available with as few
|
||
nodes as possible.
|
||
|
||
```
|
||
Node Capacity
|
||
---------------------------
|
||
| kube-reserved |
|
||
|-------------------------|
|
||
| system-reserved |
|
||
|-------------------------|
|
||
| eviction-threshold |
|
||
|-------------------------|
|
||
| |
|
||
| allocatable |
|
||
| (available for pods) |
|
||
| |
|
||
| |
|
||
---------------------------
|
||
```
|
||
|
||
Node resources can be categorised into 4 (as shown above):
|
||
|
||
* `kube-reserved` – reserves resources for kubernetes system daemons.
|
||
* `system-reserved` – reserves resources for operating system components.
|
||
* `eviction-threshold` – specifies limits that trigger evictions when node
|
||
resources drop below the reserved value.
|
||
* `allocatable` – the remaining node resources available for scheduling of pods
|
||
when `kube-reserved`, `system-reserved` and `eviction-threshold` resources
|
||
have been accounted for.
|
||
|
||
For example, with a 30.5 GB, 4 vCPUs machine with only `eviction-thresholds` set
|
||
as `--eviction-hard=memory.available<100Mi` we'd get the following `Capacity`
|
||
and `Allocatable` resources:
|
||
|
||
```
|
||
$ kubectl describe node/ip-xx-xx-xx-xxx.internal
|
||
...
|
||
Capacity:
|
||
cpu: 4
|
||
memory: 31402412Ki
|
||
...
|
||
Allocatable:
|
||
cpu: 4
|
||
memory: 31300012Ki
|
||
...
|
||
```
|
||
|
||
### So, What Could Possibly Go Wrong?
|
||
|
||
The scheduler ensures that for each resource type, the sum of the resources
|
||
scheduled does not surpass the sum of allocatable resources. But suppose you
|
||
have a couple of applications deployed in your cluster that are constantly using
|
||
up way more resources set in their resource requests (burst above requests but
|
||
below limits during workload). You end up with a node with pods that are each
|
||
attempting to take up more resources than there are available on the node!
|
||
|
||
This is particularly an issue with non-compressible resources like memory. For
|
||
example, in the aforementioned case, with an eviction threshold of only
|
||
`memory.available<100Mi` and no `kube-reserved` nor `system-reserved`
|
||
reservations set, it is possible for a node to OOM prior to when `kubelet` is
|
||
able to reclaim memory (because it may not observe memory pressure right away,
|
||
since it polls `cAdvisor` to collect memory usage stats at a regular interval).
|
||
|
||
All the while, keep in mind that without `kube-reserved` nor `system-reserved`
|
||
reservations set (which is most clusters i.e. [GKE][5], [kOps][6]), the
|
||
scheduler doesn't account for resources that non-pod components would require to
|
||
function properly because `Capacity` and `Allocatable` resources are more or
|
||
less equal.
|
||
|
||
### Where Do We Go From Here?
|
||
|
||
It's difficult to give a one size fits all answer to node resource allocation.
|
||
The behaviour of your cluster depends on the resource requirements of the apps
|
||
running on the cluster, the pod density and the cluster size. But there's a
|
||
[node performance dashboard][7] that exposes `cpu` and `memory` usage profiles
|
||
of `kubelet` and `docker` engine at multiple levels of pod density which may
|
||
serve as a guide for what values would be appropriate for your cluster.
|
||
|
||
But, it seems fitting to recommend the following:
|
||
|
||
1. Always set requests with some breathing room – do not set requests to match
|
||
your application's resource profile during idle time too closely.
|
||
2. Always set limits – so that your application doesn't hog all the memory on a
|
||
node during a spike.
|
||
3. Don't set your limits for incompressible resources too high - at the end of
|
||
the day, the Kubernetes scheduler schedules based on resource requests which
|
||
match what's available on the node. During a spike, your pod technically will
|
||
try to access resources outside what it's guaranteed to have access to. As
|
||
explained before, this can be an issue if a bunch of your pods are all
|
||
bursting at the same time.
|
||
4. Increase eviction thresholds if they are too low - while extreme utilization
|
||
is ideal, it might be too close to the edge such that the system doesn't have
|
||
enough time to reclaim resources via evictions if the resource increases
|
||
within that window rapidly.
|
||
5. Reserve resources for system components once you've been able to profile your
|
||
nodes i.e. `kube-reserved` and `system-reserved`.
|
||
|
||
**Further Reading:**
|
||
|
||
* [Configure Out Of Resource Handling][2]
|
||
* [Reserve Compute Resources for System Daemons][1]
|
||
* [Managing Compute Resources for Containers][3]
|
||
* [Visualize Kubelet Performance with Node Dashboard][8]
|
||
|
||
[1]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
|
||
[2]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
|
||
[3]: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
|
||
[4]: https://en.wikipedia.org/wiki/Bin_packing_problem
|
||
[5]: https://cloud.google.com/container-engine/
|
||
[6]: https://github.com/kubernetes/kops
|
||
[7]: http://node-perf-dash.k8s.io/#/builds
|
||
[8]: http://kubernetes.io/blog/2016/11/visualize-kubelet-performance-with-node-dashboard.html
|