# Node overcommit

KubeVirt does not yet support classical Memory Overcommit Management or
Memory Ballooning. In other words, VirtualMachineInstances cannot give
back memory they have allocated. However, a few other things can be
tweaked to reduce the memory footprint and overcommit the per-VMI memory
overhead.

## Remove the Graphical Devices

First, the safest option to reduce the memory footprint is removing the
graphical device from the VMI by setting
`spec.domain.devices.autoattachGraphicsDevice` to `false`. See the video
and graphics device
[documentation](../compute/virtual_hardware.md#video-and-graphics-device)
for further details and examples.

This will save a constant amount of `16MB` per VirtualMachineInstance,
but it also disables VNC access.

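Putting the setting from this section into a manifest, a minimal sketch (the VMI name and the memory size are illustrative, not taken from this guide):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-nographics      # hypothetical name for illustration
spec:
  domain:
    devices:
      # drops the graphical device (~16MB saved), but disables VNC access
      autoattachGraphicsDevice: false
    resources:
      requests:
        memory: 1024M
```
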
## Overcommit the Guest Overhead

Before you continue, make sure you make yourself comfortable with the
[Out of Resource
Management](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/)
of Kubernetes.

Every VirtualMachineInstance requests slightly more memory from
Kubernetes than what was requested by the user for the Operating System.
The additional memory is used for the per-VMI overhead consisting of our
infrastructure which is wrapping the actual VirtualMachineInstance
process.

In order to increase the VMI density on the node, it is possible to not
request the additional overhead by setting
`spec.domain.resources.overcommitGuestOverhead` to `true`:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-nocloud
spec:
  terminationGracePeriodSeconds: 30
  domain:
    resources:
      overcommitGuestOverhead: true
      requests:
        memory: 1024M
[...]
```

This will work fine as long as most of the VirtualMachineInstances
do not request their whole memory. That is especially the case if you
have short-lived VMIs. But if you have long-lived
VirtualMachineInstances or run extremely memory-intensive tasks inside
the VirtualMachineInstance, your VMIs will sooner or later use all the
memory they are granted.

## Overcommit Guest Memory

The third option is real memory overcommit on the VMI. In this scenario
the VMI is explicitly told that it has more memory available than what
is requested from the cluster, by setting `spec.domain.memory.guest` to
a value higher than `spec.domain.resources.requests.memory`.

The following definition requests `1024MB` from the cluster but tells
the VMI that it has `2048MB` of memory available:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-nocloud
spec:
  terminationGracePeriodSeconds: 30
  domain:
    resources:
      overcommitGuestOverhead: true
      requests:
        memory: 1024M
    memory:
      guest: 2048M
[...]
```

For as long as there is enough free memory available on the node, the
VMI can happily consume up to `2048MB`. This VMI will get the
`Burstable` resource class assigned by Kubernetes (see [QoS classes in
Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-burstable)
for more details). The same eviction rules as for Pods apply to the VMI
in case the node gets under memory pressure.

Implicit memory overcommit is disabled by default. This means that when
the memory request is not specified, it is set to match
`spec.domain.memory.guest`. However, it can be enabled using
`spec.configuration.developerConfiguration.memoryOvercommit` in the
`kubevirt` CR. For example, by setting `memoryOvercommit: "150"` we
define that when the memory request is not explicitly set, it will be
implicitly set to achieve a memory overcommit of 150%. For instance,
when `spec.domain.memory.guest: 3072M`, the memory request is set to
2048M if omitted. Note that the actual memory request depends on
additional configuration options like `overcommitGuestOverhead`.

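As a sketch, the cluster-wide setting described above lives in the `kubevirt` CR like this; with `"150"`, a VMI with `spec.domain.memory.guest: 3072M` and no explicit request gets an implicit request of 2048M (3072M * 100 / 150):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      # implicit memory request = guest * 100 / 150 when the request is omitted
      memoryOvercommit: "150"
```
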
## Configuring the memory pressure behavior of nodes

If the node gets under memory pressure, depending on the `kubelet`
configuration the virtual machines may get killed by the OOM handler or
by the `kubelet` itself. It is possible to tweak that behaviour based on
the requirements of your VirtualMachineInstances by:

* Configuring [Soft Eviction
  Thresholds](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds)
* Configuring [Hard Eviction
  Thresholds](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#hard-eviction-thresholds)
* Requesting the right QoS class for VirtualMachineInstances
* Setting `--system-reserved` and `--kube-reserved`
* Enabling KSM
* Enabling swap

### Configuring Soft Eviction Thresholds

> Note: Soft Eviction will effectively shut down VirtualMachineInstances.
> They are not paused, hibernated or migrated. Further, Soft Eviction is
> disabled by default.

If configured, VirtualMachineInstances get evicted once the available
memory falls below the threshold specified via `--eviction-soft`, and
the VirtualMachineInstance is given the chance to perform a shutdown of
the VMI within a timespan specified via `--eviction-max-pod-grace-period`.
The flag `--eviction-soft-grace-period` specifies for how long a soft
eviction condition must hold before soft evictions are triggered.

If set properly according to the demands of the VMIs, overcommitting
should only lead to soft evictions in rare cases for some VMIs. They may
even get re-scheduled to the same node with less initial memory demand.
For some workload types this can be perfectly fine and lead to better
overall memory utilization.

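The three kubelet flags above can also be expressed in a `KubeletConfiguration` file; the threshold values here are examples only, not recommendations from this guide:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1Gi"        # soft threshold (--eviction-soft)
evictionSoftGracePeriod:
  memory.available: "2m"         # how long the condition must hold (--eviction-soft-grace-period)
evictionMaxPodGracePeriod: 60    # shutdown window in seconds (--eviction-max-pod-grace-period)
```
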
### Configuring Hard Eviction Thresholds

> Note: If unspecified, the kubelet will do hard evictions for Pods once
> `memory.available` falls below `100Mi`.

Limits set via `--eviction-hard` will lead to immediate eviction of
VirtualMachineInstances or Pods. This stops VMIs without a grace period
and is comparable to a power loss on a real computer.

If the hard limit is hit, VMIs may from time to time simply be killed.
They may be re-scheduled to the same node immediately again, since they
start with less memory consumption again. This can be a simple option
if the memory threshold is only very seldom hit and the work performed
by the VMIs is reproducible or can be resumed from checkpoints.

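For comparison, a hard threshold can be sketched in the same `KubeletConfiguration` form (the value is an example):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # immediate eviction below this; the kubelet default is 100Mi
```
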
## Requesting the right QoS Class for VirtualMachineInstances

Different QoS classes get [assigned to Pods and
VirtualMachineInstances](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy)
based on `requests.memory` and `limits.memory`. KubeVirt right now
supports the QoS classes `Burstable` and `Guaranteed`. `Burstable` VMIs
are evicted before `Guaranteed` VMIs.

This allows creating two classes of VMIs:

* One type can have equal `requests.memory` and `limits.memory` set
  and therefore gets the `Guaranteed` class assigned. This one will
  not get evicted and should never run into memory issues, but it is
  more demanding.
* One type can have no `limits.memory`, or a `limits.memory` which is
  greater than `requests.memory`, and therefore gets the `Burstable`
  class assigned. These VMIs will be evicted first.

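As a sketch, a VMI of the first, `Guaranteed` type would set equal requests and limits (the name and sizes are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: critical-vmi            # hypothetical name
spec:
  domain:
    resources:
      requests:
        memory: 2048M
      limits:
        memory: 2048M           # equal to requests -> Guaranteed, evicted last
```
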
## Setting `--system-reserved` and `--kube-reserved`

It may be important to reserve some memory for other daemons (not
DaemonSets) which are running on the same node (ssh, dhcp servers,
etc.). The reservation can be done with the `--system-reserved` switch.
Further, for the kubelet and Docker a special flag called
`--kube-reserved` exists.

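For example, via the `KubeletConfiguration` equivalents of the two flags (the amounts are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "1Gi"   # for node daemons such as ssh or dhcp servers
kubeReserved:
  memory: "1Gi"   # for the kubelet and the container runtime
```
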
## Enabling KSM

The [KSM](https://www.linux-kvm.org/page/KSM) (Kernel Samepage Merging)
daemon can be started on the node. Depending on its tuning parameters,
it can try more or less aggressively to merge identical pages between
applications and VirtualMachineInstances. The more aggressively it is
configured, the more CPU it will use itself, so the memory overcommit
advantages come with a slight CPU performance hit.

Config file tuning allows changes to scanning frequency (how often KSM
will activate) and aggressiveness (how many pages per second it will
scan).

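On a typical Linux node those tuning parameters are exposed through sysfs; a sketch, to be run as root (the values are examples, not recommendations):

```shell
# start the KSM daemon (0 = stop, 1 = run)
echo 1 > /sys/kernel/mm/ksm/run
# aggressiveness: how many pages are scanned per wake-up
echo 100 > /sys/kernel/mm/ksm/pages_to_scan
# scanning frequency: how long KSM sleeps between scans, in milliseconds
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs
```
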
## Enabling Swap

> Note: Swap can make sure that your VirtualMachines don't crash or get
> evicted from the node, but it comes at the cost of pretty
> unpredictable performance once the node runs out of memory, and the
> kubelet may not detect that it should evict Pods to restore
> performance.

Enabling swap is in general [not
recommended](https://github.com/kubernetes/kubernetes/issues/53533) on
Kubernetes right now. However, it can be useful in combination with KSM,
since KSM merges identical pages over time. Swap allows the VMIs to
successfully allocate memory which will then effectively never be used
because of the later de-duplication done by KSM.

# Node CPU allocation ratio

KubeVirt runs Virtual Machines in a Kubernetes Pod. This pod requests a
certain amount of CPU time from the host. On the other hand, the Virtual
Machine is created with a certain number of vCPUs. The number of vCPUs
does not necessarily correlate to the number of CPUs requested by the
Pod. Depending on the
[QoS](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/)
of the Pod, vCPUs can be scheduled on a variable number of physical
CPUs; this depends on the available CPU resources on a node. When there
are fewer available CPUs on the node than requested vCPUs, the vCPUs
will be overcommitted.

By default, each pod requests 100m (a tenth of a core) of CPU time. The
CPU requested on the pod sets the cgroups `cpu.shares`, which serves as
a priority for the scheduler to provide CPU time for vCPUs in this Pod.
As the number of vCPUs increases, this reduces the amount of CPU time
each vCPU may get when competing with other processes on the node or
other Virtual Machine Instances with a lower number of vCPUs.

The `cpuAllocationRatio` normalizes the amount of CPU time the Pod will
request based on the number of vCPUs. For example:
`Pod CPU request = number of vCPUs * 1/cpuAllocationRatio`.
When `cpuAllocationRatio` is set to 1, the full amount of vCPUs will be
requested for the Pod.

> Note: In Kubernetes, one full core equals 1000m of CPU time.
> [More information](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/)

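The formula above can be checked with a small helper; the function name is ours for illustration, not part of KubeVirt:

```python
def pod_cpu_request_millicores(vcpus: int, cpu_allocation_ratio: int) -> int:
    """Pod CPU request in millicores: vCPUs * 1/cpuAllocationRatio, 1000m per core."""
    return vcpus * 1000 // cpu_allocation_ratio

# With the default ratio of 10, a 4-vCPU VMI yields a 400m pod request.
print(pod_cpu_request_millicores(4, 10))  # 400
# A ratio of 1 requests one full core per vCPU.
print(pod_cpu_request_millicores(4, 1))   # 4000
```
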
Administrators can change this ratio by updating the KubeVirt CR:

```yaml
...
spec:
  configuration:
    developerConfiguration:
      cpuAllocationRatio: 10
```