21 KiB
Kubelet pod level resource management
Authors:
- Buddha Prakash (@dubstack)
- Vishnu Kannan (@vishh)
- Derek Carr (@derekwaynecarr)
Last Updated: 02/21/2017
Status: Implementation planned for Kubernetes 1.6
This document proposes a design for introducing pod level resource accounting to Kubernetes. It outlines the implementation and associated rollout plan.
Introduction
Kubernetes supports container level isolation by allowing users
to specify compute resource requirements via requests and
limits on individual containers. The kubelet
delegates creation of a
cgroup sandbox for each container to its associated container runtime.
Each pod has an associated Quality of Service (QoS)
class based on the aggregate resource requirements made by individual
containers in the pod. The kubelet
has the ability to
evict pods when compute resources are scarce. It evicts
pods with the lowest QoS class in order to attempt to maintain stability of the
node.
The kubelet
has no associated cgroup sandbox for individual QoS classes or
individual pods. This inhibits the ability to perform proper resource
accounting on the node, and introduces a number of code complexities when
trying to build features around QoS.
This design introduces a new cgroup hierarchy to enable the following:
- Enforce QoS classes on the node.
- Simplify resource accounting at the pod level.
- Allow containers in a pod to share slack resources within its pod cgroup. For example, a Burstable pod has two containers, where one container makes a CPU request and the other container does not. The latter container should get CPU time not used by the former container. Today, it must compete for scare resources at the node level across all BestEffort containers.
- Ability to charge per container overhead to the pod instead of the node.
This overhead is container runtime specific. For example,
docker
has an associatedcontainerd-shim
process that is created for each container which should be charged to the pod. - Ability to charge any memory usage of memory-backed volumes to the pod when an individual container exits instead of the node.
Enabling QoS and Pod level cgroups
To enable the new cgroup hierarchy, the operator must enable the
--cgroups-per-qos
flag. Once enabled, the kubelet
will start managing
inner nodes of the described cgroup hierarchy.
The --cgroup-root
flag if not specified when the --cgroups-per-qos
flag
is enabled will default to /
. The kubelet
will parent any cgroups
it creates below that specified value per the
node allocatable design.
Configuring a cgroup driver
The kubelet
will support manipulation of the cgroup hierarchy on
the host using a cgroup driver. The driver is configured via the
--cgroup-driver
flag.
The supported values are the following:
cgroupfs
is the default driver that performs direct manipulation of the cgroup filesystem on the host in order to manage cgroup sandboxes.systemd
is an alternative driver that manages cgroup sandboxes using transient slices for resources that are supported by that init system.
Depending on the configuration of the associated container runtime,
operators may have to choose a particular cgroup driver to ensure
proper system behavior. For example, if operators use the systemd
cgroup driver provided by the docker
runtime, the kubelet
must
be configured to use the systemd
cgroup driver.
Implementation of either driver will delegate to the libcontainer library in opencontainers/runc.
Conversion of cgroupfs to systemd naming conventions
Internally, the kubelet
maintains both an abstract and a concrete name
for its associated cgroup sandboxes. The abstract name follows the traditional
cgroupfs
style syntax. The concrete name is the name for how the cgroup
sandbox actually appears on the host filesystem after any conversions performed
based on the cgroup driver.
If the systemd
cgroup driver is used, the kubelet
converts the cgroupfs
style syntax into transient slices, and as a result, it must follow systemd
conventions for path encoding.
For example, the cgroup name /burstable/pod123-456
is translated to a
transient slice with the name burstable-pod123_456.slice
. Given how
systemd manages the cgroup filesystem, the concrete name for the cgroup
sandbox becomes /burstable.slice/burstable-pod123_456.slice
.
Integration with container runtimes
The kubelet
when integrating with container runtimes always provides the
concrete cgroup filesystem name for the pod sandbox.
Conversion of CPU millicores to cgroup configuration
Kubernetes measures CPU requests and limits in millicores.
The following formula is used to convert CPU in millicores to cgroup values:
- cpu.shares = (cpu in millicores * 1024) / 1000
- cpu.cfs_period_us = 100000 (i.e. 100ms)
- cpu.cfs_quota_us = quota = (cpu in millicores * 100000) / 1000
Pod level cgroups
The kubelet
will create a cgroup sandbox for each pod.
The naming convention for the cgroup sandbox is pod<pod.UID>
. It enables
the kubelet
to associate a particular cgroup on the host filesytem
with a corresponding pod without managing any additional state. This is useful
when the kubelet
restarts and needs to verify the cgroup filesystem.
A pod can belong to one of the following 3 QoS classes in decreasing priority:
- Guaranteed
- Burstable
- BestEffort
The resource configuration for the cgroup sandbox is dependent upon the pod's associated QoS class.
Guaranteed QoS
A pod in this QoS class has its cgroup sandbox configured as follows:
pod<UID>/cpu.shares = sum(pod.spec.containers.resources.requests[cpu])
pod<UID>/cpu.cfs_quota_us = sum(pod.spec.containers.resources.limits[cpu])
pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory])
Burstable QoS
A pod in this QoS class has its cgroup sandbox configured as follows:
pod<UID>/cpu.shares = sum(pod.spec.containers.resources.requests[cpu])
If all containers in the pod specify a cpu limit:
pod<UID>/cpu.cfs_quota_us = sum(pod.spec.containers.resources.limits[cpu])
Finally, if all containers in the pod specify a memory limit:
pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory])
BestEffort QoS
A pod in this QoS class has its cgroup sandbox configured as follows:
pod<UID>/cpu.shares = 2
QoS level cgroups
The kubelet
defines a --cgroup-root
flag that is used to specify the ROOT
node in the cgroup hierarchy below which the kubelet
should manange individual
cgroup sandboxes. It is strongly recommended that users keep the default
value for --cgroup-root
as /
in order to avoid deep cgroup hierarchies. The
kubelet
creates a cgroup sandbox under the specified path ROOT/kubepods
per
node allocatable to parent pods. For simplicity, we will
refer to ROOT/kubepods
as ROOT
in this document.
The ROOT
cgroup sandbox is used to parent all pod sandboxes that are in
the Guaranteed QoS class. By definition, pods in this class have cpu and
memory limits specified that are equivalent to their requests so the pod
level cgroup sandbox confines resource consumption without the need of an
additional cgroup sandbox for the tier.
When the kubelet
launches, it will ensure a Burstable
cgroup sandbox
and a BestEffort
cgroup sandbox exist as children of ROOT
. These cgroup
sandboxes will parent pod level cgroups in those associated QoS classes.
The kubelet
highly prioritizes resource utilization, and thus
allows BestEffort and Burstable pods to potentially consume as many
resources that are presently available on the node.
For compressible resources like CPU, the kubelet
attempts to mitigate
the issue via its use of CPU CFS shares. CPU time is proportioned
dynamically when there is contention using CFS shares that attempts to
ensure minimum requests are satisfied.
For incompressible resources, this prioritization scheme can inhibit the ability of a pod to have its requests satisfied. For example, a Guaranteed pods memory request may not be satisfied if there are active BestEffort pods consuming all available memory.
As a node operator, I may want to satisfy the following use cases:
- I want to prioritize access to compressible resources for my system and/or kubernetes daemons over end-user pods.
- I want to prioritize access to compressible resources for my Guaranteed workloads over my Burstable workloads.
- I want to prioritize access to compressible resources for my Burstable workloads over my BestEffort workloads.
Almost all operators are encouraged to support the first use case by enforcing
node allocatable via --system-reserved
and --kube-reserved
flags. It is understood that not all operators may feel the need to extend
that level of reservation to Guaranteed and Burstable workloads if they choose
to prioritize utilization. That said, many users in the community deploy
cluster services as Guaranteed or Burstable workloads via a DaemonSet
and would like a similar
resource reservation model as is provided via node allocatable
for system and kubernetes daemons.
For operators that have this concern, the kubelet
with opt-in configuration
will attempt to limit the abilty for a pod in a lower QoS tier to burst utilization
of a compressible resource that was requested by a pod in a higher QoS tier.
The kubelet
will support a flag experimental-qos-reserved
that
takes a set of percentages per incompressible resource that controls how the
QoS cgroup sandbox attempts to reserve resources for its tier. It attempts
to reserve requested resources to exclude pods from lower OoS classes from
using resources requested by higher QoS classes. The flag will accept values
in a range from 0-100%, where a value of 0%
instructs the kubelet
to attempt
no reservation, and a value of 100%
will instruct the kubelet
to attempt to
reserve the sum of requested resource across all pods on the node. The kubelet
initially will only support memory
. The default value per incompressible
resource if not specified is for no reservation to occur for the incompressible
resource.
Prior to starting a pod, the kubelet
will attempt to update the
QoS cgroup sandbox associated with the lower QoS tier(s) in order
to prevent consumption of the requested resource by the new pod.
For example, prior to starting a Guaranteed pod, the Burstable
and BestEffort QoS cgroup sandboxes are adjusted. For resource
specific details, and concerns, see the sections per resource that
follow.
The kubelet
will allocate resources to the QoS level cgroup
dynamically in response to the following events:
- kubelet startup/recovery
- prior to creation of the pod level cgroup
- after deletion of the pod level cgroup
- at periodic intervals to reach
experimental-qos-reserved
heurisitc that converge to a desired state.
All writes to the QoS level cgroup sandboxes are protected via a common lock in the kubelet to ensure we do not have multiple concurrent writers to this tier in the hierarchy.
QoS level CPU allocation
The BestEffort
cgroup sandbox is statically configured as follows:
ROOT/besteffort/cpu.shares = 2
This ensures that allocation of CPU time to pods in this QoS class is given the lowest priority.
The Burstable
cgroup sandbox CPU share allocation is dynamic based
on the set of pods currently scheduled to the node.
ROOT/burstable/cpu.shares = max(sum(Burstable pods cpu requests, 2)
The Burstable cgroup sandbox is updated dynamically in the exit
points described in the previous section. Given the compressible
nature of CPU, and the fact that cpu.shares are evaluated via relative
priority, the risk of an update being incorrect is minimized as the kubelet
converges to a desired state. Failure to set cpu.shares
at the QoS level
cgroup would result in 500m
of cpu for a Guaranteed pod to have different
meaning than 500m
of cpu for a Burstable pod in the current hierarchy. This
is because the default cpu.shares
value if unspecified is 1024
and cpu.shares
are evaluated relative to sibling nodes in the cgroup hierarchy. As a consequence,
all of the Burstable pods under contention would have a relative priority of 1 cpu
unless updated dynamically to capture the sum of requests. For this reason,
we will always set cpu.shares
for the QoS level sandboxes
by default as part of roll-out for this feature.
QoS level memory allocation
By default, no memory limits are applied to the BestEffort
and Burstable QoS level cgroups unless a --qos-reserve-requests
value
is specified for memory.
The heuristic that is applied is as follows for each QoS level sandbox:
ROOT/burstable/memory.limit_in_bytes =
Node.Allocatable - {(summation of memory requests of `Guaranteed` pods)*(reservePercent / 100)}
ROOT/besteffort/memory.limit_in_bytes =
Node.Allocatable - {(summation of memory requests of all `Guaranteed` and `Burstable` pods)*(reservePercent / 100)}
A value of --experimental-qos-reserved=memory=100%
will cause the
kubelet
to adjust the Burstable and BestEffort cgroups from consuming memory
that was requested by a higher QoS class. This increases the risk
of inducing OOM on BestEffort and Burstable workloads in favor of increasing
memory resource guarantees for Guaranteed and Burstable workloads. A value of
--experimental-qos-reserved=memory=0%
will allow a Burstable
and BestEffort QoS sandbox to consume up to the full node allocatable amount if
available, but increases the risk that a Guaranteed workload will not have
access to requested memory.
Since memory is an incompressible resource, it is possible that a QoS level cgroup sandbox may not be able to reduce memory usage below the value specified in the heuristic described earlier during pod admission and pod termination.
As a result, the kubelet
runs a periodic thread to attempt to converge
to this desired state from the above heuristic. If unreclaimable memory
usage has exceeded the desired limit for the sandbox, the kubelet
will
attempt to set the effective limit near the current usage to put pressure
on the QoS cgroup sandbox and prevent further consumption.
The kubelet
will not wait for the QoS cgroup memory limit to converge
to the desired state prior to execution of the pod, but it will always
attempt to cap the existing usage of QoS cgroup sandboxes in lower tiers.
This does mean that the new pod could induce an OOM event at the ROOT
cgroup, but ideally per our QoS design, the oom_killer targets a pod
in a lower QoS class, or eviction evicts a lower QoS pod. The periodic
task is then able to converge to the steady desired state so any future
pods in a lower QoS class do not impact the pod at a higher QoS class.
Adjusting the memory limits for the QoS level cgroup sandbox carries greater risk given the incompressible nature of memory. As a result, we are not enabling this function by default, but would like operators that want to value resource priority over resource utilization to gather real-world feedback on its utility.
As a best practice, oeprators that want to provide a similar resource reservation model for Guaranteed pods as we offer via enforcement of node allocatable are encouraged to schedule their Guaranteed pods first as it will ensure the Burstable and BestEffort tiers have had their QoS memory limits appropriately adjusted before taking unbounded workload on node.
Memory backed volumes
The pod level cgroup ensures that any writes to a memory backed volume are correctly charged to the pod sandbox even when a container process in the pod restarts.
All memory backed volumes are removed when a pod reaches a terminal state.
The kubelet
verifies that a pod's cgroup is deleted from the
host before deleting a pod from the API server as part of the graceful
deletion process.
Log basic cgroup management
The kubelet
will log and collect metrics associated with cgroup manipulation.
It will log metrics for cgroup create, update, and delete actions.
Rollout Plan
Kubernetes 1.5
The support for the described cgroup hierarchy is experimental.
Kubernetes 1.6+
The feature will be enabled by default.
As a result, we will recommend that users drain their nodes prior
to upgrade of the kubelet
. If users do not drain their nodes, the
kubelet
will act as follows:
- If a pod has a
RestartPolicy=Never
, then mark the pod asFailed
and terminate its workload. - All other pods that are not parented by a pod-level cgroup will be restarted.
The cgroups-per-qos
flag will be enabled by default, but user's
may choose to opt-out. We may deprecate this opt-out mechanism
in Kubernetes 1.7, and remove the flag entirely in Kubernetes 1.8.
Risk Assessment
The impact of the unified cgroup hierarchy is restricted to the kubelet
.
Potential issues:
- Bugs
- Performance and/or reliability issues for
BestEffort
pods. This is most likely to appear on E2E test runs that mix/match pods across different QoS tiers. - User misconfiguration; most notably the
--cgroup-driver
needs to match the expected behavior of the container runtime. We provide clear errors inkubelet
logs for container runtimes that we include in tree.
Proposed Timeline
- 01/31/2017 - Discuss the rollout plan in sig-node meeting
- 02/14/2017 - Flip the switch to enable pod level cgroups by default
- enable existing experimental behavior by default
- 02/21/2017 - Assess impacts based on enablement
- 02/27/2017 - Kubernetes Feature complete (i.e. code freeze)
- opt-in behavior surrounding the feature (
experimental-qos-reserved
support) completed. - 03/01/2017 - Send an announcement to kubernetes-dev@ about the rollout and potential impact
- 03/22/2017 - Kubernetes 1.6 release
- TBD (1.7?) - Eliminate the option to not use the new cgroup hierarchy.
This is based on the tentative timeline of kubernetes 1.6 release. Need to work out the timeline with the 1.6 release czar.
Future enhancements
Add Pod level metrics to Kubelet's metrics provider
Update the kubelet
metrics provider to include pod level metrics.
Evaluate supporting evictions local to QoS cgroup sandboxes
Rather than induce eviction at /
or /kubepods
, evaluate supporting
eviction decisions for the unbounded QoS tiers (Burstable, BestEffort).
Examples
The following describes the cgroup representation of a node with pods across multiple QoS classes.
Cgroup Hierachy
The following identifies a sample hierarchy based on the described design.
It assumes the flag --experimental-qos-reserved
is not enabled for clarity.
$ROOT
|
+- Pod1
| |
| +- Container1
| +- Container2
| ...
+- Pod2
| +- Container3
| ...
+- ...
|
+- burstable
| |
| +- Pod3
| | |
| | +- Container4
| | ...
| +- Pod4
| | +- Container5
| | ...
| +- ...
|
+- besteffort
| |
| +- Pod5
| | |
| | +- Container6
| | +- Container7
| | ...
| +- ...
Guaranteed Pods
We have two pods Pod1 and Pod2 having Pod Spec given below
kind: Pod
metadata:
name: Pod1
spec:
containers:
name: foo
resources:
limits:
cpu: 10m
memory: 1Gi
name: bar
resources:
limits:
cpu: 100m
memory: 2Gi
kind: Pod
metadata:
name: Pod2
spec:
containers:
name: foo
resources:
limits:
cpu: 20m
memory: 2Gii
Pod1 and Pod2 are both classified as Guaranteed and are nested under the ROOT
cgroup.
/ROOT/Pod1/cpu.quota = 110m
/ROOT/Pod1/cpu.shares = 110m
/ROOT/Pod1/memory.limit_in_bytes = 3Gi
/ROOT/Pod2/cpu.quota = 20m
/ROOT/Pod2/cpu.shares = 20m
/ROOT/Pod2/memory.limit_in_bytes = 2Gi
Burstable Pods
We have two pods Pod3 and Pod4 having Pod Spec given below:
kind: Pod
metadata:
name: Pod3
spec:
containers:
name: foo
resources:
limits:
cpu: 50m
memory: 2Gi
requests:
cpu: 20m
memory: 1Gi
name: bar
resources:
limits:
cpu: 100m
memory: 1Gi
kind: Pod
metadata:
name: Pod4
spec:
containers:
name: foo
resources:
limits:
cpu: 20m
memory: 2Gi
requests:
cpu: 10m
memory: 1Gi
Pod3 and Pod4 are both classified as Burstable and are hence nested under the Burstable cgroup.
/ROOT/burstable/cpu.shares = 30m
/ROOT/burstable/memory.limit_in_bytes = Allocatable - 5Gi
/ROOT/burstable/Pod3/cpu.quota = 150m
/ROOT/burstable/Pod3/cpu.shares = 20m
/ROOT/burstable/Pod3/memory.limit_in_bytes = 3Gi
/ROOT/burstable/Pod4/cpu.quota = 20m
/ROOT/burstable/Pod4/cpu.shares = 10m
/ROOT/burstable/Pod4/memory.limit_in_bytes = 2Gi
Best Effort pods
We have a pod, Pod5, having Pod Spec given below:
kind: Pod
metadata:
name: Pod5
spec:
containers:
name: foo
resources:
name: bar
resources:
Pod5 is classified as BestEffort and is hence nested under the BestEffort cgroup
/ROOT/besteffort/cpu.shares = 2
/ROOT/besteffort/cpu.quota= not set
/ROOT/besteffort/memory.limit_in_bytes = Allocatable - 7Gi
/ROOT/besteffort/Pod5/memory.limit_in_bytes = no limit