---
reviewers:
- dchen1107
- egernst
- tallclair
title: Pod Overhead
content_type: concept
weight: 50
---

<!-- overview -->

{{< feature-state for_k8s_version="v1.18" state="beta" >}}

When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod.
_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
on top of the container requests & limits.

<!-- body -->

In Kubernetes, the Pod's overhead is set at
[admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
time according to the overhead associated with the Pod's
[RuntimeClass](/docs/concepts/containers/runtime-class/).

When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
the Pod cgroup, and when carrying out Pod eviction ranking.

## Enabling Pod Overhead {#set-up}

You need to make sure that the `PodOverhead`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
across your cluster, and that a `RuntimeClass` which defines the `overhead` field is used.
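
If your cluster runs a version where the gate is not on by default, or it has been turned off, you can set it explicitly with the standard `--feature-gates` flag on the relevant components. The invocations below are only a sketch; how the flag is actually passed depends on how your cluster is deployed (kubeadm drop-ins, systemd units, or a managed provider's configuration):

```bash
# Sketch only: enable the PodOverhead feature gate on the components that use it.
kube-apiserver --feature-gates=PodOverhead=true
kube-scheduler --feature-gates=PodOverhead=true
kubelet --feature-gates=PodOverhead=true
```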

## Usage example

To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
an example, you could use the following RuntimeClass definition with a virtualizing container runtime
that uses around 120MiB per Pod for the virtual machine and the guest OS:

```yaml
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```
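
As a usage sketch, you could save the manifest above to a file (the name `kata-fc-runtimeclass.yaml` is just an assumption) and create the RuntimeClass with `kubectl`:

```bash
# Create the RuntimeClass and confirm the fixed per-Pod overhead it declares.
kubectl apply -f kata-fc-runtimeclass.yaml
kubectl get runtimeclass kata-fc -o jsonpath='{.overhead.podFixed}'
```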

Workloads that are created with the `kata-fc` RuntimeClass handler will take the memory and
CPU overheads into account for resource quota calculations, node scheduling, and Pod cgroup sizing.

Consider running the given example workload, test-pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  runtimeClassName: kata-fc
  containers:
  - name: busybox-ctr
    image: busybox
    stdin: true
    tty: true
    resources:
      limits:
        cpu: 500m
        memory: 100Mi
  - name: nginx-ctr
    image: nginx
    resources:
      limits:
        cpu: 1500m
        memory: 100Mi
```
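
Create the Pod in the usual way; the manifest file name here is again only an assumption:

```bash
# Create the example Pod and check that it is running.
kubectl apply -f test-pod.yaml
kubectl get pod test-pod
```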

At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
to include an `overhead`.
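
After mutation, the Pod's spec carries the overhead directly. A sketch of the relevant fragment for this example looks like:

```yaml
# Fragment of the mutated Pod spec (not a complete manifest).
spec:
  runtimeClassName: kata-fc
  overhead:
    cpu: 250m
    memory: 120Mi
```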

After the RuntimeClass admission controller has run, you can check the updated PodSpec:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
```

The output is:
```
map[cpu:250m memory:120Mi]
```

If a ResourceQuota is defined, the sum of container requests as well as the
`overhead` field are counted against it.
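
For instance, a quota on requests (the name, namespace, and amounts below are illustrative) would have 2250m CPU and 320Mi of memory charged against it for `test-pod`, not just the 2000m and 200Mi requested by its containers:

```yaml
# Sketch of a ResourceQuota that Pod Overhead counts against.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-overhead-demo
  namespace: default
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 4Gi
```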

When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
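
The arithmetic for this example is:

```
CPU:    500m + 1500m (containers) + 250m (overhead) = 2250m (2.25 CPU)
Memory: 100Mi + 100Mi (containers) + 120Mi (overhead) = 320Mi
```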

Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
for the Pod. It is within this cgroup that the underlying container runtime will create containers.

If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
the kubelet will set an upper limit for the pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
defined in the PodSpec.

For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
requests plus the `overhead` defined in the PodSpec.
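
For this example, the total of limits (and requests) plus overhead is 2250m CPU and 320Mi of memory, so the pod-level cgroup values work out roughly as follows. This is a sketch assuming cgroup v1, the default 100ms CFS period, and the kubelet's usual millicores-to-shares conversion:

```
memory.limit_in_bytes: 320Mi = 335544320 bytes
cpu.cfs_quota_us:      2.25 CPU x 100000us period = 225000
cpu.shares:            2250m x 1024 / 1000 = 2304
```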

Looking at our example, verify the container limits for the workload (because only limits are set,
the container requests default to the same values):
```bash
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
```

The total container requests are 2000m CPU and 200MiB of memory:
```
map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
```

Check this against what is observed by the node:
```bash
kubectl describe node | grep test-pod -B2
```

The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
```
  Namespace                   Name                CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                   ----                ------------  ----------   ---------------  -------------  ---
  default                     test-pod            2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
```

## Verify Pod cgroup limits

Check the Pod's memory cgroups on the node where the workload is running. In the following example, [`crictl`](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md)
is used on the node, which provides a CLI for CRI-compatible container runtimes. This is an
advanced example to show PodOverhead behavior, and it is not expected that users should need to check
cgroups directly on the node.

First, on the particular node, determine the Pod identifier:

```bash
# Run this on the node where the Pod is scheduled
POD_ID="$(sudo crictl pods --name test-pod -q)"
```

From this, you can determine the cgroup path for the Pod:
```bash
# Run this on the node where the Pod is scheduled
sudo crictl inspectp -o=json $POD_ID | grep cgroupsPath
```

The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.
```
        "cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
```

In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
```bash
# Run this on the node where the Pod is scheduled.
# Also, change the name of the cgroup to match the cgroup allocated for your pod.
cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
```

This is 320 MiB, as expected:
```
335544320
```
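
You can check the Pod-level CPU quota the same way. This sketch assumes cgroup v1 with the `cpu` controller mounted at `/sys/fs/cgroup/cpu`; change the cgroup name to match the one allocated for your Pod:

```bash
# Run this on the node where the Pod is scheduled.
cat /sys/fs/cgroup/cpu/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.cfs_quota_us
```

With a 2250m total (limits plus overhead) and the default 100ms CFS period, the expected value is 225000.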

### Observability

A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
to help identify when PodOverhead is being utilized and to help observe stability of workloads
running with a defined overhead. This functionality is not available in the 1.9 release of
kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
from source in the meantime.
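
Once the metric is available in your kube-state-metrics build, a query along these lines could surface the configured overhead for the example Pod; the label names here are an assumption based on common kube-state-metrics conventions:

```
kube_pod_overhead{namespace="default", pod="test-pod"}
```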

## {{% heading "whatsnext" %}}

* [RuntimeClass](/docs/concepts/containers/runtime-class/)
* [PodOverhead Design](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md)