Merge pull request #29703 from mythi/non-root-containers-and-devices-blog

blog: non-root containers and devices
2021-11-08 07:47:20 -08:00 · 2021-11-08 07:47:20 -08:00 · 5ac4b5f765
parent 1e090d8f1a c7cb7683f9
commit 5ac4b5f765
1 changed files with 238 additions and 0 deletions
--- a/content/en/blog/_posts/non-root-containers-and-devices.md
+++ b/content/en/blog/_posts/non-root-containers-and-devices.md
@@ -0,0 +1,238 @@
+---
+layout: blog
+title: 'Non-root Containers And Devices'
+date: 2021-11-09
+slug: non-root-containers-and-devices
+---
+
+**Author:** Mikko Ylinen (Intel)
+
+The user/group ID related security settings in a Pod's `securityContext` cause a problem when users want to
+deploy containers that use accelerator devices (via [Kubernetes Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)) on Linux. In this blog
+post I describe the problem and the work done so far to address it. It's not meant to be a long story about getting the [k/k issue](https://github.com/kubernetes/kubernetes/issues/92211) fixed.
+
+Instead, this post aims to raise awareness of the issue and to highlight important device use-cases too. This is needed as Kubernetes works on new related features such as support for user namespaces.
+
+## Why non-root containers can't use devices and why it matters
+One of the key security principles for running containers in Kubernetes is the
+principle of least privilege. The Pod/container `securityContext` provides the configuration
+options, such as Linux capabilities, MAC policies, and user/group ID values, to achieve this.
+
+Furthermore, cluster admins can use tools like [PodSecurityPolicy](/docs/concepts/policy/pod-security-policy/) (deprecated) or
+[Pod Security Admission](/docs/concepts/security/pod-security-admission/) (alpha) to enforce the desired security settings for pods that are being deployed in
+the cluster. These settings could, for instance, require that containers set `runAsNonRoot` or
+forbid them from running with root's group ID in `runAsGroup` or `supplementalGroups`.
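+
+As a rough illustration, a Pod that would satisfy such a policy could look like the sketch below. The Pod name, image, and numeric IDs are placeholders chosen for this example only; the `securityContext` field names are the standard Pod API fields.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: non-root-example                 # hypothetical name
+spec:
+  securityContext:
+    runAsNonRoot: true                   # containers that would run as UID 0 are not started
+    runAsUser: 1000                      # example non-root user ID
+    runAsGroup: 2000                     # example non-root primary group ID
+    supplementalGroups: [3000]           # example additional group, not root's group
+  containers:
+  - name: app
+    image: registry.example/app:latest   # hypothetical image
+    securityContext:
+      allowPrivilegeEscalation: false
+      capabilities:
+        drop: ["ALL"]                    # least privilege: no Linux capabilities
+```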
+
+In Kubernetes, the kubelet builds the list of [`Device`](https://pkg.go.dev/k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1#Device) resources to be made available to a container
+(based on inputs from the Device Plugins) and the list is included in the CreateContainer CRI message
+sent to the CRI container runtime. Each `Device` contains little information: host/container device
+paths and the desired devices cgroups permissions.
+
+The [OCI Runtime Spec for Linux Container Configuration](https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md)
+expects that in addition to the devices cgroup fields, more detailed information about the devices
+must be provided:
+
+```yaml
+{
+        "type": "<string>",
+        "path": "<string>",
+        "major": <int64>,
+        "minor": <int64>,
+        "fileMode": <uint32>,
+        "uid": <uint32>,
+        "gid": <uint32>
+},
+```
+
+The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information
+from the host for each `Device`. By default, the runtimes copy the host device's user and group IDs:
+
+- `uid` (uint32, OPTIONAL) - id of device owner in the container namespace.  
+- `gid` (uint32, OPTIONAL) - id of device group in the container namespace.
+
+Similarly, the runtimes prepare other mandatory `config.json` sections based on the CRI fields,
+including the ones defined in `securityContext`: `runAsUser`/`runAsGroup`, which become part of the
+POSIX-platform user structure via:
+
+- `uid` (int, REQUIRED) specifies the user ID in the container namespace.  
+- `gid` (int, REQUIRED) specifies the group ID in the container namespace.  
+- `additionalGids` (array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.
+
+However, the resulting `config.json` causes a problem when running containers with
+both devices added and a non-root uid/gid set via `runAsUser`/`runAsGroup`: the container user process
+has no permission to use the device, even when the device's group ID (gid, copied from the host) grants access to
+a non-root group. This is because the container user does not belong to that host group (e.g., via `additionalGids`).
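+
+The mismatch can be sketched with an excerpt of the relevant `config.json` sections. It is rendered here as YAML only so that it can carry comments (the real file is JSON), and every value is an assumption picked for illustration: the device numbers and gid 133 correspond to `renderD128` and the `render` group in the Ubuntu 20.04 listing later in this post.
+
+```yaml
+# Illustrative excerpt only, not output captured from a real runtime.
+process:
+  user:
+    uid: 1000          # from runAsUser
+    gid: 2000          # from runAsGroup
+    # no additionalGids entry for gid 133
+linux:
+  devices:
+  - type: c
+    path: /dev/dri/renderD128
+    major: 226
+    minor: 128
+    fileMode: 432      # decimal for 0660: read/write for owner and group only
+    uid: 0             # copied from the host device node (root)
+    gid: 133           # copied from the host device node (render)
+```
+
+The process is neither the owner (uid 0) nor a member of group 133, so it falls under the "other" permission bits, which grant nothing with mode 0660.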
+
+Running applications that use devices as a non-root user is a normal use case and is expected to work so that
+the security principles can be met. Therefore, several alternatives were considered to fill the gap using what PodSec/CRI/OCI supports today.
+
+## What was done to solve the issue?
+You might have noticed from the problem definition that it is at least possible to work around
+the problem by manually adding the device gid(s) to `supplementalGroups` or, in
+the case of just one device, by setting `runAsGroup` to the device's group ID. However, this is problematic because the device gid(s) may have
+different values depending on the nodes' distro/version in the cluster. For example, with GPUs the following commands for different distros and versions return different gids:
+
+Fedora 33:
+```
+$ ls -l /dev/dri/
+total 0
+drwxr-xr-x. 2 root root         80 19.10. 10:21 by-path
+crw-rw----+ 1 root video  226,   0 19.10. 10:42 card0
+crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
+$ grep -e video -e render /etc/group
+video:x:39:
+render:x:997:
+```
+
+Ubuntu 20.04:
+```
+$ ls -l /dev/dri/
+total 0
+drwxr-xr-x 2 root root         80 19.10. 17:36 by-path
+crw-rw---- 1 root video  226,   0 19.10. 17:36 card0
+crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
+$ grep -e video -e render /etc/group
+video:x:44:
+render:x:133:
+```
+
+Which number to choose in your `securityContext`? Also, what if the `runAsGroup`/`runAsUser` values cannot be hard-coded because
+they are automatically assigned during pod admission time via external security policies?
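+
+To make the portability problem concrete, here is a sketch of the manual workaround with the Ubuntu 20.04 gids from the listing above hard-coded. The Pod name, image, and the `example.com/gpu` resource name are hypothetical placeholders, not the names used by any particular device plugin.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-workaround                       # hypothetical name
+spec:
+  securityContext:
+    runAsUser: 1000
+    runAsGroup: 2000
+    # gids of "video" and "render" on the Ubuntu 20.04 node above.
+    # On the Fedora 33 node they would have to be 39 and 997 instead.
+    supplementalGroups: [44, 133]
+  containers:
+  - name: app
+    image: registry.example/gpu-app:latest   # hypothetical image
+    resources:
+      limits:
+        example.com/gpu: 1                   # hypothetical device plugin resource
+```
+
+The same manifest scheduled onto a node with different group numbering would start but fail to open the device.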
+
+Unlike volumes with `fsGroup`, the devices have no official notion of `deviceGroup`/`deviceUser` that the CRI runtimes (or kubelet)
+would be able to use. We considered using container annotations set by the device plugins (e.g., `io.kubernetes.cri.hostDeviceSupplementalGroup/`) to get custom OCI `config.json` uid/gid values.
+This would have required changes to all existing device plugins, which was not ideal.
+
+Instead, a solution that is *seamless* to end-users without getting the device plugin vendors involved was preferred. The selected approach was
+to re-use `runAsUser` and `runAsGroup` values in `config.json` for devices:
+
+```yaml
+{
+        "type": "c",
+        "path": "/dev/foo",
+        "major": 123,
+        "minor": 4,
+        "fileMode": 438,
+        "uid": <runAsUser>,
+        "gid": <runAsGroup>
+},
+```
+
+With the `runc` OCI runtime (in non-rootless mode), the device is created (`mknod(2)`) in
+the container namespace and the ownership is changed to `runAsUser`/`runAsGroup` using `chown(2)`.
+
+{{< note >}}
+Using devices in [rootless mode](/docs/tasks/administer-cluster/kubelet-in-userns/) is not supported.
+{{</note>}}
+Having the ownership updated in the container namespace is justified as the user process is the only one accessing the device. Only `runAsUser`/`runAsGroup`
+are taken into account, and, e.g., the `USER` setting in the container is currently ignored.
+
+While it is likely that no "faulty" deployments (i.e., non-root `securityContext` + devices) exist, to be absolutely sure no
+deployments break, an opt-in config entry was added to both containerd and CRI-O to enable the new behavior. The following entry:
+
+`device_ownership_from_security_context (bool)`
+
+defaults to `false` and must be enabled to use the feature.
+
+## See non-root containers using devices after the fix
+To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application using hardware accelerators, Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:
+
+```toml
+[plugins]
+  [plugins."io.containerd.grpc.v1.cri"]
+    device_ownership_from_security_context = true
+```
+
+or CRI-O with:
+
+```toml
+[crio.runtime]
+device_ownership_from_security_context = true
+```
+
+and the `Guaranteed` QoS Class Pod that runs DPDK's crypto-perf test utility with this YAML:
+
+```yaml
+...
+metadata:
+  name: qat-dpdk
+spec:
+  securityContext:
+    runAsUser: 1000
+    runAsGroup: 2000
+    fsGroup: 3000
+  containers:
+  - name: crypto-perf
+    image: intel/crypto-perf:devel
+    ...
+    resources:
+      requests:
+        cpu: "3"
+        memory: "128Mi"
+        qat.intel.com/generic: '4'
+        hugepages-2Mi: "128Mi"
+      limits:
+        cpu: "3"
+        memory: "128Mi"
+        qat.intel.com/generic: '4'
+        hugepages-2Mi: "128Mi"
+  ...
+```
+
+To verify the results, check the user and group ID that the container runs as:
+
+```
+$ kubectl exec -it qat-dpdk -c crypto-perf -- id
+```
+
+They are set to non-zero values as expected:
+
+```
+uid=1000 gid=2000 groups=2000,3000
+```
+
+Next, check that the device nodes (`qat.intel.com/generic` exposes `/dev/vfio/` devices) are accessible to `runAsUser`/`runAsGroup`:
+
+```
+$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
+total 0
+drwxr-xr-x 2 root root      140 Sep  7 10:55 .
+drwxr-xr-x 7 root root      380 Sep  7 10:55 ..
+crw------- 1 1000 2000 241,   0 Sep  7 10:55 58
+crw------- 1 1000 2000 241,   2 Sep  7 10:55 60
+crw------- 1 1000 2000 241,  10 Sep  7 10:55 68
+crw------- 1 1000 2000 241,  11 Sep  7 10:55 69
+crw-rw-rw- 1 1000 2000  10, 196 Sep  7 10:55 vfio
+```
+
+Finally, check that the non-root container is also allowed to create HugePages:
+
+```
+$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/
+```
+
+`fsGroup` gives the container user a writable HugePages emptyDir mountpoint:
+
+```
+total 0
+drwxrwsr-x 2 root 3000   0 Sep  7 10:55 .
+drwxr-xr-x 7 root root 380 Sep  7 10:55 ..
+```
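+
+The volume part of the manifest is elided in the Pod spec above. A minimal sketch of how a HugePages-backed `emptyDir` is typically declared and mounted is shown below; the volume name is hypothetical, and this is an assumption about the shape of the configuration rather than the exact manifest used for the test.
+
+```yaml
+spec:
+  containers:
+  - name: crypto-perf
+    volumeMounts:
+    - name: hugepage              # hypothetical volume name
+      mountPath: /dev/hugepages   # matches the path listed above
+  volumes:
+  - name: hugepage
+    emptyDir:
+      medium: HugePages           # back the emptyDir with pre-allocated huge pages
+```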
+
+## Help us test it and provide feedback!
+The functionality described here is expected to help with cluster security and the configurability of device permissions. Allowing
+non-root containers to use devices requires cluster admins to opt in to the functionality by setting
+`device_ownership_from_security_context = true`. To help make it the default setting, please test it and provide your feedback (via SIG-Node meetings or issues)!
+The flag is available in the CRI-O v1.22 release and queued for containerd v1.6.
+
+More work is needed to get it *properly* supported. It is known to work with `runc`, but it also needs to work
+with other OCI runtimes, where applicable. For instance, Kata Containers supports device passthrough, which makes devices
+available to containers in VM sandboxes too.
+
+Moreover, an additional challenge comes with supporting user namespaces and devices together. This problem is still [open](https://github.com/kubernetes/enhancements/pull/2101)
+and requires more brainstorming.
+
+Finally, it needs to be understood whether `runAsUser`/`runAsGroup` are enough or whether device-specific settings similar to `fsGroup` are needed in PodSpec/CRI v2.
+
+## Thanks
+My thanks go to Mike Brown (IBM, containerd), Peter Hunt (Red Hat, CRI-O), and Alexander Kanevskiy (Intel) for all the feedback and good conversations.