Merge pull request #29703 from mythi/non-root-containers-and-devices-blog
blog: non-root containers and devices
commit 5ac4b5f765

---
layout: blog
title: 'Non-root Containers And Devices'
date: 2021-11-09
slug: non-root-containers-and-devices
---

**Author:** Mikko Ylinen (Intel)

The user/group ID related security settings in a Pod's `securityContext` trigger a problem when users want to
deploy containers that use accelerator devices (via [Kubernetes Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)) on Linux. In this blog
post I talk about the problem and describe the work done so far to address it. It's not meant to be a long story about getting the [k/k issue](https://github.com/kubernetes/kubernetes/issues/92211) fixed.

Instead, this post aims to raise awareness of the issue and to highlight important device use-cases too. This is needed as Kubernetes works on new related features such as support for user namespaces.

## Why non-root containers can't use devices and why it matters
One of the key security principles for running containers in Kubernetes is the
principle of least privilege. The Pod/container `securityContext` specifies the config
options to set, e.g., Linux capabilities, MAC policies, and user/group ID values to achieve this.

Furthermore, the cluster admins are supported with tools like [PodSecurityPolicy](/docs/concepts/policy/pod-security-policy/) (deprecated) or
[Pod Security Admission](/docs/concepts/security/pod-security-admission/) (alpha) to enforce the desired security settings for pods that are being deployed in
the cluster. These settings could, for instance, require that containers must be `runAsNonRoot` or
that they are forbidden from running with root's group ID in `runAsGroup` or `supplementalGroups`.

In Kubernetes, the kubelet builds the list of [`Device`](https://pkg.go.dev/k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1#Device) resources to be made available to a container
(based on inputs from the Device Plugins) and the list is included in the CreateContainer CRI message
sent to the CRI container runtime. Each `Device` contains little information: host/container device
paths and the desired devices cgroups permissions.

The [OCI Runtime Spec for Linux Container Configuration](https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md)
expects that in addition to the devices cgroup fields, more detailed information about the devices
must be provided:

```yaml
{
    "type": "<string>",
    "path": "<string>",
    "major": <int64>,
    "minor": <int64>,
    "fileMode": <uint32>,
    "uid": <uint32>,
    "gid": <uint32>
},
```

The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information
from the host for each `Device`. By default, the runtimes copy the host device's user and group IDs:

- `uid` (uint32, OPTIONAL) - id of device owner in the container namespace.
- `gid` (uint32, OPTIONAL) - id of device group in the container namespace.

Similarly, the runtimes prepare other mandatory `config.json` sections based on the CRI fields,
including the ones defined in `securityContext`: `runAsUser`/`runAsGroup`, which become part of the POSIX
platforms user structure via:

- `uid` (int, REQUIRED) specifies the user ID in the container namespace.
- `gid` (int, REQUIRED) specifies the group ID in the container namespace.
- `additionalGids` (array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.

However, the resulting `config.json` triggers a problem when trying to run containers with
both devices added and with non-root uid/gid set via `runAsUser`/`runAsGroup`: the container user process
has no permission to use the device even when the device's group ID (gid, copied from the host) permits access by
non-root groups. This is because the container user does not belong to that host group (e.g., via `additionalGids`).

Being able to run applications that use devices as a non-root user is normal and expected to work so that
the security principles can be met. Therefore, several alternatives were considered to fill the gap with what the PodSec/CRI/OCI supports today.

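The access failure follows from ordinary Unix permission arithmetic. The sketch below is a simplified model only: `can_access` is a hypothetical helper (not kubelet or runtime code), it ignores ACLs, capabilities, the root override, and the devices cgroup, and the device values (owner `root`, group ID 44, mode `0660`) are illustrative.

```python
def can_access(uid, gid, groups, dev_uid, dev_gid, mode, want):
    """Classic Unix check: owner bits, else group bits, else other bits.

    `want` is a 3-bit rwx mask (4 = read, 2 = write, 1 = execute).
    Simplified model: no root override, ACLs, capabilities, or cgroups.
    """
    if uid == dev_uid:
        bits = (mode >> 6) & 0o7
    elif gid == dev_gid or dev_gid in groups:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    return (bits & want) == want


RW = 0o6
# Device node with ownership copied from the host (illustrative values).
DEV = dict(dev_uid=0, dev_gid=44, mode=0o660)

# runAsUser: 1000, runAsGroup: 2000, no supplementalGroups: access is denied.
print(can_access(1000, 2000, [], want=RW, **DEV))    # False
# The same user with the host gid in its additional groups: access is allowed.
print(can_access(1000, 2000, [44], want=RW, **DEV))  # True
```

The second call shows why membership in the host group (for example via `additionalGids`) is what decides the outcome.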
## What was done to solve the issue?
You might have noticed from the problem definition that it would at least be possible to work around
the problem by manually adding the device gid(s) to `supplementalGroups`, or in
the case of just one device, setting `runAsGroup` to the device's group ID. However, this is problematic because the device gid(s) may have
different values depending on the nodes' distro/version in the cluster. For example, with GPUs the following commands for different distros and versions return different gids:

Fedora 33:

```
$ ls -l /dev/dri/
total 0
drwxr-xr-x. 2 root root         80 19.10. 10:21 by-path
crw-rw----+ 1 root video  226,   0 19.10. 10:42 card0
crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
$ grep -e video -e render /etc/group
video:x:39:
render:x:997:
```

Ubuntu 20.04:

```
$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root         80 19.10. 17:36 by-path
crw-rw---- 1 root video  226,   0 19.10. 17:36 card0
crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
$ grep -e video -e render /etc/group
video:x:44:
render:x:133:
```

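The mismatch is easy to see when the two `grep` outputs above are compared side by side. The following is a small sketch that parses `/etc/group`-style lines; `parse_group` is a hypothetical helper, and its input is just the text copied from the two listings.

```python
FEDORA_33 = """\
video:x:39:
render:x:997:
"""

UBUNTU_20_04 = """\
video:x:44:
render:x:133:
"""


def parse_group(text):
    """Map group name -> gid from /etc/group lines (name:passwd:gid:members)."""
    gids = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _passwd, gid, _members = line.split(":", 3)
        gids[name] = int(gid)
    return gids


fedora = parse_group(FEDORA_33)
ubuntu = parse_group(UBUNTU_20_04)
for name in ("video", "render"):
    # Print both gids and whether a single hard-coded value would fit both.
    print(name, fedora[name], ubuntu[name], fedora[name] == ubuntu[name])
# video 39 44 False
# render 997 133 False
```

No single `supplementalGroups` value matches both node types, so a hard-coded gid only works on a homogeneous cluster.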
Which number to choose in your `securityContext`? Also, what if the `runAsGroup`/`runAsUser` values cannot be hard-coded because
they are automatically assigned during pod admission time via external security policies?

Unlike volumes with `fsGroup`, the devices have no official notion of `deviceGroup`/`deviceUser` that the CRI runtimes (or kubelet)
would be able to use. We considered using container annotations set by the device plugins (e.g., `io.kubernetes.cri.hostDeviceSupplementalGroup/`) to get custom OCI `config.json` uid/gid values.
This would have required changes to all existing device plugins, which was not ideal.

Instead, a solution that is *seamless* to end-users without getting the device plugin vendors involved was preferred. The selected approach was
to re-use `runAsUser` and `runAsGroup` values in `config.json` for devices:

```yaml
{
    "type": "c",
    "path": "/dev/foo",
    "major": 123,
    "minor": 4,
    "fileMode": 438,
    "uid": <runAsUser>,
    "gid": <runAsGroup>
},
```

With `runc` OCI runtime (in non-rootless mode), the device is created (`mknod(2)`) in
the container namespace and the ownership is changed to `runAsUser`/`runAsGroup` using `chown(2)`.

{{< note >}}
Using devices in [rootless mode](/docs/tasks/administer-cluster/kubelet-in-userns/) is not supported.
{{</note>}}
Having the ownership updated in the container namespace is justified as the user process is the only one accessing the device. Only `runAsUser`/`runAsGroup`
are taken into account, and, e.g., the `USER` setting in the container is currently ignored.

While it is likely that the "faulty" deployments (i.e., non-root `securityContext` + devices) do not exist, to be absolutely sure no
deployments break, an opt-in config entry to enable the new behavior was added to both containerd and CRI-O. The setting

`device_ownership_from_security_context (bool)`

defaults to `false` and must be enabled to use the feature.

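The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the containerd or CRI-O source: `device_ownership` and its parameters are hypothetical names, and the sketch only covers the case where both `runAsUser` and `runAsGroup` are set. It also shows that `fileMode` in `config.json` is a decimal integer, so `438` corresponds to mode `0666`.

```python
def device_ownership(host_uid, host_gid, run_as_user, run_as_group,
                     from_security_context=False):
    """Return the (uid, gid) to write into the OCI device entry.

    Hypothetical helper. With the opt-in disabled, the host IDs are copied.
    With it enabled, the runAsUser/runAsGroup values are used instead.
    The case of an unset runAs* value is deliberately not modeled here.
    """
    if from_security_context:
        return run_as_user, run_as_group
    return host_uid, host_gid


# Default behavior: host ownership (illustrative root / gid 44) is copied.
print(device_ownership(0, 44, 1000, 2000))
# (0, 44)

# Opt-in enabled: the device node is owned by runAsUser/runAsGroup.
print(device_ownership(0, 44, 1000, 2000, from_security_context=True))
# (1000, 2000)

# fileMode is stored in decimal; 438 is rw-rw-rw-.
print(oct(438))
# 0o666
```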
## See non-root containers using devices after the fix
To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application using hardware accelerators, Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:

```toml
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true
```

or CRI-O with:

```toml
[crio.runtime]
device_ownership_from_security_context = true
```

and the `Guaranteed` QoS Class Pod that runs DPDK's crypto-perf test utility with this YAML:

```yaml
...
metadata:
  name: qat-dpdk
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
  - name: crypto-perf
    image: intel/crypto-perf:devel
    ...
    resources:
      requests:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
      limits:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
  ...
```

To verify the results, check the user and group ID that the container runs as:

```
$ kubectl exec -it qat-dpdk -c crypto-perf -- id
```

They are set to non-zero values as expected:

```
uid=1000 gid=2000 groups=2000,3000
```

Next, check that the device nodes (`qat.intel.com/generic` exposes `/dev/vfio/` devices) are accessible to `runAsUser`/`runAsGroup`:

```
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
total 0
drwxr-xr-x 2 root root      140 Sep  7 10:55 .
drwxr-xr-x 7 root root      380 Sep  7 10:55 ..
crw------- 1 1000 2000 241,   0 Sep  7 10:55 58
crw------- 1 1000 2000 241,   2 Sep  7 10:55 60
crw------- 1 1000 2000 241,  10 Sep  7 10:55 68
crw------- 1 1000 2000 241,  11 Sep  7 10:55 69
crw-rw-rw- 1 1000 2000  10, 196 Sep  7 10:55 vfio
```

Finally, check that the non-root container is also allowed to create HugePages:

```
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/
```

`fsGroup` gives `runAsUser` a writable HugePages emptyDir mount point:

```
total 0
drwxrwsr-x 2 root 3000   0 Sep  7 10:55 .
drwxr-xr-x 7 root root 380 Sep  7 10:55 ..
```

## Help us test it and provide feedback!
The functionality described here is expected to help with cluster security and the configurability of device permissions. Allowing
non-root containers to use devices requires cluster admins to opt in to the functionality by setting
`device_ownership_from_security_context = true`. To make it a default setting, please test it and provide your feedback (via SIG-Node meetings or issues)!
The flag is available in CRI-O v1.22 release and queued for containerd v1.6.

More work is needed to get it *properly* supported. It is known to work with `runc` but it also needs to be made to function
with other OCI runtimes, where applicable. For instance, Kata Containers supports device passthrough, which allows it to make devices
available to containers in VM sandboxes too.

Moreover, an additional challenge comes with supporting user namespaces and devices. This problem is still [open](https://github.com/kubernetes/enhancements/pull/2101)
and requires more brainstorming.

Finally, it needs to be understood whether `runAsUser`/`runAsGroup` are enough or if device specific settings similar to `fsGroup` are needed in PodSpec/CRI v2.

## Thanks
My thanks go to Mike Brown (IBM, containerd), Peter Hunt (Red Hat, CRI-O), and Alexander Kanevskiy (Intel) for providing all the feedback and good conversations.