HostPath Volume Propagation

Abstract

A proposal to add support for propagation mode in the HostPath volume, which allows mounts made within a container to be visible outside the container, and mounts made on the host after pod creation to be visible to containers. The kernel's propagation [modes](https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt) are "shared", "slave", "private", and "unbindable"; of these, Docker supports "shared", "slave", and "private".
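
As an illustration of what these kernel modes mean (not part of the proposal), the sketch below marks a mount point with each mode using the raw mount(2) flags; the path /mnt/data is a placeholder.

package main

import "golang.org/x/sys/unix"

// Illustrative sketch only: how the kernel propagation modes map to mount(2)
// flags on Linux. Run as root; /mnt/data must already be a mount point.
func main() {
	// "shared": mounts made under /mnt/data propagate to peer namespaces,
	// and mounts made by peers propagate back (mount --make-shared).
	_ = unix.Mount("", "/mnt/data", "", unix.MS_SHARED, "")

	// "slave": /mnt/data receives mounts from its master namespace but does
	// not propagate its own mounts back (mount --make-slave).
	_ = unix.Mount("", "/mnt/data", "", unix.MS_SLAVE, "")

	// "private": no propagation in either direction (mount --make-private).
	_ = unix.Mount("", "/mnt/data", "", unix.MS_PRIVATE, "")
}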

Several existing issues and PRs have already been created on this subject.

Use Cases

  1. (From @Kaffa-MY) Our team is attempting to containerize flocker with zfs as back-end storage and launch them in a DaemonSet. Containers on the same flocker node need to read/write and share the same mounted volume. Currently the mount propagation mode between the host and the container cannot be specified, so the volume mount of each container is isolated from the others. This use case is also referenced by Containerized Volume Client Drivers - Design Proposal [#22216](https://github.com/kubernetes/kubernetes/pull/22216).

  2. (From @majewsky) I'm currently putting the [OpenStack Swift object storage](https://github.com/openstack/swift) into k8s on CoreOS. Swift's storage services expect storage drives to be mounted at /srv/node/{drive-id} (where {drive-id} is defined by the cluster's ring, the topology description data structure which is shared between all cluster members). Because there are several such services on each node (about a dozen, actually), I assemble /srv/node in the host mount namespace and pass it into the containers as a hostPath volume. Swift is designed so that drives can be mounted and unmounted at any time (most importantly to hot-swap failed drives) while the services keep running, but if the services run in a private mount namespace, they won't see the mounts/unmounts performed in the host mount namespace until the containers are restarted. A slave mount namespace is the correct solution for this AFAICS. Until this becomes available in k8s, we will have to have operations restart containers manually based on monitoring alerts.

  3. (From @victorgp) When using CoreOS Container Linux, which does not provide external fuse filesystems (in our case, GlusterFS), you need a container to perform the mounts. The only way to make those mounts visible on the host, and hence visible to other containers, is by sharing the mount propagation.

  4. (From @YorikSar) For the OpenStack Neutron project, we need the network namespaces it creates to persist across restarts of the pods running Neutron agents. Without this we have unnecessary data-plane downtime during rolling updates of these agents. The Neutron L3 agent creates interfaces and iptables rules for each virtual router in a separate network namespace. To manage them it uses the ip netns command, which creates persistent network namespaces by calling unshare(CLONE_NEWNET) and then bind-mounting the new network namespace's inode from /proc/self/ns/net to a file with the specified name in the /run/netns dir. These bind mounts are the only remaining references to these namespaces. When we restart the pod, its mount namespace is destroyed along with all these bind mounts, so all network namespaces created by the agent are gone. For them to survive we need to bind-mount a dir from the host mount namespace into the container's with the shared flag, so that all bind mounts are propagated across mount namespaces and the references to the network namespaces persist.

Implementation Alternatives

Add an option in VolumeMount API

The new VolumeMount will look like:

type VolumeMount struct {
	// Required: This must match the Name of a Volume [above].
	Name string `json:"name"`
	// Optional: Defaults to false (read-write).
	ReadOnly bool `json:"readOnly,omitempty"`
	// Required.
	MountPath string `json:"mountPath"`
	// Optional.
	Propagation string `json:"propagation,omitempty"`
}
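
For illustration only, a container could then request a slaved HostPath mount roughly like this (a sketch against the struct above; the Propagation field and the values shown are hypothetical and not part of the current API):

// Sketch only: uses the hypothetical Propagation field proposed above.
mount := VolumeMount{
	Name:        "host-srv-node", // must match a HostPath volume declared in the pod
	MountPath:   "/srv/node",
	ReadOnly:    false,
	Propagation: "slave", // receive mounts/unmounts made on the host after pod creation
}
_ = mount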

Opinions against this:

  1. This will affect all volumes, while only HostPath needs it.

  2. This needs an API change, which is discouraged.

Add an option in HostPathVolumeSource

The new HostPathVolumeSource will look like:

type PropagationMode string

const (
	PropagationShared  PropagationMode = "Shared"
	PropagationSlave   PropagationMode = "Slave"
	PropagationPrivate PropagationMode = "Private"
)

type HostPathVolumeSource struct {
	Path string `json:"path"`
	// Mount the host path with propagation mode specified. Docker only.
	Propagation PropagationMode `json:"propagation,omitempty"`
}
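
For illustration only, a HostPath volume with this hypothetical field could be declared roughly as follows (a sketch against the struct above; the values are placeholders):

// Sketch only: uses the hypothetical Propagation field proposed above.
hostPath := HostPathVolumeSource{
	Path:        "/srv/node",
	Propagation: PropagationSlave, // every container using this volume gets the same mode
}
_ = hostPath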

Opinions against this:

  1. This needs an API change, which is discouraged.

  2. All containers using this volume will share the same propagation mode.

  3. (From @jonboulle) This may cause cross-runtime compatibility issues.

Make HostPath shared for privileged containers, slave for non-privileged.

Given that only HostPath needs this feature, and that CAP_SYS_ADMIN is required to make mounts inside a container, we can tie the propagation mode to the existing privileged option, or introduce a new option in SecurityContext to control it.

The propagation mode could be determined by the following logic:

// Environment check to ensure "rshared" is supported.
if !dockerNewerThanV110 || !mountPathIsShared {
	return ""
}
if container.SecurityContext.Privileged {
	return "rshared"
}
return "rslave"

Opinions against this:

  1. This changes the behavior of existing configs.

  2. (From @euank) "shared" is not correctly supported by some kernels; we need a runtime support matrix and a plan for when that will be addressed.

  3. This may fail silently and become a debuggability nightmare on many distros.

  4. (From @euank) Changing those mount flags may make docker even less stable; it may accidentally lock up the kernel or potentially leak mounts.

Decision

We will take 'Make HostPath shared for privileged containers, slave for non-privileged'. An environment check will be performed, and a WARNING will be logged when the propagation mode is not supported.
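
A minimal sketch of that check, reusing the placeholder names from the snippet above (dockerNewerThanV110 and mountPathIsShared are illustrative, not actual kubelet identifiers):

// Sketch only: warn and fall back to no propagation flag when the
// environment cannot support "rshared"/"rslave".
if !dockerNewerThanV110 || !mountPathIsShared {
	glog.Warningf("mount propagation is unsupported (requires docker >= 1.10 and a shared kubelet root dir); HostPath volumes will not propagate mounts")
	return ""
}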

Extra Concerns

@lucab and @euank have some extra concerns about pod isolation when propagation modes are changed, listed below:

  1. How to clean up such pod resources (as mounts now cross pod boundaries, they can be kept busy indefinitely by processes outside of the pod).

  2. Side effects on restarts (possibly piling up layers of full-propagation mounts).

  3. How this interacts with other mount features (nested volumeMounts may or may not propagate back to the host, depending on the ordering of mount operations).

  4. Limitations this imposes on runtimes (RO-remounting may now affect the host; is that on purpose or a dangerous side effect?).

  5. A shared mount target imposes some constraints on its parent subtree (generally, it has to be shared as well), which in turn prevents some mount operations when preparing a pod (e.g. MS_MOVE).

  6. The "on-by-default" nature means existing hostPath mounts, which used to be harmless, could begin consuming kernel resources and cause a node to crash. Even if a pod does not create any new mountpoints under its hostPath bindmount, it's not hard to reach multiplicative explosions with shared bindmounts, so the change in default plus no cleanup could result in existing workloads knocking the node over.

These concerns are valid, and we decided to limit the propagation mode to the HostPath volume only. For HostPath, we expect that the runtime should NOT perform any additional actions (such as cleanup). This behavior is consistent with current HostPath logic: kube does not take care of the contents of a HostPath either.
