podman/libpod
Matthew Heon ebacfbd091 podman: fix memleak caused by renaming and not deleting
the exit file

If the container exit code needs to be retained, it cannot be retained
in tmpfs, because libpod runs in a memcg itself so it can't leave
traces with a daemon-less design.

This wasn't a memleak detectable by kmemleak for example. The kernel
never lost track of the memory and there was no erroneous refcounting
either. The reference count dependencies however are not easy to track
because when a refcount is increased, there's no way to tell who's
still holding the reference. In this case it was a single page of
tmpfs pagecache holding a refcount that pinned a whole hierarchy of
dying memcg, slab kmem, cgroups, unreachable kernfs nodes and the
respective dentries and inodes. Such a problem wouldn't happen if the
exit file were stored in a regular filesystem because the pagecache
could be reclaimed in that case under memory pressure. The tmpfs page
can be swapped out, but that's not enough to release the memcg with
CONFIG_MEMCG_SWAP_ENABLED=y.

No amount of more aggressive kernel slab shrinking could have solved
this. Not even assigning slab kmem of dying cgroups to a live cgroup
would fully solve this. The only way to free the memory of a dying
cgroup when a struct page still references it would be to loop over
all "struct page" in the kernel to find which one is associated with
the dying cgroup, which is an O(N) operation (where N is the number of
pages and can reach billions). Linking all the tmpfs pages to the
memcg would cost less during memcg offlining, but it would waste lots
of memory and CPU globally. So this can't be optimized in the kernel.

A cronjob running this command can act as a workaround and will allow
all slab cache to be released, not just the single tmpfs pages.

    rm -f /run/libpod/exits/*

This patch solved the memleak with a reproducer, booting with
cgroup.memory=nokmem and with selinux disabled. memcg kmem and selinux
were disabled for testing of this fix because kmem greatly decreases
the kernel's effectiveness in reusing partial slab objects.
cgroup.memory=nokmem is strongly recommended at least for
workstation usage. selinux needs to be further analyzed because it
causes further slab allocations.

The upstream podman commit used for testing is
1fe2965e4f (v1.4.4).

The upstream kernel commit used for testing is
f16fea666898dbdd7812ce94068c76da3e3fcf1e (v5.2-rc6).

Reported-by: Michele Baldessari <michele@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

<Applied with small tweaks to comments>
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-31 17:28:42 -04:00
common Set blob cache directory based on GraphDriver 2019-03-29 08:27:33 -04:00
define refactor to reduce duplicated error parsing 2019-07-23 16:49:04 -04:00
driver Begin to break up pkg/inspect 2019-06-03 15:54:53 -04:00
events golangci-lint pass number 2 2019-07-11 09:13:06 -05:00
image Fix possible runtime panic if image history len is zero 2019-07-25 12:45:08 +02:00
layers Initial checkin from CRI-O repo 2017-11-01 11:24:59 -04:00
lock trivial cleanups from golang 2019-07-03 15:41:33 -05:00
logs golangci-lint pass number 2 2019-07-11 09:13:06 -05:00
boltdb_state.go first pass of corrections for golangci-lint 2019-07-10 15:52:17 -05:00
boltdb_state_internal.go first pass of corrections for golangci-lint 2019-07-10 15:52:17 -05:00
boltdb_state_linux.go podman-remote inspect 2019-01-18 15:43:11 -06:00
boltdb_state_unsupported.go podman-remote inspect 2019-01-18 15:43:11 -06:00
common_test.go code cleanup 2019-07-08 09:18:11 -05:00
container.go golangci-lint round #3 2019-07-21 14:22:39 -05:00
container.log.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
container_api.go move editing of exitCode to runtime 2019-07-23 13:29:33 -04:00
container_commit.go Fix commit --changes env=X=Y 2019-07-26 16:04:17 -07:00
container_graph.go golangci-lint pass number 2 2019-07-11 09:13:06 -05:00
container_graph_test.go Update unit tests to use in-memory lock manager 2019-01-04 09:51:09 -05:00
container_inspect.go golangci-lint phase 4 2019-07-22 15:44:04 -05:00
container_internal.go podman: fix memleak caused by renaming and not deleting 2019-07-31 17:28:42 -04:00
container_internal_linux.go podman: support --userns=ns|container 2019-07-25 23:04:55 +02:00
container_internal_test.go Potentially breaking: Make hooks sort order locale-independent 2019-04-09 21:08:44 +02:00
container_internal_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
container_linux.go Do not fetch pod and ctr State on retrieval in Bolt 2018-07-31 14:19:50 +00:00
container_log_linux.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
container_log_unsupported.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
container_top_linux.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
container_top_unsupported.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
container_unsupported.go Do not fetch pod and ctr State on retrieval in Bolt 2018-07-31 14:19:50 +00:00
diff.go Add function to get a filtered tarstream diff 2019-07-11 14:43:34 +02:00
events.go get last container event 2019-07-07 08:54:20 -05:00
healthcheck.go Implement conmon exec 2019-07-22 15:57:23 -04:00
healthcheck_linux.go golangci-lint pass number 2 2019-07-11 09:13:06 -05:00
healthcheck_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
in_memory_state.go remove libpod from main 2019-06-25 13:51:24 -05:00
info.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
kube.go golangci-lint round #3 2019-07-21 14:22:39 -05:00
mounts_linux.go set root propagation based on volume properties 2018-11-26 13:55:02 +01:00
networking_linux.go golangci-lint pass number 2 2019-07-11 09:13:06 -05:00
networking_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
oci.go Implement conmon exec 2019-07-22 15:57:23 -04:00
oci_attach_linux.go golangci-lint cleanup 2019-07-23 10:13:04 -05:00
oci_attach_linux_cgo.go Implement conmon exec 2019-07-22 15:57:23 -04:00
oci_attach_linux_nocgo.go Implement conmon exec 2019-07-22 15:57:23 -04:00
oci_attach_unsupported.go Implement conmon exec 2019-07-22 15:57:23 -04:00
oci_internal_linux.go golangci-lint cleanup 2019-07-23 10:13:04 -05:00
oci_linux.go Implement conmon exec 2019-07-22 15:57:23 -04:00
oci_unsupported.go Implement conmon exec 2019-07-22 15:57:23 -04:00
options.go podman: support --userns=ns|container 2019-07-25 23:04:55 +02:00
pod.go remove libpod from main 2019-06-25 13:51:24 -05:00
pod_api.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
pod_internal.go remove libpod from main 2019-06-25 13:51:24 -05:00
pod_top_linux.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
pod_top_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime.go Update libpod.conf to be NixOS friendly 2019-07-30 12:59:11 +02:00
runtime_cstorage.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_ctr.go golangci-lint round #3 2019-07-21 14:22:39 -05:00
runtime_img.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_img_test.go switch projectatomic to containers 2018-08-16 17:12:36 +00:00
runtime_migrate.go code cleanup 2019-07-08 09:18:11 -05:00
runtime_migrate_unsupported.go system: migrate stops the pause process 2019-05-17 20:48:25 +02:00
runtime_pod.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_pod_infra_linux.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_pod_linux.go golangci-lint round #3 2019-07-21 14:22:39 -05:00
runtime_pod_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_renumber.go Add System event type and renumber, refresh events 2019-04-25 16:23:09 -04:00
runtime_volume.go When retrieving volumes, only use exact names 2019-07-24 22:30:16 -04:00
runtime_volume_linux.go remove libpod from main 2019-06-25 13:51:24 -05:00
runtime_volume_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
state.go Switch Libpod over to new explicit named volumes 2019-04-04 12:26:29 -04:00
state_test.go libpod removal from main (phase 2) 2019-06-27 07:56:24 -05:00
stats.go Build fix for 32-bit systems. 2019-07-30 12:25:36 -04:00
stats_config.go changes to allow for darwin compilation 2018-06-29 20:44:09 +00:00
stats_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
storage.go remove libpod from main 2019-06-25 13:51:24 -05:00
util.go code cleanup 2019-07-08 09:18:11 -05:00
util_linux.go stats: fix cgroup path for rootless containers 2019-06-26 13:17:06 +02:00
util_test.go Stage3 Image Library 2018-03-14 20:21:31 +00:00
util_unsupported.go remove libpod from main 2019-06-25 13:51:24 -05:00
volume.go Purge all use of easyjson and ffjson in libpod 2019-06-13 11:03:20 -04:00
volume_internal.go Remove locks from volumes 2019-02-21 10:51:42 -05:00