podman/libpod/define
Valentin Rothberg 30e7cbccc1 libpod: fix wait and exit-code logic
This commit addresses three intertwined bugs to fix an issue when using
Gitlab runner on Podman.  The three bug fixes are not split into
separate commits, as tests would not pass otherwise; splitting them
would only add avoidable noise when bisecting future issues.

1) Podman conflated states: even when asking to wait for the `exited`
   state, Podman returned as soon as a container transitioned to
   `stopped`.  The issue caused Gitlab tests to fail [1], as `conmon`'s
   buffers had not (yet) been emptied when attaching to a container
   right after a wait.  The race window was extremely narrow, and I
   only managed to reproduce it with the Gitlab runner [1] unit tests.

2) The clearer separation between `exited` and `stopped` revealed a race
   condition predating the changes.  If a container is configured for
   autoremoval (e.g., via `run --rm`), the "run" process competes with
   the "cleanup" process running in the background.  The window of the
   race condition was sufficiently large that the "cleanup" process had
   already removed the container and storage before the "run" process
   could read the exit code; the "run" process hence waited
   indefinitely.

   Address the exit-code race condition by recording exit codes in the
   main libpod database.  Exit codes can now be read from a database.
   When waiting for a container to exit, Podman first waits for the
   container to transition to `exited` and will then query the database
   for its exit code. Outdated exit codes are pruned during cleanup
   (i.e., non-performance critical) and when refreshing the database
   after a reboot.  An exit code is considered outdated when it is older
   than 5 minutes.

   While the race condition predates this change, the waiting process
   has apparently always been fast enough in catching the exit code due
   to issue 1): `exited` and `stopped` were conflated.  The waiting
   process hence caught the exit code after the container transitioned
   to `stopped` but before it `exited` and got removed.

3) With 1) and 2), Podman is now waiting for a container to properly
   transition to the `exited` state.  Some tests did not pass after 1)
   and 2) which revealed the third bug: `conmon` was executed with its
   working directory pointing to the OCI runtime bundle of the
   container.  The changed working directory broke resolving relative
   paths in the "cleanup" process.  The "cleanup" process errored out
   before actually cleaning up the container, and the waiting "main"
   process ran indefinitely, or until hitting a timeout.  Fix the issue
   by executing `conmon` with the same working directory as Podman.

Note that fixing 3) *may* address a number of issues we have seen in the
past where for *some* reason cleanup processes did not fire.
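
To illustrate why the working directory matters for 3), here is a
minimal sketch of how one and the same relative path resolves to
different locations depending on the directory a process runs in.  The
helper `resolveFrom` and all directory names are hypothetical and only
serve as an example; they are not taken from Podman or `conmon`.

```go
package main

import "path"

// resolveFrom mimics how a process resolves a slash-separated path
// against its working directory: absolute paths are kept as they are,
// relative paths are joined onto workDir.
func resolveFrom(workDir, p string) string {
	if path.IsAbs(p) {
		return path.Clean(p)
	}
	return path.Join(workDir, p)
}
```

With these made-up inputs, `resolveFrom("/home/user", "data/db")`
yields `/home/user/data/db`, whereas `resolveFrom("/run/bundle",
"data/db")` yields `/run/bundle/data/db`.  A child process inheriting
the second working directory would therefore look in the wrong place.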

[1] https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27119#note_970712864

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

[MH: Minor reword of commit message]

Signed-off-by: Matthew Heon <mheon@redhat.com>
2022-06-23 09:11:57 -04:00
..
annotations.go Truncate annotations when generating kubernetes yaml files 2022-04-27 04:39:05 -04:00
checkpoint_restore.go Added optional container checkpointing statistics 2021-11-15 11:50:24 +00:00
config.go use libnetwork from c/common 2022-01-12 17:07:30 +01:00
container.go fix --init with /dev bind mount 2022-05-23 13:59:05 +02:00
container_inspect.go golangci-lint: enable nolintlint 2022-06-14 16:29:42 +02:00
containerstate.go podman stats: improve cpu average calc 2022-03-22 17:44:58 +01:00
diff.go podman diff accept two images or containers 2021-07-02 17:11:56 +02:00
errors.go libpod: fix wait and exit-code logic 2022-06-23 09:11:57 -04:00
exec_codes.go Revert "Exec: use ErrorConmonRead" 2020-03-09 09:50:40 -04:00
fileinfo.go Fixes from make codespell 2021-04-21 13:16:33 -04:00
healthchecks.go fix healthcheck timeouts and ut8 coercion 2022-01-06 13:56:54 -06:00
info.go Add Authorixation field to Plugins for Info 2022-05-26 11:15:48 -07:00
mount.go separate file with mount consts in libpod/define 2021-03-07 12:01:04 +01:00
pod_inspect.go Fix swagger model of `InspectPodResponse` 2022-05-26 16:34:05 +02:00
podstate.go Add a Degraded state to pods 2020-10-21 13:31:40 -04:00
runtime.go Add support for containers.conf 2020-03-27 14:36:03 -04:00
terminal.go prune remotecommand dependency 2021-02-25 10:02:41 -06:00
version.go Add 'Os' to be queried via 'version' output 2022-03-29 18:10:59 -04:00
volume_inspect.go Set volume NeedsCopyUp to false iff data was copied up 2022-01-06 10:42:34 -05:00