15 Commits

Author SHA1 Message Date
Valentin Rothberg 8684d41e38 k8systemd: run k8s workloads in systemd
Support running `podman play kube` in systemd by exploiting the
previously added "service containers".  During `play kube`, a service
container is started before all the pods and containers, and is stopped
last.  The service container communicates its conmon PID via sdnotify.

Add a new systemd template to dispatch such k8s workloads.  The argument
of the template is the path to the k8s file.  Note that the path must be
escaped for systemd not to bark:

Let's assume we have a `top.yaml` file in the home directory:
```
$ escaped=$(systemd-escape ~/top.yaml)
$ systemctl --user start podman-play-kube@$escaped.service
```

Closes: https://issues.redhat.com/browse/RUN-1287
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-05-17 10:18:58 +02:00
Valentin Rothberg 03af8213ce sdnotify: send MAINPID only once
Send the main PID only once.  Previously, `(*Container).start()` and
the conmon handler sent them ~simultaneously and went into a race.

I noticed the issue while debugging a WIP PR.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-05-12 11:11:37 +02:00
Ed Santiago cc42321697 sdnotify test: accept MAINPID anywhere
systemd sometimes spits out lines in the wrong order. Deal with it.

This fixes an infrequent flake that I haven't filed because I
didn't understand it well enough. (Hence, this reduces BUGS
but does not reduce BUG COUNT. Sorry!)

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-09-30 12:09:48 -06:00
Ed Santiago e3c7e02a0e System tests: add cleanup & debugging output
Cleanup: the final 'play' test wasn't cleaning up after itself,
leading to angry warning messages when rerunning tests (in
my environment; never in CI).

Debug: I'm seeing a lot of "Could not parse READY=1 as MAINPID=nnn"
flakes in the sdnotify:container test (nine in the past month). Add
debug traces to help diagnose in future flakes.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-09-01 11:29:59 -06:00
Daniel J Walsh c22f3e8b4e Implement SD-NOTIFY proxy in conmon
This leverages conmon's ability to proxy the SD-NOTIFY socket.
This prevents locking caused by OCI runtime blocking, waiting for
SD-NOTIFY messages, and instead passes the messages directly up
to the host.

NOTE: Also re-enable the auto-update tests, which have been disabled due
to flakiness.  With this change, Podman properly integrates into
systemd.

Fixes: #7316
Signed-off-by: Joseph Gooch <mrwizard@dok.org>
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
2021-08-20 11:12:05 +02:00
Ed Santiago 9fd7ab50f8 System tests: honor $OCI_RUNTIME (for CI)
Some CI systems set $OCI_RUNTIME as a way to override the
default crun. Integration (e2e) tests honor this, but system
tests were not aware of the convention; this means we haven't
been testing system tests with runc, which means RHEL gating
tests are now failing.

The proper solution would be to edit containers.conf on CI
systems. Sorry, that would involve too much CI-VM work.
Instead, this PR detects $OCI_RUNTIME and creates a dummy
containers.conf file using that runtime.

Add: various skips for tests that don't work with runc.

Refactor: add a helper function so we don't need to do
the complicated 'podman info blah blah .OCIRuntime.blah'
thing in many places.

BUG: we leave a tmp file behind on exit.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-05-03 20:15:21 -06:00
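
The $OCI_RUNTIME handling described in 9fd7ab50f8 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `write_runtime_conf` is a hypothetical helper name, and the sketch assumes that containers.conf accepts a `runtime` key under `[engine]` and that podman reads the file named by `CONTAINERS_CONF`.

```shell
# Hypothetical helper (not from the commit): write a throwaway
# containers.conf that selects the given OCI runtime and print its path.
write_runtime_conf() {
    runtime=$1
    conf=$(mktemp) || return 1
    # Assumption: the [engine] table's "runtime" key picks the default runtime.
    printf '[engine]\nruntime = "%s"\n' "$runtime" > "$conf"
    echo "$conf"
}

# Only override the configuration when CI has set $OCI_RUNTIME.
if [ -n "${OCI_RUNTIME:-}" ]; then
    CONTAINERS_CONF=$(write_runtime_conf "$OCI_RUNTIME")
    export CONTAINERS_CONF
fi
```

Like the change it illustrates, this sketch leaves the temporary file behind on exit.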
Ed Santiago 660a72993c sdnotify tests: try real hard to kill socat processes
podman gating tests are hanging in the new Fedora CI setup;
long and tedious investigation suggests that 'socat' processes
are being left unkilled, which then causes BATS to hang when
it (presumably) runs a final 'wait' in its end cleanup.

The two principal changes are to exec socat in a subshell
with fd3 closed, and to pkill its child processes before
killing the process itself. I don't know if both are needed.
The pkill definitely is; the exec may just be superstition.
Since I've wasted more than a day of PTO time on this, I'm
okay with a little superstition. What I do know is that with
these two changes, my reproducer fails to reproduce in over
one hour of trying (normally it fails within 5 minutes).

AND, update: only rawhide (f35) leaves stray socat processes
behind. f33 and ubuntu do not, so 'pkill -P' fails.

I really have no idea what's going on.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-03-11 16:21:51 -07:00
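
The two changes described in 660a72993c (exec in a subshell with fd 3 closed, then pkill the children before killing the process) might look something like the sketch below. The helper names `start_bg` and `stop_bg` and the variable `BG_PID` are hypothetical, and the actual test code may differ; the sketch only assumes that BATS keeps fd 3 open for its own output and that `pkill -P` exits nonzero when there are no children to kill.

```shell
# Hypothetical helpers illustrating the approach; not the real test code.
start_bg() {
    # Run the command in a subshell, replacing the subshell via exec,
    # with fd 3 closed so the child does not hold it open.
    ( exec "$@" 3>&- ) &
    BG_PID=$!
}

stop_bg() {
    # Kill any child processes first. pkill fails when there are none
    # (reported for f33 and ubuntu), so the failure is ignored.
    pkill -P "$BG_PID" || true
    # Then kill the process itself and reap it.
    kill "$BG_PID" 2>/dev/null || true
    wait "$BG_PID" 2>/dev/null || true
}
```

A test would call `start_bg` with its socat command line and call `stop_bg` during cleanup.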
Ed Santiago 1345d0358b system tests: the catch-up game
- run test: minor cleanup to .containerenv test. Basically,
  make it do only two podman-runs (they're expensive) and
  tighten up the results checks

- ps test: add ps -a --storage. Requires small tweak to
  run_podman helper, so we can have "timeout" be an expected
  result

- sdnotify test: workaround for #8718 (seeing MAINPID=xxx as
  last output line instead of READY=1). As found by the
  newly-added debugging echos, what we are seeing is:

      MAINPID=103530
      READY=1
      MAINPID=103530

  It's not supposed to be that way; it's supposed to be just
  the first two. But when faced with reality, we must bend
  to accommodate it, so let's accept READY=1 anywhere in
  the output stream, not just as the last line.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-12-14 15:06:43 -07:00
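
The sdnotify workaround in 1345d0358b, accepting READY=1 anywhere in the output rather than only on the last line, can be illustrated with a small check. This is a sketch under assumptions: `saw_ready` is a hypothetical helper, and the sample log simply reuses the three lines quoted in the commit message.

```shell
# Hypothetical helper: succeeds if READY=1 appears as a complete line
# anywhere in the captured notify output, not only as the last line.
saw_ready() {
    printf '%s\n' "$1" | grep -qx 'READY=1'
}

# Sample output taken from the commit message above.
log='MAINPID=103530
READY=1
MAINPID=103530'

if saw_ready "$log"; then
    echo "READY=1 seen"
fi
```

Matching whole lines with `grep -x` keeps the check from accepting a line that merely contains the string.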
Ed Santiago f5b3dc976c Tests: Fix common flakes, and improve apiv2 test log
- apiv2 - the 'ten /info requests' test is flaking often,
  taking ~8 seconds (our limit is 7, up from 5 a few weeks
  ago). Brent suggested that the first /info call might be
  expensive, because it needs to access storage. So, let's
  prime it by running one /info outside the timing loop.
  And, because even that continues to fail, bump it up
  to 10 seconds and file #8076 to track the slowdown.

- toolbox test - WaitForReady() has timed out, even on one
  occasion causing a run failure because it failed 3 times.
  Solution: bump up timeout from 2s to 5s. Not really great,
  but CI systems are underpowered, and it's not unreasonable
  that 2s might be too low.

- sdnotify test - add a 'podman wait' between stop & rm.
  This may prevent a "cannot rm container as it is running"
  race condition.

While working on this, Brent and I noticed a few ways that
test-apiv2 logging can be improved:

- test name: when request is POST, display the jsonified
  parameters, not the original input ones. This should
  make it much easier to reproduce failures.

- use curl's "--write-out" option to capture http code,
  content type, and request time. We were getting the
  first two via grep from logged headers; this is cleaner.
  And there was no other way to get timing. We now include
  the timing as X-Response-Time in the log file.

- abort on *any* curl error, not just 7 (cannot connect).
  Any error at all from curl is bad news.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-10-20 11:32:49 -06:00
Ed Santiago 1646da834c System test additions
- run --userns=keep-id: confirm that $HOME gets set (#8013)

- inspect: confirm that JSON output is a sane number of
  lines (10 or more), not an unreadable one-liner (#8011
  and #8021). Do so with image, pod, network, volume
  because the code paths might be different.

- cgroups: confirm that 'run' preserves cgroup manager (#7970)

- sdnotify: reenable tests, and hope CI doesn't hang. This
  test was disabled on August 18 because CI jobs were hanging
  and timing out. My suspicion was that it was #7316, which
  in turn seems to have hinged on conmon #182. The latter
  was merged on Sep 16, so let's cross our fingers and see
  what happens.

Also: remove inaccurate warning from a networking test.

And, wow, fix is_cgroupsv2(), it has never actually worked.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-10-14 15:32:02 -06:00
Ed Santiago a9dbd2b3de Migrate away from docker.io
CI and system tests currently pull some images from docker.io.
Eliminate that, by:

  - building a custom image containing much of what we need
    for testing; and
  - copying other needed images to quay.io

(Reason: effective 2020-11-01 docker.io will limit the
number of image pulls).

The principal change is to create a new quay.io/libpod/testimage,
using the new test/system/build-testimage script, instead of
relying on quay.io/libpod/alpine_labels. We also switch to
using a hardcoded :YYYYMMDD tag, instead of :latest, in an
attempt to futureproof our CI. This image includes 'httpd'
from busybox-extras, which we use in our networking test
(previously we had to pull and run busybox from docker.io).

The testimage can and should be extended as needed for future
tests, e.g. adding test file content or other useful tools.

For the '--pull' tests which require actually pulling from
the registry, I've created an image with the same name but
tagged :00000000 so it will never be pulled by default.
Since this image is only used minimally, it's just busybox.

Unfortunately there remain two cases we cannot solve in
this tiny alpine-based image:

  1) docker registry
  2) systemd

For those, I've (manually) run:

    podman pull [ docker.io/library/registry:2.7 | registry.fedoraproject.org/fedora:31 ]
    podman tag !$ quay.io/...
    podman push !$

...and amended the calling tests accordingly.

I've tried to make the smallest reasonable diff, not the
smallest possible one. I hope it's a reasonable tradeoff.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-09-08 06:06:06 -06:00
Ed Santiago d254fa4c35 system tests: enable more remote tests; cleanup
info, images, run, networking tests: remove some skip_if_remote()s
that were added in the varlink days. All of these tests now seem
to work with APIv2.

help test: check that first output line from 'podman --help'
is the program description (regression check for #7273).

load test: clean up stray images, rewrite test to make it conform
to existing convention. In the process, discover and file #7337.

exec test (and networking): file #7360, and add FIXME comment
to skip()s suggesting evaluating those tests once that is fixed.

pod test: now that #6328 is fixed, use 'podman pod inspect --format'
instead of relying on jq

Various other tests: add an explanation of why test is disabled
so we can more easily distinguish "this will never be meaningful
under remote" vs "hey, doesn't work for now, but maybe someday".

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-19 08:12:14 -06:00
Ed Santiago 18f36d8cf6 Re-disable sdnotify tests to try to fix CI
Some CI tests are hanging, timing out in 60 or 120 minutes.
I wonder if it's #7316, the bug where all podman commands
hang forever if NOTIFY_SOCKET is set?

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-18 07:21:47 -06:00
Ed Santiago 60ab5f3ae6 system tests: enable sdnotify tests
Oops. PR #6693 (sdnotify) added tests, but they were disabled
due to broken crun on f31. I tried for three weeks to get a
magic CI:IMG PR to update crun on the CI VMs ... but in that
time I forgot to actually enable those new tests.

This PR removes a 'skip', replacing it with a check that systemd
is running plus one more to make sure our runtime is crun. It
looks like sdnotify just doesn't work on Ubuntu (it hangs), and
my guess is that it's a crun/runc issue.

I also changed the test image from fedora:latest to :31, because,
sigh, fedora:latest removed the systemd-notify tool.

WARNING WARNING WARNING: the symptom of a missing systemd-notify
is that podman will hang forever, not even stopped by the timeout
command in podman_run! (Filed: #7316). This means that if the
sdnotify-in-container test ever fails, the symptom will be that
Cirrus itself will time out (2 hours?). This is horrible. I
don't know what to do about it other than push for a fix for 7316.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-13 19:16:25 -06:00
Ed Santiago 10ad46eb73 BATS system tests for new sdnotify
Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-07-06 17:47:22 +00:00