We cannot get first the volume lock and the container locks. Other code
paths always have to first lock the container and the lock the volumes,
i.e. to mount/umount them. As such locking the volume fust can always
result in ABBA deadlocks.
To fix this move the lock down after the container removal. The removal
code is racy regardless of the lock as the volume lcok on create is no
longer taken since commit 3cc9db8626 due another deadlock there.
Fixes#23613
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Now that on-failure exits right away the test is racy as the
RestartCount is not at the value we expect as the container is still
restarting in the background. As such add a timer based approach.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Init containers are meant to exit early before other containers are
started. Thus stopping the infra container in such case is wrong.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The current code did several complicated state checks that simply do not
work properly on a fast restarting container. It uses a special case for
--restart=always but forgot to take care of --restart=on-failure which
always hang for 20s until it run into the timeout.
The old logic also used to call CheckConmonRunning() but synced the
state before which means it may check a new conmon every time and thus
misses exits.
To fix the new the code is much simpler. Check the conmon pid, if it is
no longer running then get then check exit file and get exit code.
This is related to #23473 but I am not sure if this fixes it because we
cannot reproduce.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
CI will fail if quay is down, but a build-time check does not
help us in any way. It just introduces another pain point
where we have to hit the Rerun button.
Signed-off-by: Ed Santiago <santiago@redhat.com>
By enabling UserKnownHostsFile=/dev/null, and CheckHostIP=no
options to the defaults we prevent the user from adding the host key
multiple times and from flakes that can raise Remote Host Id change.
Resolves: https://github.com/containers/podman/issues/23505
Signed-off-by: Nicola Sella <nsella@redhat.com>
When will I learn not to dismiss something as "easy"?
Anyhow, this doesn't actually change anything parallel-wise
but it does reduce a race condition seen on heavily-loaded
slow systems, wherein a container goes into unhealthy before
we want it to. This version isn't perfect; I don't think
there's an ideal fix for this.
Signed-off-by: Ed Santiago <santiago@redhat.com>
Read stderr from ssh-keygen before calling wait(), since cmd.Wait() closes cmd.StderrPipe() after it exits, causing a read-on-closed-pipe error.
Signed-off-by: Ashley Cui <acui@redhat.com>
Only one test can be parallelized. Do so, and add a comment
to the other one explaining why it can't be.
Also, add some missing error-message checks.
Signed-off-by: Ed Santiago <santiago@redhat.com>
Very few changes needed, all of them simple.
It is impossible to parallelize this entire file, because "stop -a".
Add tags to tests that can be parallelized, and comments to those
that can't.
Signed-off-by: Ed Santiago <santiago@redhat.com>
When the client gets a 404 back we know the container does not exists,
if ignore is set as well we should just ignore the error client side.
seen in #23554
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When the cidfile does not exists and ignore is set the cli parser skips
the file without error and we call into the backend code without any
names at all. This should logically be a NOP but on remote it caused all
containers to be returned which caused podman stop to stop everything in
this case.
Fixes#23554
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The previous comment included way too many details. It also referenced
a docker-hub container image which is not accessible under all
circumstances. Switch to the GitHub container registry and include
mention of the pre-commit hook that's available.
Signed-off-by: Chris Evich <cevich@redhat.com>
The main change is a global "packageRules" config that encompasses all
rules instead of configuring them as options to a manager.
Signed-off-by: Chris Evich <cevich@redhat.com>
In podman-systemd we are intersecting the worlds of containers
and systemd, and I had to stop and think to understand what
`Exec=` does.
I tried to clarify things more here.
I found it especially confusing because the example at the
very top of the file does:
```
Image=quay.io/fedora/fedora
Exec=sleep 10
```
But that only makes sense because the fedora base image
(being generic) doesn't define an `ENTRYPOINT`, just a `CMD`.
But IMO by far the most common usage for podman-systemd
is "app images" which conventionally should use `ENTRYPOINT`
in general. Maybe we should change the default example,
but I'm leaving that for a later followup.
(It perhaps would have been less confusing if this field
had been called `Args=` to make clear it's quite different
in practice from systemd `ExecStart=`)
Signed-off-by: Colin Walters <walters@verbum.org>
If we manage to init/start a container successfully we should unset any
previously stored state errors. Otherwise a user might be confused why
there is an error in the state about some old error even though the
container works/runs.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Do not rely on an arbitrary delay in order to ensure the port was bound
in the container. Instead this approach checks if the port is bound in
the netns and only then starts the client. This speeds up the entire
test file by 50% but more importantly in parallel testing it solves
hangs as the timeout there was unreliable.
Fixes#23471
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
If we try to stop a contianer that is not running or paused we get an
ErrCtrStateInvalid or ErrCtrStopped error. As podman stop is idempotent
this is not a user visable error at all so we should also never log it
in the container state.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We cannot unlock then lock again without syncing the state as this will
then save a potentially old state causing very bad things, such as
double netns cleanup issues.
The fix here is simple move the saveContainerError() under the same
lock. The comment about the re-lock is just wrong. Not doing this under
the same lock would cause us to update the error after something else
changed the container alreayd.
Most likely this was caused by a misunderstanding on how go defer's work.
Given they run Last In - First Out (LIFO) it is safe as long as out
defer function is after the defer unlock() call.
I think this issue is very bad and might have caused a variety of other
weird flakes. As fact I am confident that this fixes the double cleanup
errors.
Fixes#21569
Also fixes the netns removal ENOENT issues seen in #19721.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>