The previous implementation expected the rlimits to be set for the
entire process and clamped the values only when running as rootless.
Change the implementation to always specify the expected values in the
OCI spec file and do the clamping only when running as rootless and
using the default values.
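For illustration, a minimal sketch of the clamping idea when running
rootless (the helper name and exact policy are assumptions, not the
actual podman code):

    // clampRlimit lowers a requested rlimit to what an unprivileged
    // process is actually allowed to set.
    func clampRlimit(resource int, want syscall.Rlimit) syscall.Rlimit {
        var cur syscall.Rlimit
        if err := syscall.Getrlimit(resource, &cur); err != nil {
            return want // if we cannot query, keep the requested value
        }
        if want.Cur > cur.Max {
            want.Cur = cur.Max
        }
        if want.Max > cur.Max {
            want.Max = cur.Max
        }
        return want
    }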
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In the common scenario of podman-remote run --rm the API is required to
attach + start + wait to get the exit code. This has the problem that
the wait call races against the container removal from the cleanup
process, so it may not get the exit code back. However, we keep the
exit code around for longer than the container, so we can just look it
up in the endpoint. Of course this only works when we get a full ID as
the parameter, but podman-remote will do that.
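Roughly, the endpoint can fall back like this (the helper name is
hypothetical, not the actual handler code):

    // The stored exit code outlives the container, so if it was already
    // removed by the cleanup process, look the code up by the full ID.
    code, err := ctr.Wait(ctx)
    if err != nil && errors.Is(err, define.ErrNoSuchCtr) {
        code, err = lookupStoredExitCode(fullID) // hypothetical helper
    }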
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Undoing some of my own work here from #24090 now that we have the
ExposedPorts field implemented in inspect. I considered a revert of
that patch, but it's still needed as without it we'd be including
exposed ports when --net=container is used, which is not correct.
Basically, exposed ports for a container should always go in the new
ExposedPorts field we added. They sometimes also go in the Ports field
in NetworkSettings, but only when the container is not net=host and not
net=container. We were always including exposed ports there, which was
not correct, but it is an easy logical fix.
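A condensed sketch of the corrected logic (names approximate the
inspect code, not copied from it):

    // Exposed ports always land in the dedicated ExposedPorts field...
    inspect.Config.ExposedPorts = exposedPorts(ctr)
    // ...but only appear in NetworkSettings.Ports when the container has
    // its own network namespace (not net=host, not net=container).
    if !hostNet(ctr) && ctr.config.NetNsCtr == "" {
        addExposed(inspect.NetworkSettings.Ports, exposedPorts(ctr))
    }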
Also required is a test change to correct the expected behavior, as we
were testing for incorrect behavior.
Fixes https://issues.redhat.com/browse/RHEL-60382
Signed-off-by: Matt Heon <mheon@redhat.com>
The kernel checks that both the uid and the gid are mapped inside the
user namespace, not only the uid:
/**
 * privileged_wrt_inode_uidgid - Do capabilities in the namespace work over the inode?
 * @ns: The user namespace in question
 * @idmap: idmap of the mount @inode was found from
 * @inode: The inode in question
 *
 * Return true if the inode uid and gid are within the namespace.
 */
bool privileged_wrt_inode_uidgid(struct user_namespace *ns,
                                 struct mnt_idmap *idmap,
                                 const struct inode *inode)
{
        return vfsuid_has_mapping(ns, i_uid_into_vfsuid(idmap, inode)) &&
               vfsgid_has_mapping(ns, i_gid_into_vfsgid(idmap, inode));
}
For this reason, improve the check for hasCurrentUserMapped to verify
that the gid is also mapped, and if it is not, use an intermediate
mount for the container rootfs.
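A minimal sketch of such a check in Go (parsing is simplified; the
helper name is an assumption):

    // gidMapped reports whether the given host gid is mapped inside the
    // current user namespace, by scanning /proc/self/gid_map.
    func gidMapped(gid uint64) (bool, error) {
        data, err := os.ReadFile("/proc/self/gid_map")
        if err != nil {
            return false, err
        }
        for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
            var inside, outside, count uint64
            if _, err := fmt.Sscanf(line, "%d %d %d", &inside, &outside, &count); err != nil {
                return false, err
            }
            if gid >= outside && gid < outside+count {
                return true, nil
            }
        }
        return false, nil
    }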
Closes: https://github.com/containers/podman/issues/24159
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In this code, g.HostSpecific is _always_ false, as it is never set by
generate.New and is thus left at the default value (false).
Remove dead code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A field we missed versus Docker. Matches the format of our
existing Ports list in the NetworkConfig, but only includes
exposed ports (and maps these to struct{}, as they never go to
real ports on the host).
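The shape of the new field, sketched (the json tag is an assumption;
the key format matches the existing Ports list):

    // ExposedPorts maps "port/protocol" keys (e.g. "80/tcp") to an empty
    // struct, since exposed ports never bind to real ports on the host.
    ExposedPorts map[string]struct{} `json:"ExposedPorts,omitempty"`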
Fixes https://issues.redhat.com/browse/RHEL-60382
Signed-off-by: Matt Heon <mheon@redhat.com>
Previously, we didn't bother including exposed ports in the
container config when creating a container with --net=host. Per
Docker this isn't really correct; host-net containers are still
considered to have exposed ports, even though that specific
container can be guaranteed to never use them.
We could just fix this for host containers, but we might as well
make it generic. This patch unconditionally adds exposed ports to
the container config - it was previously conditional on a network
namespace being configured. The behavior of `podman inspect` with
exposed ports when using `--net=container:` has also been
corrected. Previously, we used exposed ports from the container
sharing its network namespace, which was not correct. Now, we use
regular port bindings from the namespace container, but exposed
ports from our own container.
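Sketched, the corrected source selection looks roughly like this (names
are approximations of the libpod code):

    netCtr := ctr
    if ctr.config.NetNsCtr != "" {
        netCtr, _ = ctr.runtime.GetContainer(ctr.config.NetNsCtr)
    }
    // Port bindings come from whichever container owns the netns...
    inspect.NetworkSettings.Ports = portBindings(netCtr)
    // ...but exposed ports always come from the inspected container itself.
    inspect.Config.ExposedPorts = exposedPorts(ctr)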
Fixes https://issues.redhat.com/browse/RHEL-60382
Signed-off-by: Matt Heon <mheon@redhat.com>
As shown in #23671, these functions can return the raw error without
any useful context to the user, which makes it hard to understand where
things went wrong. Simply add some context to some error paths here.
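The fix is the usual Go error-wrapping idiom; for example (the message
text here is illustrative):

    if err := unix.Unmount(path, 0); err != nil {
        return fmt.Errorf("unmounting netns path %q: %w", path, err)
    }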
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Currently podman run -d can exit 0 if we send SIGTERM during startup
even though the container was never started. That just doesn't make any
sense and is horribly confusing for an external job manager like
systemd.
The original motivation was to exit 0 for the podman.service in commit
ca7376bb11. That does make sense, but it should only do so for the
service and only if the server did indeed shut down gracefully.
So we rework how the exit logic works: do not let the handler perform
the exit. Instead, the shutdown package does the exit after all
handlers are run, which solves the issue of ordering. Then we default
to exit code 1 like we did before and allow the service exit handler to
override it with exit code 0 in case of a graceful shutdown.
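In outline (a sketch of the new ordering, not the actual shutdown
package code):

    // Handlers no longer call os.Exit themselves; they only record intent.
    exitCode := 1 // default: a signal-triggered exit is a failure
    for _, handler := range handlers {
        handler(sig) // the service handler may set exitCode = 0 on graceful shutdown
    }
    os.Exit(exitCode) // performed once, after all handlers have run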
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When we are killed during netns setup it will leak the netns path, as
it was not committed in the db. This is rather common if you run
systemctl stop on a podman systemd unit. Of course we cannot protect
against SIGKILL, but in the systemd case we get SIGTERM and we really
should not exit in a critical section like this.
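A standalone approximation of deferring the signal-driven exit across a
critical section (podman's shutdown package provides helpers for this;
the names below are generic):

    var inhibit sync.RWMutex

    // Signal handler: take the write lock so the exit waits for any
    // in-flight critical section.
    //   inhibit.Lock(); defer inhibit.Unlock(); os.Exit(code)

    // Critical netns setup:
    inhibit.RLock()
    defer inhibit.RUnlock()
    // ... create the netns and commit the path to the db ...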
Fixes #24044
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
These flags can affect the output of the HealthCheck log. Currently,
when a container is configured with a HealthCheck, the output from the
HealthCheck command is only logged to the container status file, which
is accessible via `podman inspect`.
It is also limited to the last five executions and the first 500
characters per execution.
This makes debugging past problems very difficult, since the only
information available about the failure of the HealthCheck command is
the generic `healthcheck service failed` record.
- The `--health-log-destination` flag sets the destination of the
  HealthCheck log.
  - `none`: (default behavior) `HealthCheckResults` are stored with the
    container state (for example: `$runroot/healthcheck.log`).
  - `directory`: creates a log file named `<container-ID>-healthcheck.log`
    with JSON `HealthCheckResults` in the specified directory.
  - `events_logger`: the log will be written using the logging mechanism
    set by `events_logger`. It also saves the log to a default directory,
    for performance on systems with a large number of logs.
- The `--health-max-log-count` flag sets the maximum number of attempts
  in the HealthCheck log file.
  - A value of `0` indicates an infinite number of attempts in the log file.
  - The default value is `5` attempts in the log file.
- The `--health-max-log-size` flag sets the maximum length of the stored log.
  - A value of `0` indicates an infinite log length.
  - The default value is `500` characters.
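For example (the flag values here are illustrative):

    podman run -d --health-cmd 'curl -f http://localhost:8080/healthz' \
        --health-log-destination /var/log/healthchecks \
        --health-max-log-count 10 \
        --health-max-log-size 1024 \
        my-image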
Add --health-max-log-count flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Add --health-max-log-size flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Add --health-log-destination flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
The netns dir has special logic to bind mount itself and make itself
shared. This code here didn't do that, which led to a catastrophic bug
during netns unmounting: we were unable to unmount the netns, as the
mount got duplicated and had the wrong parent mount. This caused us to
loop forever trying to remove the file.
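For reference, making a directory a shared bind mount of itself with
golang.org/x/sys/unix (a standalone sketch of the technique, not the
patch itself):

    // Bind-mount the netns dir onto itself so it becomes a mount point...
    if err := unix.Mount(netnsDir, netnsDir, "none", unix.MS_BIND, ""); err != nil {
        return err
    }
    // ...then mark it shared so mounts below it propagate as expected.
    if err := unix.Mount("", netnsDir, "none", unix.MS_SHARED, ""); err != nil {
        return err
    }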
Fixes https://issues.redhat.com/browse/RHEL-59620
Fixes #23685
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
As it turns out, things are not so simple after all...
In podman-py it was reported[1] that waiting might hang. Per our docs,
wait on multiple conditions should exit once the first one is hit, not
once all of them are. However, because the new wait logic never checked
whether the context was cancelled, the goroutine kept running until
conmon exited, and because we used a waitgroup to wait for all of them
to finish, it blocked until that happened.
First, we can remove the waitgroup, as we only need to wait for one of
them anyway via the channel. While this alone fixes the hang, it would
still leak the other goroutine. As there is no way to cancel a
goroutine, all the code must check for a cancelled context in the wait
loop to not leak.
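The standard shape of a cancellable wait loop in Go (pollInterval and
checkCondition stand in for the real logic):

    for {
        select {
        case <-ctx.Done():
            return ctx.Err() // caller cancelled: stop instead of leaking
        case <-time.After(pollInterval):
            if done, err := checkCondition(); done || err != nil {
                return err
            }
        }
    }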
Fixes 8a943311db ("libpod: simplify WaitForExit()")
[1] https://github.com/containers/podman-py/issues/425
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Convert the owner UID and GID into the user namespace only when the
":idmap" mount option is used.
This changes the behaviour of :idmap with an empty volume. Now the
existing directory ownership is copied up, as in the other case.
Closes: https://github.com/containers/podman/issues/23347
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
pasta added a new --map-guest-addr option that maps the given address
to the actual host IP. This is exactly what we need for the
host.containers.internal entry. So we now make use of this option by
default, but we still have to keep the exclude fallback because the
option is very new and some users/distros will not have it yet.
This also fixes an issue where the --dns-forward IPs were not used when
using the bridge network mode. This is only relevant when not using
aardvark-dns, as aardvark-dns already used the proper IPs from the
rootless netns resolv.conf file.
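Roughly, the argument construction becomes (the feature probe is
hypothetical; podman detects support differently):

    if pastaSupports("--map-guest-addr") { // hypothetical probe
        args = append(args, "--map-guest-addr", hostInternalIP)
    } else {
        args = append(args, legacyExcludeArgs...) // keep the old exclude fallback
    }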
Fixes #19213
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The kube generate command can now generate a YAML for the Job kind, and
the kube play command can create a pod and containers with podman when
passed a Job YAML. Add relevant tests and docs for this.
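For example (assuming the Job kind is selected via the existing --type
flag):

    podman kube generate --type job my-container > job.yaml
    podman kube play job.yaml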
Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>
Many dependencies started using go 1.22, which means we have to follow
in order to update.
Disable the now deprecated exportloopref linter, as it was replaced by
copyloopvar since go fixed the loop variable copy problem in 1.22[1].
Another new change in go 1.22 is the for loop syntax over ints; the
intrange linter checks for this, but there are a lot of loops that
would have to be converted, so I didn't do that here and disabled the
linter for now. The old syntax is still fine.
[1] https://go.dev/blog/loopvar-preview
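For reference, the go 1.22 form the intrange linter suggests:

    // Old style, still valid:
    for i := 0; i < n; i++ {
        fmt.Println(i)
    }
    // New go 1.22 range-over-int style:
    for i := range n {
        fmt.Println(i)
    }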
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This changes the value emitted for HostConfig.Devices from 'null' to
'[]'. The 'null' value was preventing ansible's podman module from
working correctly on FreeBSD.
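This is the usual Go nil-versus-empty-slice JSON distinction:

    var devices []string          // nil slice
    b, _ := json.Marshal(devices) // -> null
    devices = []string{}          // empty, non-nil slice
    b, _ = json.Marshal(devices)  // -> []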
Signed-off-by: Doug Rabson <dfr@rabson.org>
This looks like a case of boilerplate error handling making it
too easy to miss a legitimately ignored error, which is annoying.
In a more featureful language most of the SQL code here could be
macros (at the very least, Rust would have forced us to handle
all error cases, not just the one seen here).
Found while looking through the Libpod DB code, no actual bug I
can think of associated with this.
Signed-off-by: Matt Heon <mheon@redhat.com>
The podman container cleanup process runs asynchronously, and by the
time it gets the lock it is possible that another podman process
already did the cleanup and then did a new init() to start the
container again. If the cleanup process gets the lock at that point it
will do very weird things.
This can be observed in the remote start API as CI flakes.
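The remedy is to re-check the state after taking the lock (a sketch of
the pattern, not the exact libpod code):

    c.lock.Lock()
    defer c.lock.Unlock()
    if err := c.syncContainer(); err != nil {
        return err
    }
    // Another process may have already cleaned up and re-initialized the
    // container while we waited for the lock; only proceed if it is still
    // in a stopped/exited state.
    if c.state.State != define.ContainerStateStopped &&
        c.state.State != define.ContainerStateExited {
        return nil
    }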
Fixes #23754
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This is very similar to commit 3280da0500: we cannot check the state,
then unlock, only to lock again and perform the action. Everything must
happen under one lock. To fix this, move the code into the HTTPAttach
function in libpod. The locking here is a bit weird, because attach
blocks for the lifetime of the attach session, which can be very long,
so we must unlock before performing the attach.
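In outline (a sketch of the locking shape, not the actual function):

    c.lock.Lock()
    if err := c.syncContainer(); err != nil {
        c.lock.Unlock()
        return err
    }
    // validate the container state while still holding the lock ...
    c.lock.Unlock()
    // ... then attach unlocked, since attach blocks for the whole session.
    return c.attach(streams, detachKeys, resize)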
Fixes #23757
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
There seems to be one change[1] which breaks our tests: the route Dst
field is no longer nil for a default route, but rather the empty ipnet,
i.e. 0.0.0.0/0 and ::/0, so fix that up in our code.
[1] https://github.com/vishvananda/netlink/pull/852
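A tolerant default-route check that accepts both representations (using
the vishvananda/netlink Route type):

    // isDefaultRoute matches both the old (nil Dst) and new (0.0.0.0/0 or
    // ::/0) representations of a default route.
    func isDefaultRoute(r *netlink.Route) bool {
        if r.Dst == nil {
            return true
        }
        ones, _ := r.Dst.Mask.Size()
        return ones == 0 && r.Dst.IP.IsUnspecified()
    }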
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This started off as an attempt to make `podman stop` on a
container started with `--rm` actually remove the container,
instead of just cleaning it up and waiting for the cleanup
process to finish the removal.
In the process, I realized that `podman run --rmi` was rather
broken. It was only done as part of the Podman CLI, not the
cleanup process (meaning it only worked with attached containers)
and the way it was wired meant that I was fairly confident that
it wouldn't work if I did a `podman stop` on an attached
container run with `--rmi`. I rewired it to use the same
mechanism that `podman run --rm` uses, so it should be a lot more
durable now, and I also wired it into `podman inspect` so you can
tell that a container will remove its image.
Tests have been added for the changes to `podman run --rmi`. No
tests for `stop` on a `run --rm` container as that would be racy.
Fixes #22852
Fixes RHEL-39513
Signed-off-by: Matt Heon <mheon@redhat.com>
Now that we have proper !remote tags set everywhere, we can just rely
on that and do not need to skip any dirs.
Also, on linux do not lint three times; one remote run is enough.
We still have to skip the test dir for windows/macos, though, or we
would need to add linux build tags everywhere there as well. This seems
simpler.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
By default wait only waits for the exit of a container; there is really
no way to make it wait for the removal too when the container was
created with --rm. I thought I had found a clever way in 8a943311db,
but it does not work race free. While it works most of the time, any
other parallel process might call syncContainer() before the cleanup
process takes the lock and removes the container. As such, the wait
hack to only update the state and not sync the exit file did not work,
so we can drop that.
However, the test wants to wait for the removal to happen by the
cleanup process, and we can already say --condition=removing to do
this, but that will throw an error if the ctr was removed instead of
counting it as success, so fix that as well.
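Usage, for reference:

    podman wait --condition removing my-container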
Fixes #23640
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
There are two major problems with UpdateContainerStatus().
First, it can deadlock when the state JSON is too big, as it tries to
read stderr until EOF, but it will never hit EOF as long as the runtime
process is alive. This means if the runtime JSON is too big to fit into
the pipe buffer, we deadlock ourselves.
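Schematically, this is the classic pipe-ordering pitfall (simplified):

    stdout, _ := cmd.StdoutPipe()
    stderr, _ := cmd.StderrPipe()
    _ = cmd.Start()
    // BUG: stderr only reaches EOF once the runtime exits, but the runtime
    // is itself blocked writing the state JSON as soon as the stdout pipe
    // buffer fills - neither side can make progress.
    errOut, _ := io.ReadAll(stderr)
    out, _ := io.ReadAll(stdout)
    _ = cmd.Wait()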
Second, the function modifies the container state struct and even adds
an exit code to the db; however, when it is called from the stop() code
path, we are unlocked here.
While the first problem is easy to fix, the second one is not so much.
And when we cannot update the state, there is no point in reading it
from the runtime in the first place, so remove the function, as it does
more harm than good.
Also add some warnings to the functions that might be called unlocked.
Fixes #22246
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We cannot take the volume lock first and then the container locks.
Other code paths always have to lock the container first and then lock
the volumes, i.e. to mount/unmount them. As such, locking the volume
first can always result in ABBA deadlocks.
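The ABBA pattern in brief:

    // Process A: ctr.lock.Lock() then vol.lock.Lock()
    // Process B: vol.lock.Lock() then ctr.lock.Lock()
    // If A holds ctr and B holds vol, each blocks on the other forever.
    // The fix: every path must take the container lock before the volume lock.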
To fix this, move the lock down after the container removal. The
removal code is racy regardless of the lock, as the volume lock on
create is no longer taken since commit 3cc9db8626, due to another
deadlock there.
Fixes #23613
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Init containers are meant to exit early, before other containers are
started. Thus stopping the infra container in such a case is wrong.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The current code did several complicated state checks that simply do
not work properly on a fast restarting container. It used a special
case for --restart=always but forgot to take care of
--restart=on-failure, which always hung for 20s until it ran into the
timeout.
The old logic also used to call CheckConmonRunning() but synced the
state beforehand, which means it may check a new conmon every time and
thus miss exits.
The new code is much simpler: check the conmon pid, and if it is no
longer running, check the exit file and get the exit code from there.
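Checking whether a pid is still alive uses the usual signal-0 trick (a
standalone sketch):

    // pidAlive reports whether the process still exists, by sending the
    // no-op signal 0 and checking for ESRCH ("no such process").
    func pidAlive(pid int) bool {
        err := unix.Kill(pid, 0)
        return err == nil || err == unix.EPERM
    }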
This is related to #23473, but I am not sure if this fixes it because
we cannot reproduce it.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>