make sure the user overrides are stored in the configuration file when
first created.
Closes: https://github.com/containers/libpod/issues/2659
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Previously, when `podman run` encountered a volume mount without
separate source and destination (e.g. `-v /run`) we would assume
that both were the same - a bind mount of `/run` on the host to
`/run` in the container. However, this does not match Docker's
behavior - in Docker, this makes an anonymous named volume that
will be mounted at `/run`.
We already have (more limited) support for these anonymous
volumes in the form of image volumes. Extend this support to
allow it to be used with user-created volumes coming in from the
`-v` flag.
This change also affects how named volumes created by the
container but given names are treated by `podman run --rm` and
`podman rm -v`. Previously, they would be removed with the
container in these cases, but this did not match Docker's
behaviour. Docker only removed anonymous volumes. With this patch
we move to that model as well; `podman run -v testvol:/test` will
now have `testvol` survive the container being removed by `podman
rm -v`.
The sum total of these changes lets us turn on volume removal in
`--rm` by default.
Fixes: #4276
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When a container is created with a given OCI runtime, but then it
is uninstalled or removed from the configuration file, Libpod
presently reacts very poorly. The EvictContainer code can
potentially remove these containers, but we still can't see them
in `podman ps` (aside from the massive logrus.Errorf messages
they create).
Providing a minimal OCI runtime implementation for missing
runtimes allows us to behave better. We'll be able to retrieve
containers from the database, though we still pop up an error for
each missing runtime. For containers which are stopped, we can
remove them as normal.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Network statistics cannot be collected for rootless network devices with
the current implementation. For now, we return nil so that stats will
at least work for users.
Fixes: #4268
Signed-off-by: baude <bbaude@redhat.com>
The json field is called `Image` while the go field is called `ImageID`,
tricking users into filtering for `Image` which ultimately results in an
error. Hence, rename the field to `Image` to align json and go.
To prevent podman users from regressing, rename `Image` to `ImageID` in
the specified filters. Add tests to prevent us from regressing. Note
that consumers of the go API that are using `ImageID` are regressing;
ultimately we consider it to be a bug fix.
Fixes: #4193
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Also, ensure that we don't try to mount them without root - it
appears that it can somehow not error and report that mount was
successful when it clearly did not succeed, which can induce this
case.
We reuse the `--force` flag to indicate that a volume should be
removed even after unmount errors. It seems fairly natural to
expect that --force will remove a volume that is otherwise
presenting problems.
Finally, ignore EINVAL on unmount - if the mount point no longer
exists our job is done.
Fixes: #4247
Fixes: #4248
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
In some cases, conmon can fail without writing logs. Change the wording
of the error message from
"error reading container (probably exited) json message"
to
"container create failed (no logs from conmon)"
to have a more helpful error message that is more consistent with other
errors at that stage of execution.
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Previously, `podman container restore` with exported containers,
when told to create a new container based on the exported
checkpoint, would create a new container, with a new container
ID, but not reset CGroup path - which contained the ID of the
original container.
If this was done multiple times, the result was two containers
with the same cgroup paths. Operations on these containers would
thus have a chance of crossing over to affect the other one; the
most notable was `podman rm` once it was changed to use the --all
flag when stopping the container; all processes in the cgroup,
including the ones in the other container, would be stopped.
Reset cgroups on restore to ensure that the path matches the ID
of the container actually being run.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This is a horrible hack to work around issues with Fedora 31, but
other distros might need it too, so we'll move it upstream.
I do not recommend this functionality for general use, and the
manpages and other documentation will reflect this. But for some
upgrade cases, it will be the only thing that allows for a
working system.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
For future work, we need multiple implementations of the OCI
runtime, not just a Conmon-wrapped runtime matching the runc CLI.
As part of this, do some refactoring on the interface for exec
(move to a struct, not a massive list of arguments). Also, add
'all' support to Kill and Stop (supported by runc and used a bit
internally for removing containers).
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When runc returns an error about not being v2 compliant, catch the error
and log an actionable message for users.
Signed-off-by: baude <bbaude@redhat.com>
This requires updating all import paths throughout, and a matching
buildah update to interoperate.
I can't figure out the reason for go.mod tracking
github.com/containers/image v3.0.2+incompatible // indirect
((go mod graph) lists it as a direct dependency of libpod, but
(go list -json -m all) lists it as an indirect dependency),
but at least looking at the vendor subdirectory, it doesn't seem
to be actually used in the built binaries.
Signed-off-by: Miloslav Trmač <mitr@redhat.com>
This ensures that containers that didn't require an evict will be
dealt with normally, and we only break out evict for containers
that refuse to be removed by normal means.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
fixes a segfault when slirp4netns is not installed and the slirp sync
pipe is not created.
Closes: https://github.com/containers/libpod/issues/4113
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
A true result from reexec.Init() isn't an error, but it indicates that
main() should exit with a success exit status.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
Add ability to evict a container when it becomes unusable. This may
happen when the host setup changes after a container creation, making it
impossible for that container to be used or removed.
Evicting a container is done using the `rm --force` command.
Signed-off-by: Marco Vedovati <mvedovati@suse.com>
CNI expects that a DELETE be run before re-creating container
networks. If a reboot occurs quickly enough that containers can't
stop and clean up, that DELETE never happens, and Podman
currently wipes the old network info and thinks the state has
been entirely cleared. Unfortunately, that may not be the case on
the CNI side. Some things - like IP address reservations - may
not have been cleared.
To solve this, manually re-run CNI Delete on refresh. If the
container has already been deleted this seems harmless. If not,
it should clear lingering state.
Fixes: #3759
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
In order to run Podman with VM-based runtimes unprivileged, the
network must be set up prior to the container creation. Therefore
this commit modifies Podman to run rootless containers by:
1. create a network namespace
2. pass the netns persistent mount path to the slirp4netns
to create the tap interface
3. pass the netns path to the OCI spec, so the runtime can
enter the netns
Closes: #2897
Signed-off-by: Gabi Beyer <gabrielle.n.beyer@intel.com>
If a user upgrades to a system that defaults to cgroups V2
and has a libpod.conf file in their homedir that defaults to OCI Runtime runc,
then we want to change it one time to crun.
runc as of this point does not work on cgroupV2 systems. This patch will
eventually be removed but is needed until runc has support.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
If the HOME environment variable is not set, make sure it is set to
the configuration found in the container /etc/passwd file.
It was previously depending on a runc behavior that always set HOME
when it is not set. The OCI runtime specifications do not require
HOME to be set so move the logic to libpod.
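A minimal sketch of the idea, not the libpod implementation: if HOME is absent from the environment, take the home directory field of the matching user from passwd-formatted content. The function names `homeFromPasswd` and `ensureHome`, and the sample passwd data, are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// homeFromPasswd returns the home directory of uid from passwd-formatted
// content (name:password:UID:GID:GECOS:home:shell).
func homeFromPasswd(passwd string, uid int) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(passwd))
	for sc.Scan() {
		fields := strings.Split(sc.Text(), ":")
		if len(fields) < 7 {
			continue // skip malformed or empty lines
		}
		if id, err := strconv.Atoi(fields[2]); err == nil && id == uid {
			return fields[5], true
		}
	}
	return "", false
}

// ensureHome appends HOME to env only when it is not already set.
func ensureHome(env []string, passwd string, uid int) []string {
	for _, kv := range env {
		if strings.HasPrefix(kv, "HOME=") {
			return env
		}
	}
	if home, ok := homeFromPasswd(passwd, uid); ok {
		return append(env, "HOME="+home)
	}
	return env
}

func main() {
	// Sample data for illustration only.
	passwd := "root:x:0:0:root:/root:/bin/sh\napp:x:1000:1000::/home/app:/bin/sh\n"
	fmt.Println(ensureHome([]string{"PATH=/usr/bin"}, passwd, 1000))
	fmt.Println(ensureHome([]string{"HOME=/custom"}, passwd, 1000))
}
```

An explicitly provided HOME is left untouched; only the unset case falls back to the passwd entry.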
Closes: https://github.com/debarshiray/toolbox/issues/266
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We've been seeing a lot of issues (ref: #4061, but there are
others) where Podman hiccups on trying to start a container,
because some temporary files have been retained and Conmon will
not overwrite them.
If we're calling start() we can safely assume that we really want
those files gone so the container starts without error, so invoke
the cleanup routine. It's relatively cheap (four file removes) so
it shouldn't hurt us that much.
Also contains a small simplification to the removeConmonFiles
logic - we don't need to stat-then-remove when ignoring ENOENT is
fine.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
There were two problems with preserve fds.
libpod didn't open the fds before passing _OCI_*PIPE to conmon. This caused libpod to talk on the preserved fds, rather than the pipes, with conmon talking on the pipes. This caused a hang.
Libpod also didn't convert an int to string correctly, so it would further fail.
Fix these and add a unit test to make sure we don't regress in the future.
Note: this test will not pass on crun until crun supports --preserve-fds.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
if slirp4netns supports sandboxing, enable it.
It automatically creates a new mount namespace where slirp4netns will
run and have limited access to the host resources.
It needs slirp4netns 0.4.1.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We should not be throwing errors because the operation we wanted
to perform is already done. Now, it is definitely strange that a
container is actually unmounted, but shows as mounted in the DB -
if this reoccurs in a way where we can investigate, it's worth
tearing into.
Fixes: #4033
Signed-off-by: Matthew Heon <mheon@redhat.com>
If you are running a rootless container on cgroupV1
you can not pause the container. We need to report the proper error
if this happens.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This change matches what is happening on the podman local side
and should eliminate a race condition.
Also exit commands on the server side should start to return to client.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
We have leaked the exit code numbers all over the code; this patch
replaces the numbers with constants.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
If we register the resize func too early, it attempts to read from the 'ctl' file before it exists. This causes the func to error, and the resize to not go through.
Fix this by registering the resize func later for conmon. This, along with a conmon fix, will allow exec to know the terminal size at startup.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
When executing a healthcheck, we were not cleaning up after exec's use
of a socket. We now remove the socket file, and ignore it if for some
reason it does not exist.
Fixes: #3962
Signed-off-by: baude <bbaude@redhat.com>
When --cgroupns=private is used we need to mount a new cgroup file
system so that it points to the correct namespace.
Needs: https://github.com/containers/crun/pull/88
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
When running in rootless mode and using systemd as cgroup manager,
automatically create a systemd scope when the user doesn't own the
current cgroup.
This solves a couple of issues:
On cgroup v2, a process must be in a directory owned by the
unprivileged user before it can be moved to a different cgroup tree.
This is not always true, e.g. when creating a session with su
-l.
Closes: https://github.com/containers/libpod/issues/3937
Also, for running systemd in a container it was previously necessary to
specify "systemd-run --scope --user podman ..."; now this is done
automatically as part of this PR.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This will be used when we allow 'podman ps' to display info on
storage containers instead of Libpod containers.
Signed-off-by: Matthew Heon <mheon@redhat.com>
Lookup was written before volume states merged, but merged after,
and CI didn't catch the obvious failure here. Without a valid
state, we try to unmarshall into a null pointer, and 'volume rm'
is completely broken because of it.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Podman is not the only user of containers/storage, and as such we
cannot rely on our database as the sole source of truth when
pruning images. If images do not show as in use from Podman's
perspective, but subsequently fail to remove because they are
being used by a container, they're probably being used by Buildah
or another c/storage client.
Since the images in question are in use, we shouldn't error on
failure to prune them - we weren't supposed to prune them in the
first place.
Fixes: #3983
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This is mostly used with Systemd, which really wants to manage
CGroups itself when managing containers via unit file.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This change adds the following annotation to every container created by
podman:
```json
"Annotations": {
"io.containers.manager": "libpod"
}
```
The purpose of this annotation is to indicate which project in the containers
ecosystem is the major manager of a container when applications share
the same storage paths. This way projects can decide if they want to
manipulate the container or not. For example, since CRI-O and podman are
not using the same container library (libpod), CRI-O can skip podman
containers and provide the end user more useful information.
A corresponding end-to-end test has been adapted as well.
Relates to: https://github.com/cri-o/cri-o/pull/2761
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Previously, we only did this for volumes created at the same time
as the container. However, this is not correct behavior - Docker
does so for all named volumes, even those made with
'podman volume create' and mounted into a container later.
Fixes: #3945
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This isn't included in Docker, but seems handy enough.
Use the new API for 'volume rm' and 'volume inspect'.
Fixes: #3891
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We want to get podman info to tell us about the version of
the mount program to help us diagnose issues users are having.
Also if in rootless mode and slirp4netns is installed reveal package
info on slirp4netns.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
If c/storage paths are explicitly set to "" (the empty string) it
will use compiled-in defaults. However, it won't tell us this via
`storage.GetDefaultStoreOptions()` - we just get the empty string
(which can put our defaults, some of which are relative to
c/storage, in a bad spot).
Hardcode a sane default for cases like this. Furthermore, add
some sanity checks to paths, to ensure we don't use relative
paths for core parts of libpod.
Fixes: #3952
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When we fail to remove a container's SHM, that's an error, and we
need to report it as such. This may be part of our lingering
storage woes.
Also, remove MNT_DETACH. It may be another cause of the storage
removal failures.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When volume options and the local volume driver are specified,
the volume is intended to be mounted using the 'mount' command.
Supported options will be used to mount the volume before the
first container using it starts, and unmount the volume after the
last container using it dies.
This should work for any local filesystem, though at present I've
only tested with tmpfs and btrfs.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We need to be able to track the number of times a volume has been
mounted for tmpfs/nfs/etc volumes. As such, we need a mutable
state for volumes. Add one, with the expected update/save methods
in both states.
There is backwards compat here, in that older volumes without a
state will still be accepted.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
In upcoming commits, we're going to turn on the backends for
these fields. Volumes with these set will act fundamentally
differently from other volumes. There will probably be validation
required for each field.
Until now, though, we've freely allowed creation of volumes with
these set - they just did nothing. So we have no idea what could
be in the DB with old volumes.
Change the struct tags so we don't have to worry about old,
unvalidated data. We'll start fresh with new volumes.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Use the inotify backend to be notified on the container exit instead
of continuously polling the runtime. Polling the runtime significantly
slows down the podman execution time for short lived
processes:
$ time bin/podman run --rm -ti fedora true
real 0m0.324s
user 0m0.088s
sys 0m0.064s
from:
$ time podman run --rm -ti fedora true
real 0m4.199s
user 0m5.339s
sys 0m0.344s
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
when cni returns a list of dns servers, we should add them under the
right conditions. the defined conditions are as follows:
- if the user provides dns, it and only it are added.
- if not above and you get a cni name server, it is added and a
forwarding dns instance is created for what was in resolv.conf.
- if not either above, the entries from the host's resolv.conf are used.
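The three conditions above can be sketched as a single selection function. This is an illustration under assumptions: the `dnsPlan` type and `chooseDNS` function are hypothetical, and the forwarding DNS instance is only represented by the list of upstream servers it would relay to.

```go
package main

import "fmt"

// dnsPlan is a hypothetical type for this sketch; it is not libpod's.
type dnsPlan struct {
	Servers  []string // name servers configured for the container
	Forwards []string // upstreams a forwarding DNS instance would relay to
}

// chooseDNS applies the precedence: user-provided, then CNI, then host.
func chooseDNS(user, cni, host []string) dnsPlan {
	switch {
	case len(user) > 0:
		// User-provided DNS wins, and only it is added.
		return dnsPlan{Servers: user}
	case len(cni) > 0:
		// CNI name servers are added; host entries are reached via a forwarder.
		return dnsPlan{Servers: cni, Forwards: host}
	default:
		// Fall back to the host's resolv.conf entries.
		return dnsPlan{Servers: host}
	}
}

func main() {
	host := []string{"192.0.2.53"}
	fmt.Printf("%+v\n", chooseDNS([]string{"198.51.100.1"}, []string{"10.88.0.1"}, host))
	fmt.Printf("%+v\n", chooseDNS(nil, []string{"10.88.0.1"}, host))
	fmt.Printf("%+v\n", chooseDNS(nil, nil, host))
}
```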
Signed-off-by: baude <bbaude@redhat.com>
If I mount, say, /usr/bin into my container - I expect to be able
to run the executables in that mount. Unconditionally applying
noexec would be a bad idea.
Before my patches to change mount options and allow exec/dev/suid
being set explicitly, we inferred the mount options from where on
the base system the mount originated, and the options it had
there. Implement the same functionality for the new option
handling.
There's a lot of performance left on the table here, but I don't
know that this is ever going to take enough time to make it worth
optimizing.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Previously, we explicitly set noexec/nosuid/nodev on every mount,
with no ability to disable them. The 'mount' command on Linux
will accept their inverses without complaint, though - 'noexec'
is counteracted by 'exec', 'nosuid' by 'suid', etc. Add support
for passing these options at the command line to disable our
explicit forcing of security options.
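A rough sketch of the behavior described above, not the actual libpod option parser: the restrictive defaults are appended only when neither the option nor its inverse was passed. The function name `applySecurityDefaults` is hypothetical.

```go
package main

import "fmt"

// applySecurityDefaults appends noexec/nosuid/nodev unless the caller
// already passed that option or its inverse (exec/suid/dev).
func applySecurityDefaults(opts []string) []string {
	seen := make(map[string]bool, len(opts))
	for _, o := range opts {
		seen[o] = true
	}
	out := append([]string{}, opts...) // copy so the input is not modified
	pairs := [][2]string{{"noexec", "exec"}, {"nosuid", "suid"}, {"nodev", "dev"}}
	for _, p := range pairs {
		if !seen[p[0]] && !seen[p[1]] {
			out = append(out, p[0])
		}
	}
	return out
}

func main() {
	fmt.Println(applySecurityDefaults([]string{"rw"}))
	fmt.Println(applySecurityDefaults([]string{"rw", "exec"}))
}
```

With `exec` passed explicitly, only `nosuid` and `nodev` are added.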
This also cleans up mount option handling significantly. We are
still parsing options in more than one place, which isn't good,
but option parsing for bind and tmpfs mounts has been unified.
Fixes: #3819
Fixes: #3803
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This will require a 'podman system renumber' after being applied
to get lock numbers for existing volumes.
Add the DB backend code for rewriting volume configs and use it
for updating lock numbers as part of 'system renumber'.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Decompose() returns an error defined in CNI which has been removed
upstream because it had no in-tree (e.g. in CNI) users.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Support generating systemd unit files for a pod. Podman generates one
unit file for the pod including the PID file for the infra container's
conmon process and one unit file for each container (excluding the infra
container).
Note that this change implies refactorings in the `pkg/systemdgen` API.
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Add the digestfile option to the push command so the digest can
be stored away in a file when requested by the user. Also add
a debug statement to show the completion of the push.
Emulates Buildah's https://github.com/containers/buildah/pull/1799/files
Signed-off-by: TomSweeneyRedHat <tsweeney@redhat.com>
Before, if the container was run with a specified user that wasn't root, exec would fail because the user was always set to root unless respecified by the user.
Instead, inherit the user from the container start.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
drop the pkg/firewall module and start using the firewall CNI plugin.
It requires an updated package for CNI plugins.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
podman stats does not work in rootless environments with cgroups V1.
Fix error message and document this fact.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This is a breaking change and modifies the resulting image name when
pulling from a directory via `oci:...`.
Without this patch, the image names pulled via a local directory got
processed incorrectly, like this:
```
> podman pull oci:alpine
> podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/oci alpine 4fa153a82426 5 weeks ago 5.85 MB
```
We now use the same approach as in the corresponding [buildah fix][1] to
adapt the behavior for correct `localhost/` prefixing.
[1]: https://github.com/containers/buildah/pull/1800
After applying the patch the same OCI image pull looks like this:
```
> ./bin/podman pull oci:alpine
> podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/alpine latest 4fa153a82426 5 weeks ago 5.85 MB
```
End-to-end tests have been adapted as well to cover the added scenario.
Relates to: https://github.com/containers/buildah/issues/1797
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
add ability to not activate sd_notify when running under varlink as it
causes deadlocks and hangs.
Fixes: #3572
Signed-off-by: baude <bbaude@redhat.com>
In the case where the host has a large journald, iterating the journal
without using a Match performs very poorly. This might be a
temporary fix while we figure out why the systemd library does not seem to
behave properly.
Signed-off-by: baude <bbaude@redhat.com>
Remove GraphDriver.Data.MergedDir from the result of podman inspect if the container is not mounted, because the /var/lib/containers/.../merged directory is no longer created by default; it only exists during the scope of podman mount.
Signed-off-by: Qi Wang <qiwan@redhat.com>
JSON optimizes it out in that case anyways, so don't waste cycles
doing an Itoa (and Atoi on the decode side).
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We weren't actually storing this, so we'd lose the exit code for
containers run with --rm or force-removed while running if the
journald backend for events was in use.
Fixes: #3795
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Set the string to "libpod/VERSION" so that we don't use the unspecific
default of "Go-http-client/xxx".
Fixes: #3788
Signed-off-by: Stefan Becker <chemobejk@gmail.com>
when creating the default libpod.conf file, be sure the default OCI
runtime is cherry picked from the system configuration.
Closes: https://github.com/containers/libpod/issues/3781
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Requirement from https://github.com/containers/libpod/issues/3575#issuecomment-512238393
Added --pull for podman create and run to match the newly added flag in docker CLI.
`missing`: default value, podman will pull the image if it does not exist in local storage.
`always`: podman will always pull the image.
`never`: podman will never pull the image.
Signed-off-by: Qi Wang <qiwan@redhat.com>
After restoring a container with a different name (ID) the ConmonPidFile
was still pointing to the path of the original container.
This means that the last restored container will overwrite the
ConmonPidFile of the original container. It was also not possible to
restore a container with a new name (ID) if the original container was
not running.
The ConmonPidFile is only changed if the ConmonPidFile starts with the
value of RunRoot. This assumes that if RunRoot is part of ConmonPidFile
the user did not specify '--conmon-pidfile' during run or create.
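The prefix check can be sketched like this. The function name `restoredPidFile` and all paths are hypothetical and serve only to illustrate the rule; this is not the libpod code.

```go
package main

import (
	"fmt"
	"strings"
)

// restoredPidFile returns the pid file path to use after a restore under a
// new ID. A path below runRoot is assumed to be an automatically generated
// default and is replaced; any other path is assumed to be user-specified
// and is kept.
func restoredPidFile(current, runRoot, defaultForNewID string) string {
	if strings.HasPrefix(current, runRoot) {
		return defaultForNewID
	}
	return current
}

func main() {
	// Hypothetical paths for illustration.
	runRoot := "/run/user/1000/containers"
	old := runRoot + "/OLDID/userdata/conmon.pid"
	fresh := runRoot + "/NEWID/userdata/conmon.pid"

	fmt.Println(restoredPidFile(old, runRoot, fresh))           // default path is rewritten
	fmt.Println(restoredPidFile("/srv/my.pid", runRoot, fresh)) // user path is kept
}
```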
Signed-off-by: Adrian Reber <areber@redhat.com>
in the case where we rmi an image that has only one reponame, we print
out an untagged reponame message.
$ sudo podman rmi busybox
Untagged: docker.io/library/busybox:latest
Deleted: db8ee88ad75f6bdc74663f4992a185e2722fa29573abcc1a19186cc5ec09dceb
Signed-off-by: baude <bbaude@redhat.com>
it looks like the core-os systemd library has some issue when using
seektail and add match. this patch works around that shortcoming for
the time being.
Fixes: #3616
Signed-off-by: baude <bbaude@redhat.com>
commit 223fe64dc0 introduced the
regression.
When running on cgroups v1, bind mount only /sys/fs/cgroup/systemd as
rw, as the code did earlier.
Also, simplify the rootless code as it doesn't require any special
handling when using --systemd.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1737554
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
A container restored from an exported checkpoint did not have its
StartedTime set, which resulted in a status like 'Up 292 years ago'
after the restore.
This just sets the StartedTime to time.Now() if a container is restored
from an exported checkpoint.
Signed-off-by: Adrian Reber <areber@redhat.com>
Old versions of conmon have a bug where they create the exit file before
closing open file descriptors causing a race condition when restarting
containers with open ports since we cannot bind the ports as they're not
yet closed by conmon.
Killing the old conmon PID is ~okay since it forces the FDs of old
conmons to be closed, while it's a NOP for newer versions which should
have exited already.
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
we should be looking for the libpod.conf file in /usr/share/containers
and not in /usr/local. packages of podman should drop the default
libpod.conf in /usr/share. the override remains /etc/containers/ as
well.
Fixes: #3702
Signed-off-by: baude <bbaude@redhat.com>
Begin to separate the internal structures and frontend for
inspect on volumes. We can't rely on keeping internal data
structures for external presentation - separating presentation
and internal data format is good practice.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
There are two cases where logdriver can be empty: if it wasn't set by libpod, or if the user did --log-driver "".
The latter case is an odd one, and the former is very possible and already handled for LogPath.
Instead of printing an error for an entirely reasonable codepath, let's suppress the error.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
If a container is restored multiple times from an exported checkpoint
with the help of '--import --name', the restore will fail if during
'podman run' a static container IP was set with '--ip'. The user can
tell the restore process to ignore the static IP with
'--ignore-static-ip'.
Signed-off-by: Adrian Reber <areber@redhat.com>
close https://bugzilla.redhat.com/show_bug.cgi?id=1732280
From the bug: Podman search returns 25 results even when the limit option `--limit` is larger than 25 (maxQueries). They want Podman to return `--limit` results.
This PR fixes the number of output results.
if --limit is not set, return MIN(maxQueries, len(res))
if --limit is set, return MIN(option, len(res))
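The two rules above can be sketched as one function. The name `resultCount` is hypothetical, and treating a non-positive limit as "not set" is an assumption of this sketch; the value 25 for maxQueries comes from the description above.

```go
package main

import "fmt"

const maxQueries = 25

// resultCount returns how many results to output. A limit <= 0 is treated
// as "not set" (an assumption of this sketch).
func resultCount(limit, found int) int {
	upper := maxQueries
	if limit > 0 {
		upper = limit
	}
	if found < upper {
		return found
	}
	return upper
}

func main() {
	fmt.Println(resultCount(0, 100))  // no limit: capped at maxQueries
	fmt.Println(resultCount(50, 100)) // limit above 25 is honored
	fmt.Println(resultCount(50, 30))  // fewer results than the limit
}
```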
Signed-off-by: Qi Wang <qiwan@redhat.com>
Somehow this managed to slip through the cracks, but this is
definitely something inspect should print.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
The `$PATH` environment variable will now be used as fallback if no valid
runtime or conmon path matches. The debug logs have been updated to state
the used executable.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
when running on a cgroups v2 system, do not bind mount
the named hierarchy /sys/fs/cgroup/systemd as it doesn't exist
anymore. Instead bind mount the entire /sys/fs/cgroup.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
When forcibly removing a container, we are initiating an explicit
stop of the container, which is not reflected in 'podman events'.
Swap to using our standard 'stop()' function instead of a custom
one for force-remove, and move the event into the internal stop
function (so internal calls also register it).
This does add one more database save() to `podman remove`. This
should not be a terribly serious performance hit, and does have
the desirable side effect of making things generally safer.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We need this specifically for tests, but others may find it
useful if they don't explicitly need events and don't want the
performance implications of using them.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
the exit file
If the container exit code needs to be retained, it cannot be retained
in tmpfs, because libpod runs in a memcg itself so it can't leave
traces with a daemon-less design.
This wasn't a memleak detectable by kmemleak for example. The kernel
never lost track of the memory and there was no erroneous refcounting
either. The reference count dependencies however are not easy to track
because when a refcount is increased, there's no way to tell who's
still holding the reference. In this case it was a single page of
tmpfs pagecache holding a refcount that kept pinned a whole hierarchy
of dying memcg, slab kmem, cgroups, unreachable kernfs nodes and the
respective dentries and inodes. Such a problem wouldn't happen if the
exit file was stored in a regular filesystem because the pagecache
could be reclaimed in such case under memory pressure. The tmpfs page
can be swapped out, but that's not enough to release the memcg with
CONFIG_MEMCG_SWAP_ENABLED=y.
No amount of more aggressive kernel slab shrinking could have solved
this. Not even assigning slab kmem of dying cgroups to alive cgroup
would fully solve this. The only way to free the memory of a dying
cgroup when a struct page still references it, would be to loop over
all "struct page" in the kernel to find which one is associated with
the dying cgroup which is an O(N) operation (where N is the number of
pages and can reach billions). Linking all the tmpfs pages to the
memcg would cost less during memcg offlining, but it would waste lots
of memory and CPU globally. So this can't be optimized in the kernel.
A cronjob running this command can act as workaround and will allow
all slab cache to be released, not just the single tmpfs pages.
rm -f /run/libpod/exits/*
This patch solved the memleak with a reproducer, booting with
cgroup.memory=nokmem and with selinux disabled. The reason memcg kmem
and selinux were disabled for testing of this fix, is because kmem
greatly decreases the kernel effectiveness in reusing partial slab
objects. cgroup.memory=nokmem is strongly recommended at least for
workstation usage. selinux needs to be further analyzed because it
causes further slab allocations.
The upstream podman commit used for testing is
1fe2965e4f (v1.4.4).
The upstream kernel commit used for testing is
f16fea666898dbdd7812ce94068c76da3e3fcf1e (v5.2-rc6).
Reported-by: Michele Baldessari <michele@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
<Applied with small tweaks to comments>
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
In order to run Podman with VM-based runtimes unprivileged, the
network must be set up prior to the container creation. Therefore
this commit modifies Podman to run rootless containers by:
1. create a network namespace
2. pass the netns persistent mount path to the slirp4netns
to create the tap interface
3. pass the netns path to the OCI spec, so the runtime can
enter the netns
Closes: #2897
Signed-off-by: Gabi Beyer <gabrielle.n.beyer@intel.com>
NixOS links the current system state to `/run/current-system`, so we
have to add these paths to the configuration files as well to work out
of the box.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Touch up a number of formatting issues for XDG_RUNTIME_DIRS in a number
of man pages. Make use of the XDG_CONFIG_HOME environment variable
in a rootless environment if available, or set it if not.
Also added a number of links to the Rootless Podman config page and
added the location of the auth.json files to that doc.
Signed-off-by: TomSweeneyRedHat <tsweeney@redhat.com>
We should not be fuzzy matching on volume names. Docker doesn't
do it, and it doesn't make much sense. Everything requires exact
matches for names - only IDs allow partial matches.
Fixes: #3635
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When a command (like `ps`) requests no store be created, but also
requires a refresh be performed, we have to ignore its request
and initialize the store anyways to prevent segfaults. This work
was done in #3532, but that missed one thing - initializing a
storage service. Without the storage service, Podman will still
segfault. Fix that oversight here.
Fixes: #3625
Signed-off-by: Matthew Heon <mheon@redhat.com>
There's no way to get the error if we successfully get an exit code (as it's just printed to stderr instead).
Instead of relying on the error to be passed to podman, and editing based on the error code, process it on the varlink side.
Also move error codes to the define package.
Signed-off-by: Peter Hunt <pehunt@redhat.com>