In order to run Podman with VM-based runtimes unprivileged, the
network must be set up prior to the container creation. Therefore
this commit modifies Podman to run rootless containers by:
1. create a network namespace
2. pass the netns persistent mount path to the slirp4netns
to create the tap inferface
3. pass the netns path to the OCI spec, so the runtime can
enter the netns
Closes#2897
Signed-off-by: Gabi Beyer <gabrielle.n.beyer@intel.com>
allow a container to run in a new cgroup namespace.
When running in a new cgroup namespace, the current cgroup appears to
be the root, so that there is no way for the container to access
cgroups outside of its own subtree.
By default it uses --cgroup=host to keep the previous behavior.
To create a new namespace, --cgroup=private must be provided.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We can infer no-new-privileges. For now, manually populate
seccomp (can't infer what file we sourced from) and
SELinux/Apparmor (hard to tell if they're enabled or not).
Signed-off-by: Matthew Heon <mheon@redhat.com>
When we first began writing Podman, we ran into a major issue
when implementing Inspect. Libpod deliberately does not tie its
internal data structures to Docker, and stores most information
about containers encoded within the OCI spec. However, Podman
must present a CLI compatible with Docker, which means it must
expose all the information in 'docker inspect' - most of which is
not contained in the OCI spec or libpod's Config struct.
Our solution at the time was the create artifact. We JSON'd the
complete CreateConfig (a parsed form of the CLI arguments to
'podman run') and stored it with the container, restoring it when
we needed to run commands that required the extra info.
Over the past month, I've been looking more at Inspect, and
refactored large portions of it into Libpod - generating them
from what we know about the OCI config and libpod's (now much
expanded, versus previously) container configuration. This path
comes close to completing the process, moving the last part of
inspect into libpod and removing the need for the create
artifact.
This improves libpod's compatability with non-Podman containers.
We no longer require an arbitrarily-formatted JSON blob to be
present to run inspect.
Fixes: #3500
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
when the container is running in a user namespace, check if gid=5 is
available, otherwise drop the option gid=5 for /dev/pts.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
add a simple way to copy ulimit values from the host.
if --ulimit host is used then the current ulimits in place are copied
to the container.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
the compilation demands of having libpod in main is a burden for the
remote client compilations. to combat this, we should move the use of
libpod structs, vars, constants, and functions into the adapter code
where it will only be compiled by the local client.
this should result in cleaner code organization and smaller binaries. it
should also help if we ever need to compile the remote client on
non-Linux operating systems natively (not cross-compiled).
Signed-off-by: baude <bbaude@redhat.com>
These are only used on OS X Docker, and ignored elsewhere - but
since they are ignored, they're guaranteed to be safe everywhere,
and people are using them.
Fixes: #3340
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We made changes earlier that empty storage options when setting
storage driver explicitly. Unfortunately, this breaks rootless
cleanup commands, as they lose the fuse-overlayfs mount program
path.
Fix this by passing along the storage options to the cleanup
process.
Also, fix --syslog, which was broken a while ago (probably when
we broke up main to add main_remote).
Fixes#3326
Signed-off-by: Matthew Heon <mheon@redhat.com>
When we supercede low-priority mounts and volumes (image volumes,
and volumes sourced from --volumes-from) with higher-priority
ones (the --volume and --mount flags), we always replaced
lower-priority mounts of the same type (e.g. a user mount to
/tmp/test1 would supercede a volumes-from mount to the same
destination). However, we did not supercede the opposite type - a
named volume from image volumes at /tmp/test1 would be allowed to
remain and create a conflict, preventing container creation.
Solve this by destroying opposite types before merging (we can't
do it in the same loop, as then named volumes, which go second,
might trample changes made by mounts).
Fixes#3174
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
force the resources block to be empty instead of having default
values.
Regression introduced by 8e88461511
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We were always raising an error when the rootless user attempted to
setup resources, but this is not the case anymore with cgroup v2.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The on-failure restart option supports restarting only a given
number of times. To do this, we need one additional field in the
DB to track restart count (which conveniently fills a field in
Inspect we weren't populating), plus some plumbing logic.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This ensures that all tmpfs mounts added by the user, even with
the --mount flag, share a few common options (nosuid, noexec,
nodev), and options for tmpfs mounts are properly validated to
ensure they are correct.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
As part of this, move bind mount option validity parsing and
modification (adding e.g. rbind on bind mounts that are missing
it), which requires test changes (expected values have changed).
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We were unconditionally resetting volume mount options for all
mount points (and by the looks of things, completely dropping
tmpfs mounts), which was causing runc to refuse to run containers
and all the tests to consequently fail.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Several changes made in the interface of pkg/spec make
interacting with it without a runtime difficult to impossible,
so move the existing limited testing from cmd/podman (which
mostly tested pkg/spec) into pkg/spec itself where we can call
individual functions that don't break things.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Unify handling for the --volume, --mount, --volumes-from, --tmpfs
and --init flags into a single file and set of functions. This
will greatly improve readability and maintainability.
Further, properly handle superceding and conflicting mounts. Our
current patchwork has serious issues when mounts conflict, or
when a mount from --volumes-from or an image volume should be
overwritten by a user volume or named volume.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Play kube was passing the pod, but CreateConfig was not. Unify it
so they both do, so we can remove some unnecessary duplicate
lookup code.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
The goal here is to keep only the configuration directly used to
build the container in CreateConfig, and scrub temporary state
and helpers that we need to generate. We'll keep those internally
in MakeContainerConfig.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Right now, there are two major API calls necessary to turn a
filled-in CreateConfig into the options and OCI spec necessary to
make a libpod Container. I'm intending on refactoring both of
these extensively to unify a few things, so make a common
frontend to both that will prevent API changes from leaking out
of the package.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
build a podman-remote binary for windows that allows users to use the
remote client on windows and interact with podman on linux system.
Signed-off-by: baude <bbaude@redhat.com>
The --read-only-tmpfs option caused podman to mount tmpfs on /run, /tmp, /var/tmp
if the container is running int read-only mode.
The default is true, so you would need to execute a command like
--read-only --read-only-tmpfs=false to turn off this behaviour.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
We were never using it. It's actually a potentially quite sizable
field (very expensive to decode an array of structs!). Removing
it should do no harm.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
The flag should be substantially more durable, and no longer
relies on the create artifact.
This should allow it to properly handle our new named volume
implementation.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Now that named volumes must be explicitly enumerated rather than
passed in with all other volumes, we need to split normal and
named volumes up before passing them into libpod. This PR does
this.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
in the few places where we care about skipping the storage
initialization, we can simply use the process effective UID, instead
of relying on a global boolean flag.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We accidentally patched this out trying to enable ns:/path/to/ns
This should restore the ability to configure nondefault CNI
networks with Podman, by ensuring that they request creation of a
network namespace.
Completely remove the WithNetNS() call when we do use an explicit
namespace from a path. We use that call to indicate that a netns
is going to be created - there should not be any question about
whether it actually does.
Fixes#2795
Signed-off-by: Matthew Heon <mheon@redhat.com>
We have a very high performance JSON library that doesn't need to
perform code generation. Let's use it instead of our questionably
performant, reflection-dependent deep copy library.
Most changes because some functions can now return errors.
Also converts cmd/podman to use jsoniter, instead of pkg/json,
for increased performance.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Currently cobra can not handle a boolean option without a vailue.
This change fixes an issue if you want syslog information to show up
based on the cleanup call.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: TomSweeneyRedHat <tsweeney@redhat.com>
Vendors in fsouza/docker-client, docker/docker and
a few more related. Of particular note, changes to the TweakCapabilities()
function from docker/docker along with the parse.IDMappingOptions() function
from Buildah. Please pay particular attention to the related changes in
the call from libpod to those functions during the review.
Passes baseline tests.
integration of healthcheck into create and run as well as inspect.
healthcheck enhancements are as follows:
* add the following options to create|run so that non-docker images can
define healthchecks at the container level.
* --healthcheck-command
* --healthcheck-retries
* --healthcheck-interval
* --healthcheck-start-period
* podman create|run --healthcheck-command=none disables healthcheck as
described by an image.
* the healthcheck itself and the healthcheck "history" can now be
observed in podman inspect
* added the wiring for healthcheck history which logs the health history
of the container, the current failed streak attempts, and log entries
for the last five attempts which themselves have start and stop times,
result, and a 500 character truncated (if needed) log of stderr/stdout.
The timings themselves are not implemented in this PR but will be in
future enablement (i.e. next).
Signed-off-by: baude <bbaude@redhat.com>
Currently if you turn on --net=host on a rootless container
and have selinux-policy installed in the image, tools running with
SELinux will see that the system is SELinux enabled in rootless mode.
This patch mounts a tmpfs over /sys/fs/selinux blocking this behaviour.
This patch also fixes the fact that if you shared --pid=host we were not
masking over certin /proc paths.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Add the ability to manually run a container's healthcheck command.
This is only the first phase of implementing the healthcheck.
Subsequent pull requests will deal with the exposing the results and
history of healthchecks as well as the scheduling.
Signed-off-by: baude <bbaude@redhat.com>
if there is already a bind mount specified for the target, do not
create a new volume.
Regression introduced by 52df1fa7e0
Closes: https://github.com/containers/libpod/issues/2441
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This is a workaround for the runc issue:
https://github.com/opencontainers/runc/issues/1247
If the source of a bind mount has any of nosuid, noexec or nodev, be
sure to propagate them to the bind mount so that when runc tries to
remount using MS_RDONLY, these options are also used.
Closes: https://github.com/containers/libpod/issues/2312
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
If user specifies network namespace and the /etc/netns/XXX/resolv.conf
exists, we should use this rather then /etc/resolv.conf
Also fail cleaner if the user specifies an invalid Network Namespace.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
iFix builtin volumes to work with podman volume
Currently builtin volumes are not recored in podman volumes when
they are created automatically. This patch fixes this.
Remove container volumes when requested
Currently the --volume option on podman remove does nothing.
This will implement the changes needed to remove the volumes
if the user requests it.
When removing a volume make sure that no container uses the volume.
Signed-off-by: Daniel J Walsh dwalsh@redhat.com
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
At present, when manually detaching from an attached container
(using the detach hotkeys, default C-p C-q), Podman will still
wait for the container to exit to obtain its exit code (so we can
set Podman's exit code to match). This is correct in the case
where attach finished because the container exited, but very
wrong for the manual detach case.
As a result of this, we can no longer guarantee that the cleanup
and --rm functions will fire at the end of 'podman run' - we may
be exiting before we get that far. Cleanup is easy enough - we
swap to unconditionally using the cleanup processes we've used
for detached and rootless containers all along. To duplicate --rm
we need to also teach 'podman cleanup' to optionally remove
containers instead of cleaning them up.
(There is an argument for just using 'podman rm' instead of
'podman cleanup --rm', but cleanup does have different semantics
given that we only ever expect it to run when the container has
just exited. I think it might be useful to keep the two separate
for things like 'podman events'...)
Signed-off-by: Matthew Heon <mheon@redhat.com>
when running in rootless mode we were unconditionally overriding
/dev/pts to take ride of gid=5. This is not needed when multiple gids
are present in the namespace, which is always the case except when
running the tests suite with only one mapping. So change it to check
how many gids are present before overriding the default mount.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
add support for ports redirection from the host.
It needs slirp4netns v0.3.0-alpha.1.
Closes: https://github.com/containers/libpod/issues/2081
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
When using --pid=host don't try to cover /proc paths, as they are
coming from the /proc bind mounted from the host.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
when defining containers, we missed the conditional logic to allow
the container to be defined with "WithPod" and so forth. I had to
slightly modify the createcontainer process to pass a libpod.Pod
that could override things; use nil as no pod.
Signed-off-by: baude <bbaude@redhat.com>
the rootless container storage is always mounted in a different mount
namespace, owned by the unprivileged user. Even if it is mounted, a
process running in another namespace cannot reuse the already mounted
storage.
Make sure the storage is always cleaned up once the container
terminates.
This has worked with vfs since there is no real mounted storage.
Closes: https://github.com/containers/libpod/issues/2112
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Apply the default AppArmor profile at container initialization to cover
all possible code paths (i.e., podman-{start,run}) before executing the
runtime. This allows moving most of the logic into pkg/apparmor.
Also make the loading and application of the default AppArmor profile
versio-indepenent by checking for the `libpod-default-` prefix and
over-writing the profile in the run-time spec if needed.
The intitial run-time spec of the container differs a bit from the
applied one when having started the container, which results in
displaying a potentially outdated AppArmor profile when inspecting
a container. To fix that, load the container config from the file
system if present and use it to display the data.
Fixes: #2107
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Updating the vendor or runc to pull in some fixes that we need.
In order to get this vendor to work, we needed to update the vendor
of docker/docker, which causes all sorts of issues, just to fix
the docker/pkg/sysinfo. Rather then doing this, I pulled in pkg/sysinfo
into libpod and fixed the code locally.
I then switched the use of docker/pkg/sysinfo to libpod/pkg/sysinfo.
I also switched out the docker/pkg/mount to containers/storage/pkg/mount
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Add support for executing an init binary as PID 1 in a container to
forward signals and reap processes. When the `--init` flag is set for
podman-create or podman-run, the init binary is bind-mounted to
`/dev/init` in the container and "/dev/init --" is prepended to the
container's command.
The default base path of the container-init binary is `/usr/libexec/podman`
while the default binary is catatonit [1]. This default can be changed
permanently via the `init_path` field in the `libpod.conf` configuration
file (which is recommended for packaging) or temporarily via the
`--init-path` flag of podman-create and podman-run.
[1] https://github.com/openSUSE/catatonitFixes: #1670
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
We had two problems with /dev/shm, first, you mount the
container read/only then /dev/shm was mounted read/only.
This is a bug a tmpfs directory should be read/write within
a read-only container.
The second problem is we were ignoring users mounted /dev/shm
from the host.
If user specified
podman run -d -v /dev/shm:/dev/shm ...
We were dropping this mount and still using the internal mount.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Podman will search through the directory and will add any device
nodes that it finds. If no devices are found we return an error.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
when a user specifies --pod to podman create|run, we should create that pod
automatically. the port bindings from the container are then inherited by
the infra container. this signicantly improves the workflow of running
containers inside pods with podman. the user is still encouraged to use
podman pod create to have more granular control of the pod create options.
Signed-off-by: baude <bbaude@redhat.com>
we need to allow users to expose ports to the host for the purposes
of networking, like a webserver. the port exposure must be done at
the time the pod is created.
strictly speaking, the port exposure occurs on the infra container.
Signed-off-by: baude <bbaude@redhat.com>
We are still requiring oci-systemd-hook to be installed in order to run
systemd within a container. This patch properly mounts
/sys/fs/cgroup/systemd/libpod_parent/libpod-UUID on /sys/fs/cgroup/systemd inside of container.
Since we need the UUID of the container, we needed to move Systemd to be a config option of the
container.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
As of now, there is no way to debug podman clean up processes.
They are started by conmon with no stdout/stderr and log nowhere.
This allows us to actually figure out what is going on when a
cleanup process runs.
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Add the --ip flag back with bash completions. Manpages still
missing.
Add plumbing to pass appropriate the appropriate option down to
libpod to connect the flag to backend logic added in the previous
commits.
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Also update some missing fields libpod.conf obtions in man pages.
Fix sort order of security options and add a note about disabling
labeling.
When a process requests a new label. libpod needs to reserve all
labels to make sure that their are no conflicts.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1406
Approved by: mheon
Firstly, when adding the privileged catch-all resource device,
first remove the spec's default catch-all resource device.
Second, remove our default rootfs propogation config - Docker
does not set this by default, so I don't think we should either.
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Closes: #1491
Approved by: TomSweeneyRedHat
This matches Docker behavior more closely and should resolve an
issue we were seeing with /sys mounts
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Closes: #1465
Approved by: rhatdan
This is an incomplete fix, as it would be best for the libpod library to be in charge of coordinating the container's dependencies on the infra container. A TODO was left as such. UTS is a special case, because the docker library that namespace handling is based off of doesn't recognize a UTS based on another container as valid, despite the library being able to handle it correctly. Thus, it is left in the old way.
Signed-off-by: haircommander <pehunt@redhat.com>
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1347
Approved by: mheon
We should be sharing cgroups namespace by default in pods
uts namespace sharing was broken in pods.
Create a new libpod/pkg/namespaces for handling of namespace fields
in containers
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1418
Approved by: mheon
When there was a conflict between a user-added volume and a mount
already in the spec, we previously respected the mount already in
the spec and discarded the user-added mount. This is counter to
expected behavior - if I volume-mount /dev into the container, I
epxect it will override the default /dev in the container, and
not be ignored.
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Closes: #1419
Approved by: TomSweeneyRedHat
When user-specified volume mounts overlap with mounts already in
the spec, remove the mount in the spec to ensure there are no
conflicts.
Signed-off-by: Matthew Heon <matthew.heon@gmail.com>
Closes: #1419
Approved by: TomSweeneyRedHat
Unfortunately this is not enough to get it working as runc doesn't
allow to bind mount /proc.
Depends on: https://github.com/opencontainers/runc/pull/1832
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #1349
Approved by: rhatdan
Fix the test for checking when /sys must be bind mounted from the
host. It should be done only when userNS are enabled (the
!UsernsMode.IsHost() check is not enough for that).
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #1349
Approved by: rhatdan
Also it fix the issue of exposing both tc/udp port even if
only one proto specified.
Signed-off-by: Kunal Kushwaha <kushwaha_kunal_v7@lab.ntt.co.jp>
Closes: #1325
Approved by: mheon
As well as small style corrections, update pod_top_test to use CreatePod, and move handling of adding a container to the pod's namespace from container_internal_linux to libpod/option.
Signed-off-by: haircommander <pehunt@redhat.com>
Closes: #1187
Approved by: mheon
A pause container is added to the pod if the user opts in. The default pause image and command can be overridden. Pause containers are ignored in ps unless the -a option is present. Pod inspect and pod ps show shared namespaces and pause container. A pause container can't be removed with podman rm, and a pod can be removed if it only has a pause container.
Signed-off-by: haircommander <pehunt@redhat.com>
Closes: #1187
Approved by: mheon
Devices are supposed to be able to be passed in via the form of
--device /dev/foo
--device /dev/foo:/dev/bar
--device /dev/foo:rwm
--device /dev/foo:/dev/bar:rwm
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1299
Approved by: umohnani8
Do not set any hostname value in the OCI configuration when --uts=host
is used and the user didn't specify any value. This prevents an error
from the OCI runtime as it cannot set the hostname without a new UTS
namespace.
Differently, the HOSTNAME environment variable is always set. When
--uts=host is used, HOSTNAME gets the value from the host.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #1280
Approved by: baude
Need to get some small changes into libpod to pull back into buildah
to complete buildah transition.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1270
Approved by: mheon
Hostname should be set to the hosts hostname when network is none.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Closes: #1274
Approved by: giuseppe
If more than one volume was mounted using the --volume flag in
podman run, the second and onwards volumes were picking up options
of the previous volume mounts defined. Found out that the options were
not be cleared out after every volume was parsed.
Signed-off-by: umohnani8 <umohnani@redhat.com>
Closes: #1142
Approved by: mheon
This is a refresh of Dan William's PR #974 with a rebase and proper
vendoring of ocicni and containernetworking/cni. It adds the ability
to define multiple networks as so:
podman run --network=net1,net2,foobar ...
Signed-off-by: baude <bbaude@redhat.com>
Closes: #1082
Approved by: baude
podman now supports --volumes-from flag, which allows users
to add all the volumes an existing container has to a new one.
Signed-off-by: umohnani8 <umohnani@redhat.com>
Closes: #931
Approved by: mheon