We don't want to leak our mounts to the host, but we still want to
receive mount/umount events from the host. This way, when a fs is
unmounted on the host, we don't keep it open in aardvark-dns.
Fixes: containers/podman#25994
Fixes: 2589ef49aa ("libnetwork/rootlessnetns: make mountns tree private")
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It is possible that the netns file where we bind mount the netns already
exists. This can happen if a previous setup process was killed between
creating the file and mounting to it. Or, likely more common and as
described in the podman issue, it happens if the runroot is not a tmpfs
and not deleted after boot.
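A minimal sketch of tolerating a leftover mount target, assuming the file is created exclusively; ensureMountTarget and the paths are hypothetical names for illustration, not the actual code in this change:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ensureMountTarget (hypothetical name) creates the empty file that the
// netns would be bind mounted onto. A file left over from an interrupted
// setup, or from a runroot that survived a reboot, is treated as success.
func ensureMountTarget(path string) error {
	f, err := os.OpenFile(path, os.O_RDONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		if errors.Is(err, fs.ErrExist) {
			// The file already exists, so it can be reused as mount target.
			return nil
		}
		return err
	}
	return f.Close()
}

// demoExistingTarget calls ensureMountTarget twice on the same path; the
// second call hits the "already exists" case.
func demoExistingTarget() error {
	dir, err := os.MkdirTemp("", "netns-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "rootless-netns")
	for i := 0; i < 2; i++ {
		if err := ensureMountTarget(target); err != nil {
			return fmt.Errorf("attempt %d: %w", i+1, err)
		}
	}
	return nil
}

func main() {
	if err := demoExistingTarget(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("existing mount target was tolerated")
}
```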
Fixes: containers/podman#25144
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When using the rootless netns (bridge mode), podman so far ignored the
proper pasta or slirp4netns DNS server for networks without aardvark-dns.
This is not good. We should try to use them by default, and with the new
MapGuestAddr option we need to use that as well for
host.containers.internal. The problem is that we only know which options
we used at the time we started the process. Later container starts from
a new podman process do not see these options if we just cache the
result in memory. So in order to make all following podman processes
aware, we serialize this info struct as JSON and later processes read it
when needed.
It also means we do not have to look up the netns IP every time, so I
removed that code.
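A rough sketch of the idea, assuming the info is stored as a JSON file next to the netns; the netnsInfo type, its fields, the file name, and the addresses are hypothetical and do not reflect the real struct:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// netnsInfo is a hypothetical stand-in for the info struct; the real
// field names may differ.
type netnsInfo struct {
	DNSServers  []string `json:"dnsServers"`
	MapGuestIPs []string `json:"mapGuestIps"`
}

// writeInfo is called by the process that started pasta/slirp4netns and
// therefore knows which options were used.
func writeInfo(path string, info *netnsInfo) error {
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// readInfo is called by later processes that did not start the netns
// program and so cannot rely on an in-memory cache.
func readInfo(path string) (*netnsInfo, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	info := new(netnsInfo)
	if err := json.Unmarshal(data, info); err != nil {
		return nil, err
	}
	return info, nil
}

// demoInfoRoundTrip writes the info and reads it back, as a second
// process would. The addresses are documentation-only examples.
func demoInfoRoundTrip() (*netnsInfo, error) {
	dir, err := os.MkdirTemp("", "netns-info")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "info.json")
	in := &netnsInfo{
		DNSServers:  []string{"192.0.2.1"},
		MapGuestIPs: []string{"192.0.2.2"},
	}
	if err := writeInfo(path, in); err != nil {
		return nil, err
	}
	return readInfo(path)
}

func main() {
	info, err := demoInfoRoundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("dns servers read back:", info.DNSServers)
}
```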
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The Run() function is used to run long-running commands in the netns,
namely podman unshare --rootless-netns uses it. As such the function
actually unlocks for the main command, as otherwise a user could hold the
lock forever, effectively causing deadlocks.
Now because we unlock, the ref count might change during that time. And
just because we created the netns doesn't mean there are no other users
of the netns. Therefore the cleanup in runInner() was wrong in that
case, causing problems for other running containers.
To fix this, make sure we do not clean up in the Run() case unless the
count is 0.
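The rule can be illustrated with a simplified in-memory model; the netns type and its methods here are hypothetical and leave out the locking and the on-disk ref count:

```go
package main

import "fmt"

// netns is a hypothetical, simplified model of the rootless netns state.
type netns struct {
	refCount int  // number of containers currently using the netns
	exists   bool // whether the netns is set up
}

// setup models a container start: it creates the netns on demand and
// takes a reference.
func (n *netns) setup() {
	n.exists = true
	n.refCount++
}

// run models the Run() case: the netns is created on demand, the lock
// would be released while cmd runs, and afterwards the netns is only
// removed if nobody holds a reference.
func (n *netns) run(cmd func()) {
	if !n.exists {
		n.exists = true
	}
	// The lock is not held here, so other users may call setup().
	cmd()
	// Check the count again instead of assuming "we created it, so we
	// may remove it".
	if n.refCount == 0 {
		n.exists = false
	}
}

func main() {
	idle := &netns{}
	idle.run(func() {})
	fmt.Println("no other users, netns kept:", idle.exists)

	busy := &netns{}
	// A container starts while the command is running.
	busy.run(func() { busy.setup() })
	fmt.Println("container started meanwhile, netns kept:", busy.exists)
}
```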
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When I wrote this originally I thought we must avoid leaking the netns,
so I tried to decrement first. However, now I think this is wrong because,
if the cleanup function returned an error, podman actually calls into it
again on the next cleanup attempt. As such we ended up doing a double
decrement and the ref counter went below zero, causing all sorts of issues[1].
Now if we have a bug the other way around, where we do not decrement
correctly, this is much less of a problem. It simply means we leak one
netns file and the pasta/slirp4netns process, which isn't a problem other
than using a few resources.
[1] https://github.com/containers/podman/issues/21569
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Currently it does the mkdir only implicitly because the code creates
run/systemd but this only happens when /run/systemd exists on the host.
As such the rootless code was broken on all non-systemd distros[1].
[1] https://github.com/containers/podman/discussions/22903
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When using the bridge network mode as rootless we use the rootless netns
logic; for podman this looks just like using bridge as root. The
issue is, however, that due to the extra namespace we block certain
addresses there.
This can be seen best with pasta but actually affects other cases too.
The podman logic tries to use any host IP address for
host.containers.internal, but we must make sure to exclude all these
addresses in the rootless netns as they are not actually in the hostns
and thus cause great confusion.
For the --network pasta case I already fixed this by returning the IPs on
the pasta.Setup2() call in 83573fa60c.
For the bridge mode this is more complicated due to several layers of
function calls. I decided to implement this as an extra function call on
the interface to return the IPs, as this makes the usage in podman the
easiest. And I also didn't want to break the API, as we only have to fix
this in podman, not buildah.
It is needed to address #22653 but it needs podman changes as well to
use this new function.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This here just logs unnecessary errors in case there is an error during
the Run() call (podman unshare --rootless-netns). runInner() will
already call cleanup on errors if it created a new netns so we only need
to cleanup when there is no error.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Normally this is a non-issue: because we run in an unprivileged userns,
we cannot modify the host mounts in any way. However, in cases
where the rootless netns logic might be executed from a non-userns
context, we might change the mount tree if the mounts are shared, which is
the systemd default. While this should never happen, let's make sure we
never mess up the system by accident in case there are more bugs, and
explicitly make our mount tree private.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We have little to no control over what happens to the slirp4netns/pasta
process after we started it. It could crash or get killed, and then we end
up in a state where we think networking works when it doesn't.
To fix this, each time we access the rootless-netns we should also make
sure the program is still running; if it is not, try to recover by starting
it again. This ensures that we are much more fault tolerant.
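One way to sketch such a check on Linux is to probe the PID with signal 0 and restart on failure. The function names, the restart callback, and the use of an out-of-range PID as a "dead" process are assumptions made for this illustration, not the code from this change:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// processAlive reports whether a process with the given PID exists.
// Signal 0 performs only the existence and permission checks.
func processAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but we may not signal it.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// ensureRunning (hypothetical name) returns the PID to use and whether
// the program had to be restarted via the start callback.
func ensureRunning(pid int, start func() (int, error)) (int, bool, error) {
	if processAlive(pid) {
		return pid, false, nil
	}
	newPid, err := start()
	if err != nil {
		return 0, false, fmt.Errorf("restart netns program: %w", err)
	}
	return newPid, true, nil
}

// demoEnsureRunning checks a live PID (our own) and a PID that is assumed
// not to exist because it is above the Linux pid_max limit.
func demoEnsureRunning() (restartedLive, restartedDead bool, err error) {
	// Stand-in for starting pasta/slirp4netns again.
	start := func() (int, error) { return os.Getpid(), nil }

	if _, restartedLive, err = ensureRunning(os.Getpid(), start); err != nil {
		return false, false, err
	}
	if _, restartedDead, err = ensureRunning(1<<30, start); err != nil {
		return false, false, err
	}
	return restartedLive, restartedDead, nil
}

func main() {
	live, dead, err := demoEnsureRunning()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("restart needed for live process:", live)
	fmt.Println("restart needed for dead process:", dead)
}
```

A PID check alone can be fooled by PID reuse, so a real implementation would likely need an additional identity check.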
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When the netns program fails to configure the netns or we fail for any
other reason during the setup, we must make sure to remove the netns
mount again. Without it the next command sees the existing mount and
thinks the netns was set up correctly, but then later fails during the
custom resolv.conf mount because the resolv.conf source file was never
created.
In the future we should consider adding checks to ensure pasta/slirp4netns
is still running when we access the netns to make it more fault
tolerant.
The reason this is a common problem is that on boot pasta can likely
fail if it was started before the networking was fully configured (i.e.
no default route).
Fixes: containers/podman#22168
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This tells us if ipv6 is supported. And we should use the forward ips as
dns servers to make sure we can read localhost resolvers on the host.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This makes the code for setting up rootless network namespaces
dependent on what the default rootless network provider is, and
allows Pasta to be used for traffic forwarding on the rootless
netns.
This also switches the default rootless network provider to Pasta
Signed-off-by: Matt Heon <mheon@redhat.com>
The cni code is used by freebsd so this package must build for it as
well. Given the logic is linux specific and not called by freebsd just
return an error.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Just pass down the full containers.conf as this is needed by
rootlessnetns code, also remove the now duplicated fields and read the
options directly from the config struct.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Make sure we correctly clean up the netns if there was an error and the
netns was just created. Also make sure the parent dir for the netns is
always created because a previous cleanup() may have deleted it.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
In podman we have code to move a process into a new systemd cgroup. This
code lived in the podman utils package. Because the new rootlessnetns
code must call into that, move this code to c/common.
Instead of dumping this again into a "util" package create a systemd
package which should have a better name. Also move the cgroup code
directly into pkg/cgroup. I am sure we can do some cleanup there in a
followup to prevent duplication.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Add a new rootlessnetns package based on the rootless netns code from
podman. It however makes some significant changes:
- First it uses a directory in the runroot and not tmpdir.
- The netns mount is stored in the directory and not the global netns
  runtime dir to prevent name collisions. The old code used the sha256
  to do that.
- The teardown and setup logic has been made more robust and now uses a
  reference counter to keep track of when to clean up. The podman
  cleanup logic was racy and tied to running podman containers. Given
  the plan to allow buildah to use this as well we need this.
- There is no lock for this code; the goal is to have this called
  through the network interface which is already locked, so there is no
  need for another lock here.
Future work:
- add pasta support
- add port forwarding logic here
Signed-off-by: Paul Holzinger <pholzing@redhat.com>