mqueue can not be mounted on the host os and then shared into the container.
There is only one mqueue per mount namespace, so current code ends up leaking
the /dev/mqueue from the host into ALL containers. Since SELinux changes the
label of the mqueue, only the last container is able to use the mqueue, all
other containers will get a permission denied. If you don't have SELinux protections
sharing of the /dev/mqueue allows one container to interact in potentially hostile
ways with other containers.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
We cannot rely on the tar command for this type of operation because tar
versions, flags, and functionality can very from distro to distro.
Since this is in the container execution path it is not safe to have
this as a dependency from dockers POV where the user cannot change the
fact that docker is adding these pre and post mount commands.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In the default seccomp rule, allow use of 32 bit syscalls on
64 bit architectures, so you can run x86 Linux images on x86_64
without disabling seccomp or using a custom rule.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
Being able to obtain a file handle is no use as we cannot perform
any operation in it, and it may leak kernel state.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
This change is done so that driver_unsupported.go and driver_unsupported_nocgo.go
declare the same signature for NewDriver as driver.go.
Fixes#19032
Signed-off-by: Lukas Waslowski <cr7pt0gr4ph7@gmail.com>
This can be allowed because it should only restrict more per the seccomp docs, and multiple apps use it today.
Signed-off-by: Jessica Frazelle <acidburn@docker.com>
Block kcmp, procees_vm_readv, process_vm_writev.
All these require CAP_PTRACE, and are only used for ptrace related
actions, so are not useful as we block ptrace.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The bpf syscall can load code into the kernel which may
persist beyond container lifecycle. Requires CAP_SYS_ADMIN
already.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
These provide an in kernel virtual machine for x86 real mode on x86
used by one very early DOS emulator. Not required for any normal use.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The stime syscall is a legacy syscall on some architectures
to set the clock, should be blocked as time is not namespaced.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
clock_adjtime is the new posix style version of adjtime allowing
a specific clock to be specified. Time is not namespaced, so do
not allow.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
This is a new version of init_module that takes a file descriptor
rather than a file name.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The set_robust_list syscall sets the list of futexes which are
cleaned up on thread exit, and are needed to avoid mutexes
being held forever on thread exit.
See for example in Musl libc mutex handling:
http://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_mutex_trylock.c#n22
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>