syscall() emulates all other syscalls, so having this allowed makes no
sense as far as seccomp filters go.
This is a breaking change, but it probably will not break much.
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
This doesn't deny anything new (perf_event_open is currently allowed for
SYS_ADMIN)
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
The readdir syscall has not existed for a very long time (it was already
absent from the Linux 2.6.12 initial import into git). Remove it, and
don't bother adding it to the list of EPERM syscalls.
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
The following syscalls have been added in recent kernels and considered
for this list:
- cachestat, prints information about cache misses; it is less accurate
than userfaultfd so probably safe but deny it until a clear need
shows up
- io_pgetevents_time64: io_pgetevents is already blocked, so block this
variant as well. Note these are pretty close to io_getevents, so we
should probably block that as well, but since it is currently allowed,
keep it where it is.
- map_shadow_stack: this allows creating a new shadow stack, required for
user-space threading if shadow stack verification is enabled (prctl
PR_SET_SHADOW_STACK_STATUS with PR_SHADOW_STACK_ENABLE); this might
be required in the future but delay this decision until someone
requests it
- futex_* new interface is primarily intended for io_uring which we
disallow, and does not have any known user yet so likewise block
until someone requests it.
- quotactl_fd: this is identical to quotactl, so only allow for
SYS_ADMIN like quotactl.
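
As a rough illustration of the capability gating described above, the
sketch below evaluates rules against a container's capability set. The
rule layout (names, action, includes/excludes caps) and the helpers
rule_applies and action_for are assumptions made for this example, not
code from this change.

```python
# Hypothetical rules, modelled loosely on a seccomp.json profile.
RULES = [
    {
        "names": ["quotactl", "quotactl_fd"],
        "action": "SCMP_ACT_ALLOW",
        "includes": {"caps": ["CAP_SYS_ADMIN"]},
    },
    {
        "names": ["cachestat", "map_shadow_stack", "io_pgetevents_time64"],
        "action": "SCMP_ACT_ERRNO",
        "errno": "EPERM",
    },
]


def rule_applies(rule, caps):
    """Return True if the rule is active for a container holding `caps`."""
    required = set(rule.get("includes", {}).get("caps", []))
    forbidden = set(rule.get("excludes", {}).get("caps", []))
    # Assumption: every included cap must be held, no excluded cap may be.
    return required <= caps and not (forbidden & caps)


def action_for(syscall, caps, default="SCMP_ACT_ERRNO"):
    """Return the action of the first active rule naming `syscall`."""
    for rule in RULES:
        if syscall in rule["names"] and rule_applies(rule, caps):
            return rule["action"]
    return default
```

With these rules, quotactl_fd is only allowed when CAP_SYS_ADMIN is
held, matching the treatment of quotactl.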
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
timerfd as a syscall does not seem to have ever existed,
remove it from allowed syscalls list.
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
These have been replaced by the rt_sigaction family, and have not been
compiled in on most kernels since linux v3.9 (2013)
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
Landlock is a Linux feature that enables creating security sandboxes
(see https://docs.kernel.org/userspace-api/landlock.html). Allow the
three related system calls (available since Linux 5.13):
landlock_create_ruleset, landlock_add_rule, and landlock_restrict_self.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
We manually checked the syscalls from this list and compared them to our
supported ones:
https://github.com/seccomp/libseccomp/blob/main/src/syscalls.csv
This patch adds a bunch of new safe syscalls to be allowed, namely:
membarrier, mount_setattr, process_mrelease, sigaction, signal,
sigpending, sigprocmask, sigsuspend, syscall and timerfd.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Simplify maintenance of the seccomp.json file and accept errno values
as strings.
This also fixes a portability problem, since errno values are
architecture dependent.
The existing `DefaultErrnoRet` and `ErrnoRet` are maintained for
backward compatibility but they are obsoleted and will be removed in a
future release.
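
A minimal sketch of the idea, assuming a rule carries either a symbolic
"errno" field or a legacy numeric "errnoRet" field. Both JSON key names
and the helper functions are illustrative assumptions, not the actual
implementation.

```python
import errno


def resolve_errno(value):
    """Map an errno given as a name ("EPERM") or a number to an int."""
    if isinstance(value, int):
        return value  # legacy numeric form, taken as-is
    number = getattr(errno, value, None)
    if not isinstance(number, int):
        raise ValueError(f"unknown errno name: {value!r}")
    return number


def rule_errno(rule):
    """Prefer the string field; fall back to the legacy numeric one."""
    if "errno" in rule:
        return resolve_errno(rule["errno"])
    return resolve_errno(rule["errnoRet"])
```

Resolving the name on the target system avoids hard-coding a number
that may differ between architectures.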
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Add the missing 64-bit-time equivalents of already allowed syscalls on
32-bit platforms, to address the Y2038 problem.
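
One way to spot such gaps is sketched below. The mapping is a partial,
hand-picked subset of the time64 variants, and missing_time64 is a
hypothetical helper for illustration only.

```python
# Partial mapping from classic syscalls to their 64-bit-time variants.
TIME64_VARIANTS = {
    "clock_gettime": "clock_gettime64",
    "clock_settime": "clock_settime64",
    "clock_nanosleep": "clock_nanosleep_time64",
    "futex": "futex_time64",
    "ppoll": "ppoll_time64",
    "utimensat": "utimensat_time64",
}


def missing_time64(allowed):
    """List time64 variants whose classic form is allowed but which are not."""
    allowed = set(allowed)
    return sorted(
        variant
        for base, variant in TIME64_VARIANTS.items()
        if base in allowed and variant not in allowed
    )
```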
Fixes #593
Signed-off-by: Jan Palus <jpalus@fastmail.com>
In order to run containers within containers via podman
and do a podman exec, we need to allow setns syscalls.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This mirrors the Docker and containerd changes, with the caveat that
because mount(2) is permitted under podman for all containers we
therefore add all of the v2 mount API syscalls as available to all
containers.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Several syscalls were enabled globally (SCMP_ACT_ALLOW without any
conditions for all containers), but also had conditional rules later in
the profile (likely inherited from Docker). The following syscalls do
not need special casing because they were globally enabled:
* clone, unshare, mount, umount, umount2 all had special CAP_SYS_ADMIN
restrictions but those don't make sense since they were also enabled
for all containers.
* reboot was permitted for CAP_SYS_BOOT and all containers.
* name_to_handle_at was permitted for CAP_SYS_ADMIN, CAP_SYS_NICE(?),
and all containers.
And certain syscalls had globally-enabled rules when they shouldn't
have:
* socket has special rules for CAP_AUDIT_WRITE but it also had a global
"allow unconditionally" rule. It turns out that libseccomp will
override unconditional rules with conditional ones but this is
somewhat of an implementation detail and it's much safer to remove
the rule and use the existing cases.
Now the only syscalls remaining with complicated rules (meaning they
appear more than once in the profile) are:
* sync_file_range2 which is architecture specific (though in principle
we could move it to enabled-without-rules because runc ignores
unknown syscalls).
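
The "appears more than once in the profile" check can be sketched as
follows, assuming the profile is a dict with a "syscalls" list whose
entries each carry a "names" list. That layout and the helper
repeated_syscalls are assumptions for this example.

```python
from collections import Counter


def repeated_syscalls(profile):
    """Return {name: count} for syscalls named by more than one rule."""
    counts = Counter(
        name
        for rule in profile.get("syscalls", [])
        # set() so a name listed twice in one rule counts only once
        for name in set(rule.get("names", []))
    )
    return {name: n for name, n in counts.items() if n > 1}
```

Running this over a profile before and after the cleanup gives a quick
view of which syscalls still have more than one rule.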
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Since secrets is shared by buildah, podman and cri-o, we need
to move it to containers/common.
Also move containers-mounts.conf.5.md to common from podman,
since this is common to all packages.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
12 new syscalls have been added for handling 64 bit time.
These syscalls are breaking containers on newer kernels.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This syscall is proposed for the kernel but does not exist yet. Having it in
the default syscall table is causing crun to print warning messages.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
faccessat2, openat2, fchmodat2 are all new syscalls to help eliminate
race conditions. Current containers get the older versions of these syscalls,
so adding the new ones by default makes sense.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Add the following default syscalls:
"clock_adjtime" -- Already allow adjtimex
"clone" -- Needed so we can use a user namespace within a container.
Since this is allowed for non-root users, it should be safe
to use, and can allow us to support containers/user namespaces
within locked down containers.
"pivot_root" -- Can be used by containers within containers
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>