60 Commits

Author SHA1 Message Date
Giuseppe Scrivano b690083685 seccomp: allow fanotify_init without CAP_SYS_ADMIN
Closes: https://github.com/containers/common/issues/2411

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-04-07 12:49:20 +02:00
Dominique Martinet ad947e0c3f seccomp: block syscall()
syscall() emulates all other syscalls, so having this allowed makes no
sense as far as seccomp filters go.

This is a breaking change, but this probably will not break much.

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-06 11:23:33 +09:00
Dominique Martinet 9ce468e30f seccomp: allow perf_event_open if CAP_PERFMON
This doesn't deny anything new (perf_event_open is currently allowed for
SYS_ADMIN)

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-06 11:23:33 +09:00
Dominique Martinet ff0a68d772 seccomp: allow bpf() if CAP_BPF
This does not deny anything new (bpf is currently allowed for sys admin)

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-06 11:23:32 +09:00
Dominique Martinet 61e2251d50 seccomp: deny readdir()
The readdir syscall has not existed for a very long time (it was not
present in the initial Linux 2.6.12 import into git), so remove it and
do not bother adding it to the list of EPERM syscalls.

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-06 11:23:32 +09:00
Dominique Martinet b1cffd1ba1 seccomp: riscv: add riscv_flush_icache
apparently harmless and used

Link: https://github.com/systemd/systemd/pull/25018
Link: https://github.com/containerd/containerd/pull/6882
Link: https://github.com/moby/moby/pull/43553
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-05 13:46:21 +09:00
Dominique Martinet 237eb57a9b seccomp: ppc64le: allow swapcontext
swapcontext seems to be used for coroutines in some languages (at least
ruby), enough to have been added to other major engines by an actual user.

Link: https://github.com/moby/moby/pull/43092
Link: https://github.com/systemd/systemd/pull/9487
Link: https://github.com/containerd/containerd/pull/6411
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-05 12:43:47 +09:00
Dominique Martinet f9fb2eba22 seccomp: explicitly block new (already blocked) syscalls
The following syscalls have been added in recent kernels and considered
for this list:
 - cachestat, prints information about cache misses; it is less accurate
   than userfaultfd so probably safe but deny it until a clear need
   shows up
 - io_pgetevents_time64: io_pgetevents is already blocked, so block this
   variant as well. Note these are pretty close to io_getevents, so we
   should probably block that as well, but since it is currently allowed
   keep that where it is.
 - map_shadow_stack: this allows creating a new shadow stack, required for
   user-space threading if shadow stack verification is enabled (prctl
   PR_SET_SHADOW_STACK_STATUS with PR_SHADOW_STACK_ENABLE); this might
   be required in the future but delay this decision until someone
   requests it
 - futex_* new interface is primarily intended for io_uring which we
   disallow, and does not have any known user yet so likewise block
   until someone requests it.
 - quotactl_fd: this is identical to quotactl, so only allow for
   SYS_ADMIN like quotactl.

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-05 12:43:03 +09:00
Dominique Martinet dbf22d13ae seccomp: remove 'timerfd'
timerfd as a syscall does not seem to have ever existed,
remove it from allowed syscalls list.

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-05 12:43:00 +09:00
Dominique Martinet 85fe468cff seccomp: remove obsolete sigaction family syscalls
These have been replaced by the rt_sigaction family, and have not been
compiled in on most kernels since linux v3.9 (2013)

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
2024-06-05 12:42:45 +09:00
openshift-merge-bot[bot] ce424557dd Merge pull request #1781 from alexandear/fix-typos-across-repo
Fix typos across repo; extend codespell config
2024-01-04 11:12:20 +00:00
Oleksandr Redko 3cc2a76ae9 Fix typos across repo; extend codespell config
Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>
2024-01-03 23:38:47 +02:00
Oleksandr Redko ba4c7c98bb chore: remove outdated build constraints
Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>
2024-01-03 22:56:00 +02:00
Giuseppe Scrivano 850e306b5b seccomp: allow fchmodat2
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2023-12-20 21:12:50 +01:00
Valentin Rothberg e17483b871 bump to golangci-lint v1.50.0
Used `go fmt` rules to migrate away from deprecated functions, for
instance `gofmt -w -s -r 'ioutil.TempDir(a, b) -> os.MkdirTemp(a, b)'`

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-10-17 15:03:07 +02:00
Sascha Grunert 426d69c00f Switch to golang native error wrapping
`github.com/pkg/errors` has been deprecated for quite some time, so we now
use native error wrapping, which is more idiomatic Go.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2022-07-12 10:54:07 +02:00
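
For readers unfamiliar with the change above, native wrapping relies on the `%w` verb of `fmt.Errorf` together with `errors.Is`. The following is only a minimal sketch, not code from the repository; `loadProfile`, `isNotExist`, and the file path are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// loadProfile is a hypothetical helper: it reads a file and, on failure,
// wraps the underlying error with %w so callers can still inspect it.
func loadProfile(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("loading seccomp profile %q: %w", path, err)
	}
	return data, nil
}

// isNotExist reports whether err, or any error it wraps, is os.ErrNotExist.
func isNotExist(err error) bool {
	return errors.Is(err, os.ErrNotExist)
}

func main() {
	// The path is an assumption; any file that does not exist would do.
	_, err := loadProfile("/nonexistent/seccomp.json")
	fmt.Println("error:", err)
	fmt.Println("wraps os.ErrNotExist:", isNotExist(err))
}
```

Because the original error is preserved in the chain, callers no longer need a third-party `Cause` helper to reach it.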
Mickaël Salaün 4ddc450d00 seccomp: Allow Landlock syscalls
Landlock is a Linux feature that enables creating security sandboxes
(see https://docs.kernel.org/userspace-api/landlock.html).  Allow the
three related system calls (available since Linux 5.13):
landlock_create_ruleset, landlock_add_rule, and landlock_restrict_self.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
2022-06-30 14:47:57 +02:00
cdoern 7f76a6b52d use runc cgroup creation logic
switch c/common to use runc cgroup creation so that we can use resource limits

This entails importing the newly refactored runc code to manage reading from and writing to cgroup.
vendoring in directly an unreleased runc commit from opencontainers/runc#3452

Signed-off-by: cdoern <cdoern@redhat.com>
2022-06-07 22:17:40 -04:00
Cosmin Cojocar eee9ad48d0 Add SCMP_ACT_NOTIFY to seccomp actions
Signed-off-by: Cosmin Cojocar <gcojocar@adobe.com>
2022-05-09 11:00:04 +02:00
Paul Holzinger cc110440e4 enable unparam, exportloopref and revive linters
unparam and exportloopref already work without changes.
For revive I had to silence many naming issues. I decided to silence them
instead of changing the name because I didn't want to break any code.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
2022-05-06 13:32:35 +02:00
Kir Kolyshkin b951b72412 Gofumpt the code
gofumpt is a stricter version of gofmt, basically making the code more
readable, and fixing gocritic's octalLiteral warnings like this one:

	pkg/util/util_supported.go:26:17: octalLiteral: use new octal literal style, 0o722 (gocritic)
		return (perm & 0722) == 0700
			       ^

Generated by gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-04-09 16:50:11 -07:00
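
The warning quoted above concerns spelling only: `0722` and `0o722` denote the same number. A small sketch of the flagged expression in the newer style follows; the function name `ownerOnlyWritable` is an assumption made for this example, not the name used in `pkg/util`.

```go
package main

import "fmt"

// ownerOnlyWritable mirrors the expression from the warning: it masks the
// owner rwx bits plus the group and other write bits, and reports whether
// the owner has rwx while neither group nor other may write.
// The name is hypothetical.
func ownerOnlyWritable(perm uint32) bool {
	return (perm & 0o722) == 0o700
}

func main() {
	// Old-style and new-style octal literals are the same value.
	fmt.Println(0722 == 0o722)
	fmt.Println(ownerOnlyWritable(0o755))
	fmt.Println(ownerOnlyWritable(0o775))
}
```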
Daniel J Walsh 41811d83ac Add ptrace as a default seccomp allow to match Docker
Also sort all syscalls in alphabetic order.

Fixes: https://github.com/containers/buildah/issues/3833

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2022-03-22 17:11:15 -04:00
Valentin Rothberg 095aded91c go fmt: use go 1.18 conditional-build syntax
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-03-18 11:04:40 +01:00
Sascha Grunert 6485117310 Allow more syscalls
We manually checked the syscalls from this list and compared it to our
supported ones:
https://github.com/seccomp/libseccomp/blob/main/src/syscalls.csv

This patch adds a bunch of new safe syscalls to be allowed, namely:
membarrier, mount_setattr, process_mrelease, sigaction, signal,
sigpending, sigprocmask, sigsuspend, syscall and timerfd.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2022-03-15 10:09:33 +01:00
Sascha Grunert e2ebb542c8 Add support for seccomp `ListenerPath` and `ListenerMetadata`
We have to copy both fields in the same way we did with the flags to
support them in container runtimes.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2022-02-28 11:37:02 +01:00
Sascha Grunert 941bc06e84 Add support for seccomp filter flags
crun supports seccomp filter flags since fefabffa28
runc will get them with https://github.com/opencontainers/runc/pull/3390
youki will get them with https://github.com/containers/youki/pull/733

To support them generally, we now copy the flags during the seccomp
setup, otherwise they will get lost.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2022-02-23 12:02:13 +01:00
Daniel J Walsh 924a25ff41 Fix make cross
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2021-12-11 07:19:10 -05:00
Giuseppe Scrivano c0d068931f seccomp: accept strings for errno values
simplify maintenance of the seccomp.json file and accept errno as
strings.

It also fixes a portability problem since errno values are arch
dependent.

The existing `DefaultErrnoRet` and `ErrnoRet` are maintained for
backward compatibility but they are obsoleted and will be removed in a
future release.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-11-09 11:41:03 +01:00
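
The portability point in the commit above is that a name such as `ENOSYS` is stable while its numeric value can differ between architectures. One way to resolve names at run time could look like the sketch below. It is not the repository's implementation: `errnoByName` and `resolveErrno` are hypothetical, and the table is deliberately tiny.

```go
package main

import (
	"fmt"
	"syscall"
)

// errnoByName is an illustrative table; a real one would cover many more
// names. The values come from the platform the program is built for.
var errnoByName = map[string]syscall.Errno{
	"EPERM":  syscall.EPERM,
	"ENOSYS": syscall.ENOSYS,
	"EINVAL": syscall.EINVAL,
}

// resolveErrno turns a symbolic errno name into the numeric value for the
// current platform, or returns an error for names it does not know.
func resolveErrno(name string) (uint, error) {
	e, ok := errnoByName[name]
	if !ok {
		return 0, fmt.Errorf("unknown errno name %q", name)
	}
	return uint(e), nil
}

func main() {
	for _, name := range []string{"EPERM", "ENOSYS", "EBOGUS"} {
		v, err := resolveErrno(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s = %d on this platform\n", name, v)
	}
}
```

Keeping the name in the JSON file and resolving it late means the same profile can be shipped for every architecture.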
Giuseppe Scrivano c2495428c7 seccomp: refactor code out
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-11-09 11:40:58 +01:00
Giuseppe Scrivano 639e8c87d0 seccomp: allow memfd_secret
memfd_secret is a new syscall that will be added to Linux 5.14

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-08-25 18:03:37 +02:00
Kir Kolyshkin 701f0ee3b6 pkg/seccomp: avoid DefaultErrnoRet: null
This prevents

	"defaultErrnoRet": null,

from appearing in seccomp.json.

This member is similar to ErrnoRet in type Syscall,
and should also be marked with omitempty.

Fixes: c662eb936b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-23 16:54:29 -07:00
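
The effect of `omitempty` on a pointer field, as described in the commit above, can be reproduced with a few lines. The `profile` struct here is a simplified stand-in assumed for illustration, not the package's real type, and the value 38 is only an example number.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// profile is a simplified, hypothetical stand-in for the seccomp profile
// type. With omitempty, a nil pointer is dropped instead of being
// serialized as null.
type profile struct {
	DefaultAction   string `json:"defaultAction"`
	DefaultErrnoRet *uint  `json:"defaultErrnoRet,omitempty"`
}

// encode marshals p and returns the JSON text.
func encode(p profile) string {
	b, err := json.Marshal(p)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Unset pointer: the key is omitted entirely.
	fmt.Println(encode(profile{DefaultAction: "SCMP_ACT_ERRNO"}))
	// prints {"defaultAction":"SCMP_ACT_ERRNO"}

	// Set pointer: the key appears with its value.
	ret := uint(38) // illustrative value only
	fmt.Println(encode(profile{DefaultAction: "SCMP_ACT_ERRNO", DefaultErrnoRet: &ret}))
	// prints {"defaultAction":"SCMP_ACT_ERRNO","defaultErrnoRet":38}
}
```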
Giuseppe Scrivano feefe26072 seccomp: always allow get_mempolicy, set_mempolicy, mbind
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 13:17:26 +02:00
Giuseppe Scrivano b355656ccc seccomp: let membarrier fail with ENOSYS
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 12:18:01 +02:00
Giuseppe Scrivano b0235eadb1 seccomp: allow rseq
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 12:17:23 +02:00
Giuseppe Scrivano 339f5cbdb9 seccomp: allow pkey_*
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 12:16:41 +02:00
Giuseppe Scrivano 24114130c2 seccomp: let io_uring_* fail with ENOSYS
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 12:15:05 +02:00
Giuseppe Scrivano d4fd05c527 seccomp: allow clone3
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-16 12:14:26 +02:00
Giuseppe Scrivano 526b9a36e7 seccomp: switch default to ENOSYS
add the currently blocked syscalls to a deny-list and switch the
default to ENOSYS.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-14 19:08:07 +02:00
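
The resulting decision order can be modeled roughly as follows. This is a sketch under assumptions: deny-listed syscalls are taken to return EPERM (as the later readdir commit's mention of "the list of EPERM syscalls" suggests), and the syscall names placed in each list by `samplePolicy` are examples only, not the contents of the real profile.

```go
package main

import "fmt"

// verdict is what the filter would do for a given syscall.
type verdict string

const (
	allow  verdict = "ALLOW"
	eperm  verdict = "ERRNO(EPERM)"
	enosys verdict = "ERRNO(ENOSYS)"
)

// policy is a hypothetical model: an allow-list, an explicit deny-list,
// and an ENOSYS default for everything else.
type policy struct {
	allowed map[string]bool
	denied  map[string]bool
}

// decide checks the allow-list first, then the deny-list, and falls back
// to ENOSYS so that unknown syscalls look unimplemented to the caller.
func (p policy) decide(name string) verdict {
	switch {
	case p.allowed[name]:
		return allow
	case p.denied[name]:
		return eperm
	default:
		return enosys
	}
}

// samplePolicy builds a tiny policy; the list membership is illustrative.
func samplePolicy() policy {
	return policy{
		allowed: map[string]bool{"read": true, "write": true},
		denied:  map[string]bool{"kexec_load": true},
	}
}

func main() {
	p := samplePolicy()
	for _, name := range []string{"read", "kexec_load", "some_future_syscall"} {
		fmt.Printf("%-20s %s\n", name, p.decide(name))
	}
}
```

An ENOSYS default lets programs fall back gracefully when they probe for a syscall that the profile does not yet know about, whereas EPERM can be mistaken for a hard failure.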
Giuseppe Scrivano c662eb936b seccomp: add support for defaultErrnoRet
Add support to specify the default errno return value.

The OCI runtime specs already have support for it, and both crun (>=
0.19) and runc (>= 1.0-rc95) have support for it.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-06-14 19:08:06 +02:00
Jan Palus 0583bac499 seccomp: allow timer_settime64
allow time64 variant of timer_settime which was missed in 4405585

Signed-off-by: Jan Palus <jpalus@fastmail.com>
2021-06-14 12:55:55 +02:00
Jan Palus e50fdde382 seccomp: allow more *_time64 syscalls
add missing equivalents of already allowed syscalls for 32-bit platforms
with 64-bit time for countering Y2038

Fixes #593

Signed-off-by: Jan Palus <jpalus@fastmail.com>
2021-06-01 18:05:14 +02:00
Daniel J Walsh a482b92f4a Add setns to default seccomp.json
In order to run containers within containers via podman
and do a podman exec, we need to allow setns syscalls.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2021-04-19 06:21:02 -04:00
Kir Kolyshkin c9a7d176a1 pkg/seccomp: use sync.Once to speed up IsSupported
It does not make sense to check if seccomp is supported by the kernel
more than once per runtime, so let's use sync.Once to speed it up.

A quick benchmark:

BenchmarkIsSupported-4       	  1252161	       947 ns/op
BenchmarkIsSupportedOnce-4   	666274008	      2.14 ns/op

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-24 13:38:18 -07:00
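
The caching pattern behind the benchmark above can be sketched as below. `probeKernel` is a hypothetical stand-in for the real kernel check, and `probeCalls` exists only to make the single execution visible.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	supportedOnce sync.Once
	supported     bool
	probeCalls    int // counts how often the expensive probe actually runs
)

// probeKernel is a hypothetical placeholder for the real support check.
func probeKernel() bool {
	probeCalls++
	return true
}

// isSupportedOnce runs probeKernel at most once per process and returns
// the cached result on every later call.
func isSupportedOnce() bool {
	supportedOnce.Do(func() {
		supported = probeKernel()
	})
	return supported
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(isSupportedOnce())
	}
	fmt.Println("probe calls:", probeCalls)
}
```

`sync.Once` also makes the first call safe when several goroutines race to ask the question at the same time.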
Kir Kolyshkin 2f8a504f7c pkg/seccomp: simplify IsSupported
Current implementation of seccomp.IsSupported (rooted in runc) is not
very good.

First, it parses the whole /proc/self/status, adding each key: value
pair into the map (lots of allocations and future work for garbage
collector), while using only a single key from that map.

Second, the presence of "Seccomp" key in /proc/self/status merely means
that kernel option CONFIG_SECCOMP is set, but there is a need to _also_
check for CONFIG_SECCOMP_FILTER (the code for which exists but is never
executed in case /proc/self/status has the Seccomp key).

Replace all this with a single call to prctl; see the long comment in
the code for details.

NOTE historically, parsing /proc/self/status was added after a concern
was raised in https://github.com/opencontainers/runc/pull/471 that
prctl(PR_GET_SECCOMP, ...) can result in the calling process being
killed with SIGKILL. This is a valid concern, so the new code here
does not use PR_GET_SECCOMP at all.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-23 17:58:36 -07:00
Aleksa Sarai 1478f9331d seccomp: update profile to Linux 5.11 list
This mirrors the Docker and containerd changes, with the caveat that
because mount(2) is permitted under podman for all containers we
therefore add all of the v2 mount API syscalls as available to all
containers.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-27 21:40:48 +11:00
Aleksa Sarai 1195c8bb0b seccomp: re-add generation script
The generate.go script used to fill the default seccomp profile file is
quite important as otherwise distributions will end up having outdated
seccomp filters even after a podman update.

This script comes from the Docker repo.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-27 21:40:42 +11:00
Aleksa Sarai 624d0aa703 seccomp: deduplicate default profile
Several syscalls were enabled globally (SCMP_ACT_ALLOW without any
conditions for all containers), but also had conditional rules later in
the profile (likely inherited from Docker). The following syscalls do
not need special casing because they were globally enabled:

 * clone, unshare, mount, umount, umount2 all had special CAP_SYS_ADMIN
   restrictions but those don't make sense since they were also enabled
   for all containers.
 * reboot was permitted for CAP_SYS_BOOT and all containers.
 * name_to_handle_at was permitted for CAP_SYS_ADMIN, CAP_SYS_NICE(?),
   and all containers.

And certain syscalls had globally-enabled rules when they shouldn't
have:

 * socket has special rules for CAP_AUDIT_WRITE but it also had a global
   "allow unconditionally" rule. It turns out that libseccomp will
   override unconditional rules with conditional ones but this is
   somewhat of an implementation detail and it's much safer to remove
   the rule and use the existing cases.

Now the only syscalls remaining with complicated rules (meaning they
appear more than once in the profile) are:

 * sync_file_range2 which is architecture specific (though in principle
   we could move it to enabled-without-rules because runc ignores
   unknown syscalls).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-27 21:39:54 +11:00
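
A check for the kind of duplication described above, where a syscall appears in more than one rule, might be sketched like this. The `rule` type and the sample data are assumptions for illustration and do not reflect the real profile schema.

```go
package main

import (
	"fmt"
	"sort"
)

// rule is a hypothetical, simplified profile entry.
type rule struct {
	Names  []string
	Action string
	Caps   []string // capabilities gating the rule; empty means unconditional
}

// duplicated returns, in sorted order, every syscall name that is listed
// more than once across the given rules.
func duplicated(rules []rule) []string {
	count := map[string]int{}
	for _, r := range rules {
		for _, n := range r.Names {
			count[n]++
		}
	}
	var out []string
	for n, c := range count {
		if c > 1 {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Sample data loosely modeled on the cases named in the commit message.
	rules := []rule{
		{Names: []string{"clone", "mount", "reboot"}, Action: "SCMP_ACT_ALLOW"},
		{Names: []string{"mount", "umount2"}, Action: "SCMP_ACT_ALLOW", Caps: []string{"CAP_SYS_ADMIN"}},
		{Names: []string{"reboot"}, Action: "SCMP_ACT_ALLOW", Caps: []string{"CAP_SYS_BOOT"}},
	}
	fmt.Println(duplicated(rules))
}
```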
Giuseppe Scrivano 10e862731c seccomp: drop 'vmsplice' from the allowed list
More details: https://lore.kernel.org/linux-mm/X+PoXCizo392PBX7@redhat.com/

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-01-08 13:43:54 +01:00
Daniel J Walsh 70d93c6deb Fix building on non linux platforms
Currently this code is not building correctly on darwin builds.
This PR handles non linux platforms correctly.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2020-12-21 13:33:12 -05:00
Daniel J Walsh 297a9ab8d6 Add pidfd_open syscall by default
This syscall will actually allow processes to be more secure, so it should be
allowed by default.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2020-12-15 05:46:02 -05:00