The system tests can be very I/O intensive, because many of them copy
OCI images from the test suite's image cache directory to its local
containers/storage store, create containers, and then delete everything
to run the next test with a clean slate. This makes them slow.
The runtime environment tests, which include the resource limit tests,
are particularly slow because they don't skip the I/O even when testing
error handling. This makes them a good target for optimizations.
The resource limit tests query the values for different resources from
the same default container without changing its state. Therefore, a lot
of disk I/O can be avoided by creating the default container only once
for all the tests.
This can save as much as 30 minutes.
https://github.com/containers/toolbox/pull/1552
The test suite has expanded to 415 system tests. These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.
Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.
One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.
First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers. The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.
The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled, which makes them slower. Therefore, this group of tests was
further split for Rawhide nodes by the Toolbx images they use. Apart
from reducing the number of tests in each group, this should also reduce
the amount of time spent downloading the images.
The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea6f0
already added a dependency on Bats >= 1.10.0. So, there's nothing to
worry about.
At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
[1] https://zuul-ci.org/docs/zuul/latest/tenants.html
[2] Commit 83f28c52e4
    https://github.com/containers/toolbox/commit/83f28c52e47c2d44
    https://github.com/containers/toolbox/pull/1548
[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[4] https://github.com/bats-core/bats-core/issues/1006

https://github.com/containers/toolbox/pull/1551
Using the word 'containerized' gives the false impression of heightened
security, as if it were a mechanism to run untrusted software in a
sandboxed environment without access to the user's private data (such as
$HOME), hardware peripherals (such as cameras and microphones), etc.
That's not what Toolbx is for.
Toolbx aims to offer an interactive command line environment for
development and troubleshooting the host operating system, without
having to install software on the host. That's all. It makes no
promise about security beyond what's already available in the usual
command line environment on the host that everybody is familiar with.
https://github.com/containers/toolbox/issues/1020
Mention that Toolbx is meant for system administrators to troubleshoot
the host operating system. The word 'debugging' is often used in the
context of software development, and hence most readers might not
interpret it as 'troubleshooting'.
https://github.com/containers/toolbox/pull/1549
Use 'software development' instead of just 'development' when
introducing Toolbx. The additional context makes it more understandable
to the reader.
https://github.com/containers/toolbox/pull/1549
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].
The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it. Instead, it links at
build-time against undefined symbols that are declared through a CUDA
API definition embedded directly in the cgo code or a copy of nvml.h.
It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.
This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.
Compare the output of:
$ nm /path/to/toolbox | grep ' subid_init'
... with those from:
$ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
U nvmlGpuInstanceGetComputeInstanceProfileInfoV
$ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
U nvmlDeviceGetAccountingPids
Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
$ toolbox
toolbox: symbol lookup error: toolbox: undefined symbol:
nvmlGpuInstanceGetComputeInstanceProfileInfoV
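For contrast, here's a minimal cgo sketch of the dlopen(3) and dlsym(3)
pattern that keeps symbols out of the binary's dynamic symbol table.
The soname and the error handling are illustrative, not the exact code
in toolbox(1):

package main

// #cgo LDFLAGS: -ldl
// #include <dlfcn.h>
// #include <stdlib.h>
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    // Load the library at runtime. Nothing in the binary refers to
    // libsubid as an undefined symbol, so '-z now' has nothing extra
    // to resolve when the process starts.
    name := C.CString("libsubid.so.4")
    defer C.free(unsafe.Pointer(name))

    handle := C.dlopen(name, C.RTLD_LAZY)
    if handle == nil {
        fmt.Println("libsubid is unavailable; fall back gracefully")
        return
    }
    defer C.dlclose(handle)

    // Resolve the symbol explicitly, instead of linking against an
    // undefined symbol at build-time like the NVIDIA Container
    // Toolkit stack does.
    symbol := C.CString("subid_init")
    defer C.free(unsafe.Pointer(symbol))

    if addr := C.dlsym(handle, symbol); addr != nil {
        fmt.Println("resolved subid_init at runtime")
    }
}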
With the recent expansion of the test suite, it's necessary to increase
the timeout for the Fedora nodes to prevent the CI from timing out.
Fallout from 6e848b250b
[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    https://github.com/NVIDIA/nvidia-container-toolkit/commit/1407ace94ab7c150
    https://github.com/NVIDIA/go-nvml/issues/18
    https://github.com/NVIDIA/nvidia-container-toolkit/issues/49
[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda
[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

https://github.com/containers/toolbox/pull/1548
Commit 87eaeea6f0 already added a dependency on Bats >= 1.10.0,
which is present on Fedora >= 39. Therefore, it should be exploited
wherever possible to simplify things.
The CI has been frequently timing out on stable Fedora nodes lately.
So, increase the timeout from 1 hour 50 minutes to 2 hours to avoid
that.
For what it's worth, the timeout for Fedora Rawhide nodes is 2 hours 10
minutes, and that seems to be enough.
https://github.com/containers/toolbox/pull/1546
The proprietary NVIDIA driver has a kernel space part and a user space
part, and they must always have the same matching version. Sometimes,
the host operating system might end up with mismatched parts. One
reason could be that the different third-party repositories used to
distribute the driver might be incompatible with each other; e.g., in
the case of Fedora, RPM Fusion and NVIDIA's own repository.
This shows up in the systemd journal as:
$ journalctl --dmesg
...
kernel: NVRM: API mismatch: the client has the version 555.58.02, but
NVRM: this kernel module has the version 560.35.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
...
Without any special handling of this scenario, users would be presented
with a very misleading error:
$ toolbox enter
Error: failed to get Container Device Interface containerEdits for
NVIDIA
Instead, improve the error message to be more self-documenting:
$ toolbox enter
Error: the proprietary NVIDIA driver's kernel and user space don't
match
Check the host operating system and systemd journal.
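A rough sketch of the pattern, assuming a hypothetical sentinel error
(errNVMLMismatch) and a stand-in for the call into the NVIDIA Container
Toolkit (generateCDISpec); the actual detection logic differs:

package main

import (
    "errors"
    "fmt"
    "os"
)

// errNVMLMismatch is a hypothetical sentinel error for the version
// mismatch described above.
var errNVMLMismatch = errors.New("the proprietary NVIDIA driver's kernel and user space don't match")

// generateCDISpec stands in for the call into the NVIDIA Container
// Toolkit that computes the containerEdits and detects the mismatch.
func generateCDISpec() error {
    return errNVMLMismatch
}

func main() {
    if err := generateCDISpec(); err != nil {
        if errors.Is(err, errNVMLMismatch) {
            // Self-documenting message instead of the generic one.
            fmt.Fprintln(os.Stderr, "Error:", err)
            fmt.Fprintln(os.Stderr, "Check the host operating system and systemd journal.")
            os.Exit(1)
        }

        fmt.Fprintln(os.Stderr, "Error: failed to get Container Device Interface containerEdits for NVIDIA")
        os.Exit(1)
    }
}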
https://github.com/containers/toolbox/pull/1541
Note that github.com/NVIDIA/go-nvlib > 0.2.0 isn't API compatible with
github.com/NVIDIA/nvidia-container-toolkit 1.15.0. The next release of
nvidia-container-toolkit is 1.16.0 and it requires go-nvlib 0.6.0.
Therefore, these two Go modules need to be updated together.
The src/go.sum file was updated with 'go mod tidy'.
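The resulting pins in src/go.mod would look roughly like this
(abbreviated to the two modules in question):

require (
    github.com/NVIDIA/go-nvlib v0.6.0
    github.com/NVIDIA/nvidia-container-toolkit v1.16.0
)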
https://github.com/containers/toolbox/pull/1540
When 'toolbox run' is invoked on the host, an exit code of 127 from
'podman exec' means either that the specified command couldn't be found
or that the working directory didn't exist. The only way to tell these
two scenarios apart is to actually look inside the container.
Secondly, Toolbx containers always have an executable toolbox(1) binary
at /usr/bin/toolbox and it's assumed that /usr/bin is always part of the
PATH environment variable.
When 'toolbox run toolbox ...' is invoked, the inner toolbox(1)
invocation will be forwarded back to the host by the Toolbx container's
/usr/bin/toolbox, which is always present as an executable. Hence, if
the outer 'podman exec' on the host fails with an exit code of 127,
then it doesn't mean that the container didn't have a toolbox(1)
executable, but that some subordinate process started by the container's
toolbox(1) failed with that exit code.
Therefore, handle this as a special case to avoid losing the exit code.
Otherwise, it leads to:
$ toolbox run toolbox run non-existent-command
bash: line 1: exec: non-existent-command: not found
Error: command non-existent-command not found in container
fedora-toolbox-40
$ echo "$?"
0
Instead, it will now be:
$ toolbox run toolbox run non-existent-command
bash: line 1: exec: non-existent-command: not found
Error: command non-existent-command not found in container
fedora-toolbox-40
$ echo "$?"
127
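A condensed sketch of the special case, assuming a small exitError type
that carries the code; the names and the control flow are illustrative,
not the exact code in toolbox(1):

package main

import (
    "errors"
    "fmt"
    "os"
    "os/exec"
)

// exitError carries an exit code up the call chain so that main() can
// terminate the process with it.
type exitError struct {
    code int
}

func (e *exitError) Error() string {
    return fmt.Sprintf("exit code %d", e.code)
}

// handleRunError inspects the error from the 'podman exec' that ran
// 'command' inside the container.
func handleRunError(command string, err error) error {
    var execErr *exec.ExitError
    if !errors.As(err, &execErr) {
        return err
    }

    code := execErr.ExitCode()
    if code == 127 {
        // /usr/bin/toolbox always exists inside a Toolbx container,
        // so if the forwarded command was toolbox(1) itself, then the
        // 127 came from a subordinate process. Propagate it instead
        // of claiming that the command wasn't found.
        if command == "toolbox" {
            return &exitError{code: code}
        }

        return fmt.Errorf("command %s not found in container", command)
    }

    return &exitError{code: code}
}

func main() {
    args := []string{"exec", "fedora-toolbox-40", "toolbox", "run", "non-existent-command"}
    if err := exec.Command("podman", args...).Run(); err != nil {
        err = handleRunError("toolbox", err)

        var exitErr *exitError
        if errors.As(err, &exitErr) {
            os.Exit(exitErr.code)
        }

        fmt.Fprintln(os.Stderr, "Error:", err)
        os.Exit(1)
    }
}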
https://github.com/containers/toolbox/issues/957
https://github.com/containers/toolbox/pull/1052
When 'toolbox run' is invoked on the host, an exit code of 126 from
'podman exec' means that the specified command couldn't be invoked
because it's not an executable, e.g., the command was actually a
directory. Note that this doesn't mean that the command couldn't be
found. That's denoted by exit code 127.
Secondly, Toolbx containers always have an executable toolbox(1) binary
at /usr/bin/toolbox and it's assumed that /usr/bin is always part of the
PATH environment variable.
When 'toolbox run toolbox ...' is invoked, the inner toolbox(1)
invocation will be forwarded back to the host by the Toolbx container's
/usr/bin/toolbox, which is always present as an executable. Hence, if
the outer 'podman exec' on the host fails with an exit code of 126,
then it doesn't mean that the container didn't have a working toolbox(1)
executable, but that some subordinate process started by the container's
toolbox(1) failed with that exit code.
Therefore, handle this as a special case to avoid showing an extra error
message. Otherwise, it leads to:
$ toolbox run toolbox run /etc
bash: line 1: /etc: Is a directory
bash: line 1: exec: /etc: cannot execute: Is a directory
Error: failed to invoke command /etc in container fedora-toolbox-40
Error: failed to invoke command toolbox in container fedora-toolbox-40
$ echo "$?"
126
Instead, it will now be:
$ toolbox run toolbox run /etc
bash: line 1: /etc: Is a directory
bash: line 1: exec: /etc: cannot execute: Is a directory
Error: failed to invoke command /etc in container fedora-toolbox-40
$ echo "$?"
126
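The 126 case follows the same shape as the 127 sketch above; the key
difference is that nothing extra is printed. Again, the names are
illustrative:

package main

import (
    "errors"
    "fmt"
    "os"
    "os/exec"
)

func main() {
    err := exec.Command("podman", "exec", "fedora-toolbox-40",
        "toolbox", "run", "/etc").Run()

    var execErr *exec.ExitError
    if errors.As(err, &execErr) && execErr.ExitCode() == 126 {
        // The container's /usr/bin/toolbox is always executable, so a
        // 126 from a forwarded toolbox(1) invocation came from a
        // subordinate process. Keep the exit code, but don't print a
        // second 'failed to invoke command' error message.
        os.Exit(126)
    }

    if err != nil {
        fmt.Fprintln(os.Stderr, "Error: failed to invoke command toolbox in container fedora-toolbox-40")
        os.Exit(1)
    }
}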
https://github.com/containers/toolbox/issues/957
https://github.com/containers/toolbox/pull/1052
The test suite uses its own separate local containers/storage store to
isolate itself from the default store, so that the tests' interactions
with containers and images don't affect anything else. This is done by
using the CONTAINERS_STORAGE_CONF environment variable [1] to specify a
separate storage.conf(5) file [2].
Therefore, when running the test suite, the CONTAINERS_STORAGE_CONF
environment variable must be preserved when forwarding toolbox(1)
invocations inside containers to the host. Otherwise, the initial
toolbox(1) invocation on the host and the forwarded invocation running
on the host won't use the same local containers/storage store.
This problem only impacts test cases that cover toolbox(1) code paths
that invoke podman(1).
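A minimal sketch of the forwarding side, assuming a hypothetical
preservedEnvVars list; the real code and the full list of variables
differ:

package main

import (
    "os"
    "os/exec"
)

// preservedEnvVars is a hypothetical subset of the environment
// variables that get copied into the forwarded invocation on the host;
// CONTAINERS_STORAGE_CONF is the one that matters for the test suite.
var preservedEnvVars = []string{
    "CONTAINERS_STORAGE_CONF",
    "TOOLBX_TEST_SYSTEM_TAGS",
}

func main() {
    // Build the host-side invocation with a minimal environment, and
    // copy over only the preserved variables that are actually set.
    cmd := exec.Command("toolbox", os.Args[1:]...)

    for _, name := range preservedEnvVars {
        if value, ok := os.LookupEnv(name); ok {
            cmd.Env = append(cmd.Env, name+"="+value)
        }
    }

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        os.Exit(1)
    }
}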
[1] https://docs.podman.io/en/latest/markdown/podman.1.html
[2] https://manpages.debian.org/testing/containers-storage/containers-storage.conf.5.en.html

https://github.com/containers/toolbox/issues/957
https://github.com/containers/toolbox/pull/1052
This will make it easier to propagate the exit codes of subordinate
processes through an exitError instance, when toolbox(1) is invoked
inside a container, and the invocation is forwarded to the host.
Cobra doesn't honour the root command's SilenceErrors, if an error
occurred when parsing the command line for a command, even though the
command was found. However, Cobra does honour SilenceErrors, if the
error occurred afterwards.
Therefore, to avoid setting SilenceErrors in each and every command, it
was set in the PersistentPreRunE hook (i.e., preRun), which is called
after all command line parsing has been successfully completed.
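A minimal sketch of the arrangement with Cobra; the command and the
error are illustrative:

package main

import (
    "errors"
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

var rootCmd = &cobra.Command{
    Use:               "toolbox",
    PersistentPreRunE: preRun,
    RunE: func(cmd *cobra.Command, args []string) error {
        // An error that occurs after parsing; Cobra stays silent
        // because preRun already set SilenceErrors.
        return errors.New("something failed after parsing")
    },
    SilenceUsage: true,
}

// preRun only runs after the command line has been parsed
// successfully, so Cobra still reports parse errors itself, while the
// errors that occur later are reported by toolbox exactly once.
func preRun(cmd *cobra.Command, args []string) error {
    cmd.Root().SilenceErrors = true
    return nil
}

func main() {
    if err := rootCmd.Execute(); err != nil {
        fmt.Fprintln(os.Stderr, "Error:", err)
        os.Exit(1)
    }
}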
https://github.com/containers/toolbox/issues/957
It shouldn't be necessary to use the --assumeyes option when creating a
Toolbx container, if the corresponding image is already present in the
local containers/storage image store. It's harmful for the tests to
use the option when it shouldn't be needed, because it's off by default
and most users won't enable it.
Therefore, it's better to test the most common scenario that most users
will encounter.
https://github.com/containers/toolbox/pull/1536