The toolbox(1) binary always relies on the PATH environment variable to
find the podman(1) and skopeo(1) binaries. There's no way to override
those with the PODMAN and SKOPEO environment variables, which only
affect direct uses of podman(1) and skopeo(1) within the test suite.
Therefore, offering the PODMAN and SKOPEO environment variables in their
current form is needlessly confusing and misleading, and can lead to
surprises arising from different podman(1) and skopeo(1) binaries being
used in different places. Either toolbox(1) should also honour them or
the test suite shouldn't offer them. The former is more complicated
without any obvious need for it, so the latter was chosen.
https://github.com/containers/toolbox/pull/1592
The test suite has expanded to 415 system tests. These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local container/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.
Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.
One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.
First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers. The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.
The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled. Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use. Apart from reducing the number of
tests in each group, this should also reduce the amount of time spent
in downloading the images.
The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea6f0
already added a dependency on Bats >= 1.10.0. So, there's nothing to
worry about.
At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
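A minimal sketch of how setup_suite() could consult such a variable is
given below. It assumes that TOOLBX_TEST_SYSTEM_TAGS holds a
comma-separated list of tags. The has_tag() helper and the tag names
are hypothetical and only meant for illustration.

```shell
# Hypothetical helper: succeeds if the comma-separated list in $1
# contains the tag given in $2 as a whole element.
has_tag() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: fall back to an illustrative tag if the variable is unset.
TOOLBX_TEST_SYSTEM_TAGS="${TOOLBX_TEST_SYSTEM_TAGS:-commands-options}"

if has_tag "$TOOLBX_TEST_SYSTEM_TAGS" "runtime-environment"; then
    echo "would cache the images used by the runtime environment tests"
else
    echo "skipping images used only by the runtime environment tests"
fi
```

Wrapping both the list and the tag in commas avoids matching a tag
that's merely a substring of another one.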
[1] https://zuul-ci.org/docs/zuul/latest/tenants.html
[2] Commit 83f28c52e4
    https://github.com/containers/toolbox/commit/83f28c52e47c2d44
    https://github.com/containers/toolbox/pull/1548
[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[4] https://github.com/bats-core/bats-core/issues/1006
https://github.com/containers/toolbox/pull/1551
It's far more consistent and understandable if all tests start with a
clean state without any containers or images present. Otherwise, the
subtle side-effects of having some image left behind from a previous
test can lead to surprises, and there's no need to spend time wondering
whether some tests should only clean up the containers or both
containers and images.
This additional work of cleaning up the images for all tests makes it
necessary to increase the timeout for all Fedora nodes to prevent the CI
from timing out.
https://github.com/containers/toolbox/pull/1526
This is meant to make the project more searchable on the Internet. More
and more people have been pointing out that "toolbox" is terribly
difficult to search for, and it's impossible to find any decent
Internet real estate by that name.
Some exceptions:
* The code repository is still https://github.com/containers/toolbox.
It will be renamed after giving a heads-up to other contributors.
* The name of the binary is still 'toolbox'. The name is embedded
into existing Toolbx containers as their entry point, which is bind
mounted from the host operating system when the containers are
started. Trivially renaming the binary will prevent these
containers from starting.
* For similar reasons, the TOOLBOX_PATH environment variable is still
the same.
* For similar reasons, the profile.d file to be read by the shell on
start-up is still called toolbox.sh.
* The label used to identify Toolbx containers and images is still
called com.github.containers.toolbox. There are many existing
Toolbx containers, and many Toolbx images beyond the control of the
Toolbx project that use this label to identify themselves. Simply
renaming the label will prevent these containers and images from
being recognized.
* The names of the built-in Toolbx images still retain the word
'toolbox'. Images under the new name need to be published on the
OCI registries and the toolbox(1) binary needs to be taught to
handle both old and new names, wherever necessary, for backwards
compatibility.
* The stamp file used to identify Toolbx containers is still called
/run/.toolboxenv because it's used by various external programs and
users to identify Toolbx containers.
* The OSC 777 escape sequence to track and preserve the user's current
Toolbx container [1] still emits 'toolbox' as the name of the
container runtime. Changing the escape sequence can break terminal
emulation applications, like Prompt [2], that consume it. Hence, it
needs to be done carefully.
* The runtime directories at /run/toolbox, when used as root, and
$XDG_RUNTIME_DIR/toolbox, when used rootless, weren't renamed.
When used as root, /run/toolbox is embedded into existing Toolbx
containers as a bind mount from the host. Trivially renaming the
path will prevent these containers from starting.
Secondly, both these paths are used to synchronize container
start-up. If the paths are trivially renamed, and the toolbox(1)
binary is updated and used without stopping all existing containers,
then it won't be able to enter containers that were already started.
Strictly speaking, this scenario isn't supported, since updates are
always expected to be "offline" [3]. However, it's worth noting
because solving the previous problem might also address this.
* The configuration file for RPM is still called
/usr/lib/rpm/macros.d/macros.toolbox.
[1] https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/17
[2] https://gitlab.gnome.org/chergert/prompt
[3] https://www.freedesktop.org/software/systemd/man/latest/systemd.offline-updates.html
https://github.com/containers/toolbox/issues/1399
Currently, if 'skopeo copy ...' fails to download and cache an OCI image
during setup_suite(), the test suite doesn't immediately fail, but
continues. It only fails later when trying to set up the Docker
registry and contains a lot of noise:
not ok 1 setup_suite
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# from function `_setup_docker_registry' in file
test/system/libs/helpers.bash, line 211,
# from function `setup_suite' in test file
test/system/setup_suite.bash, line 59)
# `_setup_docker_registry' failed
# Failed to cache image registry.fedoraproject.org/fedora-toolbox:38
to /tmp/bats-run-GyTP7A/image-cache/fedora-toolbox-38
#
# -- command failed --
# status : 1
# output : time="2023-09-25T12:19:52+02:00" level=fatal
msg="initializing source
docker://registry.fedoraproject.org/fedora-toolbox:38-foo:
reading manifest 38-foo in
registry.fedoraproject.org/fedora-toolbox: manifest unknown"
# --
#
# Failed to cache image quay.io/toolbx/arch-toolbox:latest to
/tmp/bats-run-GyTP7A/image-cache/arch-toolbox-latest
#
# -- command failed --
# status : 1
# output : time="2023-09-25T12:20:48+02:00" level=fatal
msg="initializing source
docker://quay.io/toolbx/arch-toolbox:latest-foo: reading
manifest latest-foo in quay.io/toolbx/arch-toolbox: manifest
unknown"
# --
#
# Failed to cache image registry.fedoraproject.org/fedora-toolbox:34
to /tmp/bats-run-GyTP7A/image-cache/fedora-toolbox-34
#
# -- command failed --
# status : 1
# output : time="2023-09-25T12:21:42+02:00" level=fatal
msg="initializing source
docker://registry.fedoraproject.org/fedora-toolbox:34-foo:
reading manifest 34-foo in
registry.fedoraproject.org/fedora-toolbox: manifest unknown"
# --
#
# ...
#
# -- command failed --
# status : 1
# output : time="2023-09-25T12:26:33+02:00" level=fatal
msg="determining manifest MIME type for
dir:/tmp/bats-run-GyTP7A/image-cache/fedora-toolbox-34: open
/tmp/bats-run-GyTP7A/image-cache/fedora-toolbox-34/manifest.json:
no such file or directory"
# --
#
# docker-registry
# 27fa141e291e64e4c7a148c88ddab219ff2bfb5802a2982dc4188dc11f41692d
# Untagged: quay.io/toolbox_tests/registry:latest
# Deleted: fea5a12cde107bb407bc44ede6dd9edea1d2b4171cd8e52b0cb330bf45e517e1
It makes it look as if the root cause of the failure is related to
setting up the Docker registry, which it isn't, and all that noise makes
it difficult to spot the actual problem.
Instead, from now on, it will be more obvious:
not ok 1 setup_suite
# (from function `setup_suite' in test file
test/system/setup_suite.bash, line 44)
# `_pull_and_cache_distro_image "$system_id" "$system_version" ||
false' failed
# Failed to cache image registry.fedoraproject.org/fedora-toolbox:38
to /tmp/bats-run-62b8CU/image-cache/fedora-toolbox-38
# time="2023-09-25T13:55:42+02:00" level=fatal msg="initializing
source docker://registry.fedoraproject.org/fedora-toolbox:38-foo:
reading manifest 38-foo in
registry.fedoraproject.org/fedora-toolbox: manifest unknown"
Note that Bats' 'run' helper [1] isn't designed to work inside
setup_suite(). eg., 'run --separate-stderr' doesn't work because
BATS_TEST_TMPDIR isn't defined.
[1] https://bats-core.readthedocs.io/en/stable/writing-tests.html
https://github.com/containers/toolbox/pull/1377
If setup_suite() fails for some reason, then an unrelated message from
'podman system reset' would show up:
not ok 1 setup_suite
# (from function `setup_suite' in test file
test/system/setup_suite.bash, line 43)
# `_pull_and_cache_distro_image foo || false' failed
# Requested distro (foo) does not have a matching image
# A "/home/rishi/.cache/toolbox/system-test-storage/storage.conf"
config file exists.
# Remove this file if you did not modify the configuration.
This extra error message from 'podman system reset' serves no purpose
because it's not related to the cause of the setup_suite() failure.
It's just noise and it's better to silence it.
https://github.com/containers/toolbox/pull/1375
If setup_suite() fails for some reason, causing the Docker registry to
not be created, then an unrelated message from 'podman stop' would show
up:
not ok 1 setup_suite
# (from function `setup_suite' in test file
test/system/setup_suite.bash, line 43)
# `_pull_and_cache_distro_image foo || false' failed
# Requested distro (foo) does not have a matching image
# Error: no container with name or ID "docker-registry" found: no such
container
# ...
# ...
This extra error message from 'podman stop' serves no purpose because
it's not related to the cause of the setup_suite() failure. It's just
noise and it's better to silence it.
https://github.com/containers/toolbox/pull/1375
Contrary to what the documentation might seem to imply [1], Bats' 'fail'
helper only aborts a test case under certain circumstances. eg., it
aborts when called directly from setup_suite(), but not from within a
child function, and when called from a @test case, but not from within
the 'run' helper.
If 'fail' is called from within 'run', then the code after it will
continue to execute. The test case will only fail if 'run' eventually
records a non-zero exit code that's then caught by 'assert_success' [2].
Similarly, it doesn't abort if called from within a child function in
setup_suite().
Currently, _pull_and_cache_distro_image() is a child function called
from setup_suite(). So 'fail' won't abort if an invalid distribution is
specified.
Fortunately, pull_distro_image() is being called from within @test
cases, but outside 'run'. So, there's no problem with it now. However,
some future code changes can unknowingly alter this reality and it too
can run into unexpected behaviour.
Therefore, it's better to be safe, and explicitly specify a non-zero
exit code after 'fail'. It will ensure that it works as expected under
all circumstances.
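The difference can be sketched outside Bats with simplified stand-ins.
Here, fail_stub() only imitates the relevant behaviour of 'fail', which
is to print a message and return non-zero, and without_errexit()
imitates a context where a failing command doesn't abort execution.
All the names below are hypothetical, and the use of 'return 1' is an
assumption about how the explicit exit code is specified.

```shell
# Simplified stand-in for Bats' 'fail' (assumption): print the message
# to stderr and return non-zero, without exiting the shell.
fail_stub() {
    echo "$@" >&2
    return 1
}

# Runs a command in a subshell with errexit disabled, so that a failing
# command doesn't abort whatever comes after it.
without_errexit() (
    set +e
    "$@"
)

# Relies on fail_stub() alone: execution continues past the failure.
pull_unguarded() {
    if [ "$1" != "fedora" ]; then
        fail_stub "Requested distro ($1) does not have a matching image"
    fi
    echo "pulling $1"
}

# Adds an explicit non-zero return right after the failure.
pull_guarded() {
    if [ "$1" != "fedora" ]; then
        fail_stub "Requested distro ($1) does not have a matching image"
        return 1
    fi
    echo "pulling $1"
}
```

With these definitions, 'without_errexit pull_unguarded foo' prints the
error and still goes on to print 'pulling foo', whereas
'without_errexit pull_guarded foo' stops after the error and returns 1.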
[1] https://github.com/bats-core/bats-support
[2] https://github.com/bats-core/bats-assert
https://github.com/containers/toolbox/pull/1375
Currently, if a Toolbx container's entry point fails to initialize the
container, there's no way to see the debug logs and error messages from
the entry point:
not ok 106 container: Check container starts without issues
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# in test file test/system/103-container.bats, line 39)
# `assert_success' failed
#
# -- command failed --
# status : 1
# output :
# --
#
Instead, from now on, they will be visible:
not ok 106 container: Check container starts without issues
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# in test file test/system/103-container.bats, line 39)
# `assert_success' failed
#
# -- command failed --
# status : 1
# output (90 lines):
# Failed to initialize container fedora-toolbox-38
# level=debug msg="Running as real user ID 0"
# level=debug msg="Resolved absolute path to the executable as
/usr/bin/toolbox"
# level=debug msg="TOOLBOX_PATH is /opt/bin/toolbox"
# level=debug msg="Migrating to newer Podman"
# level=debug msg="Migration not needed: running inside a container"
# level=debug msg="Setting up configuration"
# ...
# --
#
https://github.com/containers/toolbox/pull/1374
Bats' 'run' helper returns with an exit code of 0 even when the command
that it was given to run failed with a non-zero exit code [1]. This is
to enable making further assertions about the command after 'run' has
finished. If there's nothing that checks for failures, then it will
continue as if everything is alright.
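This behaviour can be sketched with a simplified stand-in for 'run'.
The fake_run(), fetch_logs() and check_logs() functions are
hypothetical, and the real helper does a lot more than this.

```shell
# Simplified stand-in for Bats' 'run' (assumption): record the
# command's exit code and output, but always return 0 itself.
fake_run() {
    status=0
    output=$("$@" 2>&1) || status=$?
    return 0
}

# Hypothetical command that fails with a non-zero exit code.
fetch_logs() {
    echo "Error: no such container" >&2
    return 125
}

check_logs() {
    fake_run fetch_logs
    # Without this check, execution would carry on as if nothing had
    # gone wrong, because fake_run() itself returned 0.
    if [ "$status" -ne 0 ]; then
        echo "Failed to fetch logs"
        echo "$output"
        return "$status"
    fi
    echo "logs fetched"
}
```

The explicit check of 'status' is what turns the recorded failure into
an actual one.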
Therefore, currently, if 'podman logs' fails, there's no indication of
it and the test only fails later because it thinks that the container
failed to initialize:
not ok 106 container: Check container starts without issues
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# in test file test/system/103-container.bats, line 39)
# `assert_success' failed
#
# -- command failed --
# status : 1
# output :
# --
#
Instead, from now on, it will be more obvious:
not ok 106 container: Check container starts without issues
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# in test file test/system/103-container.bats, line 39)
# `assert_success' failed
#
# -- command failed --
# status : 125
# output (2 lines):
# Failed to invoke '/usr/bin/podman logs'
# Error: no container with name or ID "foo" found: no such container
# --
#
One alternative was to use 'assert_success' [2] to assert that the
command given to 'run' succeeded. That would show the 'podman logs'
failure as:
not ok 106 container: Check container starts without issues
# (from function `assert_success' in file
test/system/libs/bats-assert/src/assert.bash, line 114,
# in test file test/system/103-container.bats, line 39)
# `assert_success' failed
#
# -- command failed --
# status : 1
# output (29 lines):
#
# -- command failed --
# status : 125
# output : Error: no container with name or ID "foo" found: no such
container
# --
#
# ...
#
# -- command failed --
# status : 125
# output : Error: no container with name or ID "foo" found: no such
container
# --
# --
#
However, it's a bit too noisy because 'assert_success' doesn't
terminate container_started(), which continues to loop for the
remaining attempts.
[1] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[2] https://github.com/bats-core/bats-assert
https://github.com/containers/toolbox/pull/1372
A subsequent commit will use this variable to set the return value for a
different condition. Therefore, the name needs to be changed to suit
the purpose.
https://github.com/containers/toolbox/pull/1372
Until Bats 1.10.0, 'run' with options had a bug where it would overwrite
the value of the 'i' variable even outside 'run' [1].
In these particular instances, no options are being passed to 'run',
and, hence, currently there's no problem. However, in case a future
commit adds an option, then it could lead to hard-to-debug problems.
eg., --separate-stderr sets 'i' to 1, --show-output-of-passing-tests
sets it to 2, etc. Therefore, depending on the flag and the loop, the
loop might get terminated prematurely or run infinitely or something
else.
Moreover, Bats 1.10.0 is only available in Fedora >= 39 and is absent
from Fedoras 37 and 38. Therefore, it's not possible to consider this
bug fixed.
Hence, it's better to preemptively work around it to avoid any future
issues.
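One possible way to sidestep the problem, which isn't necessarily the
one taken here, is to not use 'i' as the loop counter. In the sketch
below, buggy_run() is a hypothetical stand-in that imitates the bug by
clobbering the caller's 'i'.

```shell
# Hypothetical stand-in that imitates the bug: it overwrites the
# caller's 'i' before running the given command.
buggy_run() {
    i=1
    "$@"
}

# Loops five times with 'j' as the counter, so that the clobbering of
# 'i' by buggy_run() can't affect the loop.
count_attempts() {
    attempts=0
    j=0
    while [ "$j" -lt 5 ]; do
        buggy_run true
        attempts=$((attempts + 1))
        j=$((j + 1))
    done
    echo "$attempts"
}
```

Had the loop used 'i' instead, every iteration would have reset the
counter to 1 and the loop would never have ended.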
[1] Bats commit 502dc47dd063c187
    https://github.com/bats-core/bats-core/commit/502dc47dd063c187
    https://github.com/bats-core/bats-core/issues/726
https://github.com/containers/toolbox/pull/1373
These files aren't marked as executable, and shouldn't be, because they
aren't meant to be standalone executable scripts. They're meant to be
part of a test suite driven by Bats. Therefore, it doesn't make sense
for them to have shebangs, because it gives the opposite impression.
The shebangs were actually being used by external tools like Coverity to
deduce the shell when running shellcheck(1). Shellcheck's inline
'shell' directive is a more obvious way to achieve that.
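As an illustration, a sourced helper file without a shebang could start
like the sketch below. The greet() function is hypothetical, and
'shell=bash' is only one of the values accepted by the directive.

```shell
# shellcheck shell=bash
#
# No shebang: this file is meant to be sourced by the test runner, not
# executed on its own. The directive above tells shellcheck(1) which
# shell dialect to assume.

greet() {
    echo "hello from a sourced helper"
}
```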
https://github.com/containers/toolbox/pull/1363
The '[' and 'test' implementations from GNU coreutils don't support '-v'
as a way to check if a shell variable is set [1]. Only Bash's built-in
implementations do.
This is quite confusing and makes it difficult to find out what '-v'
actually does. eg., 'man --all test' only shows the manual for the GNU
coreutils version, which doesn't list '-v' [1], and, 'man --all [' only
shows the manual for Bash's built-ins, which also doesn't list '-v'.
One has to go to the bash(1) manual to find it [2].
Elsewhere in the code base [3], the same thing is accomplished with '-z'
and parameter substitution, which are more widely supported and, hence,
easier to find documentation for.
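A minimal sketch of the '-z' and parameter substitution idiom is given
below. The FOO variable and the describe_foo() function are
hypothetical. "${FOO+x}" expands to 'x' if FOO is set, even to an empty
string, and to nothing if it's unset.

```shell
# Reports whether the FOO variable is set, without using '-v'.
describe_foo() {
    if [ -z "${FOO+x}" ]; then
        echo "unset"
    else
        echo "set"
    fi
}
```

Unlike a plain '[ -z "$FOO" ]', this tells apart an unset variable from
one that's set to an empty string.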
[1] https://manpages.debian.org/testing/coreutils/test.1.en.html
[2] https://linux.die.net/man/1/bash
[3] Commit 84ae385f33
    https://github.com/containers/toolbox/pull/1334
https://github.com/containers/toolbox/pull/1341
'[' is a command that's the same as 'test' and they might be implemented
as standalone executables or shell built-ins. Therefore, the negation
(ie., '!') has to cover the entire command to operate on its exit code.
Instead, if it's written as '[ ! ... ]', then the negation becomes an
argument to '[', which isn't the same thing.
Fallout from 54a2ca1ead
https://github.com/containers/toolbox/pull/1341
First, it's not a good idea to use awk(1) as a grep(1) replacement.
Unless one really needs the AWK programming language, it's better to
stick to grep(1) because it's simpler.
Secondly, it's better to look for a specific os-release(5) field instead
of looking for the occurrence of 'rawhide' anywhere in the file, because
it lowers the possibility of false positives.
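A sketch of matching a specific field with grep(1) is given below. The
is_rawhide() function is hypothetical, and the choice of the
REDHAT_BUGZILLA_PRODUCT_VERSION field is an assumption made only for
illustration. The test suite may look at a different field.

```shell
# Succeeds if the os-release(5) file given in $1 has the assumed field
# set to exactly 'rawhide'. Anchoring the pattern with '^' and '$'
# avoids matching 'rawhide' in some other field or as part of a longer
# value.
is_rawhide() {
    grep -q '^REDHAT_BUGZILLA_PRODUCT_VERSION=rawhide$' "$1"
}
```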
https://github.com/containers/toolbox/pull/1332
The current approach of extracting the VERSION_ID field from
os-release(5) assumes that the value is not quoted. There's no
guarantee that this will be the case. It only happens to be so on
Fedora by chance, and is different on Ubuntu:
$ cat /etc/os-release
...
VERSION_ID="22.04"
...
This means that "22.04", including the double quotes, is read as the
value of VERSION_ID on Ubuntu, not 22.04. This is wrong because this
value can't be used as is in image and container names. There's no
image called quay.io/toolbx/ubuntu-toolbox:"22.04" and double quotes are
not allowed in container names.
Instead, use the same approach as profile.d/toolbox.sh and the old POSIX
shell implementation that doesn't rely on the quoting of the
os-release(5) values.
Fallout from b27795a03e
https://github.com/containers/toolbox/pull/1320
The current approach of selecting all the os-release(5) fields that have
'ID' in their name (eg., ID, VERSION_ID, PLATFORM_ID, VARIANT_ID, etc.)
and then picking the first one, assumes that the ID field will always be
placed above the others in os-release(5). There's no guarantee that
this will be the case. It only happens to be so on Fedora by chance,
and is different on Ubuntu:
$ cat /etc/os-release
...
VERSION_ID="22.04"
...
ID=ubuntu
ID_LIKE=debian
...
This means that "22.04" is read as the value of ID on Ubuntu, which is
clearly wrong.
Instead, use the same approach as profile.d/toolbox.sh and the old POSIX
shell implementation that doesn't rely on the order of the os-release(5)
fields.
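One way to read os-release(5) without relying on the quoting or the
order of its fields is to source the file in a subshell, because the
format is compatible with shell variable assignments. The sketch below
assumes this approach. The function names and the file path parameter
are hypothetical and may not match the actual helpers.

```shell
# Prints the ID field from the os-release(5) file given in $1. The
# function body is a subshell, so the variables defined by the file
# don't leak into the caller.
get_system_id() (
    # shellcheck disable=SC1090
    . "$1"
    echo "$ID"
)

# Prints the VERSION_ID field. Any quotes around the value are removed
# by the shell when the file is sourced.
get_system_version() (
    # shellcheck disable=SC1090
    . "$1"
    echo "$VERSION_ID"
)
```

Given a file containing the Ubuntu lines shown above, the first
function prints 'ubuntu' and the second prints '22.04' without the
double quotes.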
Fallout from 54a2ca1ead
https://github.com/containers/toolbox/pull/1320
This allows using the 'distro' option to create and enter Arch Linux
containers. Due to Arch's rolling-release model, the 'release' option
isn't required. If 'release' is used, the accepted values are 'latest'
and 'rolling'.
https://github.com/containers/toolbox/pull/1311