This means you're able to set the bits for capabilities on files
inside the container. This is needed for e.g. many fedora packages
as they use finegrained capabilities rather than setuid binaries.
This is safe as we're not adding capabilities really, since the
container is already allowed to create setuid binaries. Setuid
binaries are strictly more powerful that any capabilities (as root implies
all capabilities).
This doesn't mean the container can *gain* capabilities that it
doesn't already have though. The actual set of caps are strictly
decreasing.
Older kernel can't handle O_PATH in open() so this will
fail on dirs and symlinks. For dirs wa can fallback to
the normal Utimes, but for symlinks there is not much to do
but ignore their timestamps.
This creates a container by copying the corresponding files
from the layers into the containers. This is not gonna be very useful
on a developer setup, as there is no copy-on-write or general diskspace
sharing. It also makes container instantiation slower.
However, it may be useful in deployment where we don't always have a lot
of containers running (long-running daemons) and where we don't
do a lot of docker commits.
Change the comparison to better handle files that are copied during
container creation but not actually changed:
* Inode - this will change during a copy
* ctime - this will change during a copy (as we can't set it back)
* blocksize - this will change for sparse files during copy
* size for directories - this can change anytime but doesn't
necessarily reflect an actual contents change
* Compare mtimes at microsecond precision (as this is what utimes has)
There are some changes here that make the file metadata better match
the layer files:
* Set the mode of the file after the chown, as otherwise the per-group/uid
specific flags and e.g. sticky bit is lost
* Use lchown instead of chown
* Delay mtime updates to after all other changes so that later file
creation doesn't change the mtime for the parent directory
* Use Futimes in combination with O_PATH|O_NOFOLLOW to set mtime on symlinks
Rather than scan the files in the old directory twice to detect the
deletions we now scan both directories twice and then do all the
diffing on the in-memory structure.
This is more efficient, but it also lets us diff more complex things
later that are not exact on-disk trees.
The init layer needs to be topmost to make sure certain files
are always there (for instance, the ubuntu:12.10 image wrongly
has /dev/shm being a symlink to /run/shm, and we need to override
that). However, previously the devmapper code implemented the
init layer by putting it in the base devmapper device, which meant
layers above it could override these files (so that ubuntu:12.10
broke).
So, instead we put the base layer in *each* images devmapper device.
This is "safe" because we still have the pristine layer data
in the layer directory. Also, it means we diff the container
against the image with the init layer applied, so it won't show
up in diffs/commits.
lxc-start requires / to be mounted private, otherwise the changes
it does inside the container (both mounts and unmounts) will propagate
out to the host.
We work around this by starting up lxc-start in its own namespace where
we set / to rprivate.
Unfortunately go can't really execute any code between clone and exec,
so we can't do this in a nice way. Instead we have a horrible hack that
use the unshare command, the shell and the mount command...
Right now this does nothing but add a new layer, but it means
that all DeviceMounts are paired with DeviceUnmounts so that we
can track (and cleanup) active mounts.
I currently need this to get the tests running, otherwise it will
mount the docker.test binary inside the containers, which doesn't
work due to the libdevmapper.so dependency.
We wrap the "real" DeviceSet for each test so that we get only
a single device-mapper pool and loopback mounts, but still
separate out the IDs in the tests. This makes the test run
much faster.
This wraps an existing DeviceSet and just adds a prefix to all ids in
it. This will be useful for reusing a single DeviceSet for all the tests
(but with separate ids)
There are a few changes:
* Callers can specify if they want recursive behaviour or not
* All file listings to tar are sent on stdin, to handle long lists better
* We can pass in a list of filenames which will be created as empty
files in the tarball
This is exactly what we want for the creation of layer tarballs given
a container fs, a set of files to add and a set of whiteout files to create.
To do diffing we just compare file metadata, so this relies
on things like size and mtime/ctime to catch any changes.
Its *possible* to trick this by updating a file without
changing the size and setting back the mtime/ctime, but
that seems pretty unlikely to happen in reality, and lets
us avoid comparing the actual file data.
Without this there is really no way to map back from the device-mapper
devices to the actual docker image/container ids in case the json file
somehow got lost