People have reported the following issue with overlay:
$ docker run -ti --name=foo -v /dev/:/dev fedora bash
$ docker cp foo:/bin/bash /tmp
$ exit container
Upon container exit, /dev/pts gets unmounted too. This happens because
docker cp volume mounts get propagated to /run/docker/libcontainerd/....
When the container exits, it must be tearing down the mount point under
/run/docker/libcontainerd/... and, as these are "shared" mounts, the
unmount events propagate to /dev/pts, which gets unmounted too.
One way to solve this problem is to make sure "docker cp" volume mounts
don't become visible under /run/docker/libcontainerd/..
Here are more details of what is actually happening.
Make the overlay home directory (/var/lib/docker/overlay) a private mount
when docker starts and unmount it when docker stops. The reason for doing
this is as follows.
In Fedora and some other distributions, / is "shared". That means when
docker creates a container and mounts its root in /var/lib/docker/overlay/...
that mount point is "shared".
It looks like containerd/runc then bind mounts that rootfs into
/run/docker/libcontainerd/container-id/rootfs. This puts both the source
and destination mount points in a shared group, and they are both set up
to propagate mount events to each other.
Later, when "docker cp" is run, it sets up container volumes under
/var/lib/docker/overlay/container-id/... and all these mounts propagate
to /run/docker/libcontainerd/... mountVolumes() then makes these new
mount points private, but by that time propagation has already happened
and private only takes effect when the unmount happens.
So to stop this propagation of volumes by docker cp, make
/var/lib/docker/overlay a private mount point. That means when a container
rootfs is created, that mount point will be private too (it will inherit
the property from its parent). And that means when the bind mount happens
in the /run/ dir, the overlay mount point will not propagate mounts to /run/.
Other graphdrivers like devicemapper are already doing it and they don't
face this issue.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
The link feature on the docker0 bridge provides the short-id as a
container alias by default. With the built-in SD feature, providing a
container short-id as a network alias will fill that gap.
Signed-off-by: Madhu Venugopal <madhu@docker.com>
Fixes #22030
Because the publisher sends this same value to all the
stats endpoints, we need to make a copy of it as soon as we get it so
that we can make our modifications without affecting the others.
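A minimal sketch of the copy-before-modify idea, not the actual daemon code. The `Stats` type, its fields, and `copyStats` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Stats is a hypothetical, simplified stand-in for the value that the
// publisher hands to every stats endpoint.
type Stats struct {
	Name     string
	Networks map[string]uint64 // bytes received per interface (made up)
}

// copyStats returns a deep copy, so changes to the copy (including its
// map) never reach the value shared with the other endpoints.
func copyStats(s *Stats) *Stats {
	c := *s // copies the scalar fields
	c.Networks = make(map[string]uint64, len(s.Networks))
	for k, v := range s.Networks {
		c.Networks[k] = v
	}
	return &c
}

func main() {
	shared := &Stats{Name: "foo", Networks: map[string]uint64{"eth0": 100}}
	mine := copyStats(shared)
	mine.Networks["eth0"] = 0 // endpoint-specific modification
	fmt.Println(shared.Networks["eth0"], mine.Networks["eth0"])
}
```

A plain struct assignment would not be enough here, because the map would still be shared between the original and the copy.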
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This patch does the following:
1) Make the filter check logic the same as the `docker ps` filters
Right now the docker container filter logic works as follows:
when the same filter is used more than once, as below:
-f name=jack -f name=tom
it returns all containers whose name is jack or tom (OR logic)
when different filters are used, as below:
-f name=jack -f id=7d1
it returns all containers whose name is jack and whose id contains 7d1 (AND logic)
This makes sense for many use cases, but it does not cover complicated filter cases
such as "I want the containers whose name is jack or whose id is 7d1". That can be
worked around (get the names of the id=7d1 containers, get the name=jack containers,
and then construct the final list; this can be done on the user side with a shell or
the REST API)
2) Fix a network filter bug which could include duplicate results:
using -f name= -f id= returned duplicate results
3) Make the id filter the same as the container id filter, which means it matches
anywhere in the string rather than using a prefix match.
This is for consistent match logic.
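The OR-within-a-key, AND-across-keys rule from point 1 can be sketched as below. This is an illustration under stated assumptions, not the daemon's filter package: `Container`, `Filters`, `matchAny` and `match` are hypothetical names, and both keys use substring matching for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// Container is a hypothetical, minimal record used only for this sketch.
type Container struct {
	ID   string
	Name string
}

// Filters maps a filter key ("name", "id") to every value given for it.
type Filters map[string][]string

// matchAny reports whether field contains at least one of the values
// (OR within one key). A key with no values matches everything.
func matchAny(values []string, field string) bool {
	if len(values) == 0 {
		return true
	}
	for _, v := range values {
		if strings.Contains(field, v) {
			return true
		}
	}
	return false
}

// match requires every filter key to be satisfied (AND across keys).
func match(f Filters, c Container) bool {
	return matchAny(f["name"], c.Name) && matchAny(f["id"], c.ID)
}

func main() {
	cs := []Container{
		{ID: "7d1aa", Name: "jack"},
		{ID: "9c2bb", Name: "tom"},
		{ID: "7d1cc", Name: "ann"},
	}
	// Equivalent of: -f name=jack -f name=tom -f id=7d1
	f := Filters{"name": {"jack", "tom"}, "id": {"7d1"}}
	for _, c := range cs {
		if match(f, c) {
			fmt.Println(c.Name)
		}
	}
}
```

Because each item is tested once against all keys, an item that satisfies both the name and the id filter is emitted only once, which is the behavior point 2 asks for.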
Closes: #21417
Signed-off-by: Kai Qiang Wu(Kennan) <wkqwu@cn.ibm.com>
The `Status` field is a `map[string]interface{}` which allows the driver to pass
back low-level details about the underlying volume.
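A small sketch of how such a field serializes. Only the `Status` field name and its `map[string]interface{}` type come from the description above; the `Volume` struct, the `encode` helper, and the example keys are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Volume is a hypothetical, trimmed-down volume description.
type Volume struct {
	Name string
	// Status carries driver-defined key/value pairs and is left out
	// of the JSON when the driver reports nothing.
	Status map[string]interface{} `json:",omitempty"`
}

// encode marshals a volume to its JSON form.
func encode(v Volume) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	v := Volume{
		Name: "data",
		// Keys and values are made up; a real driver chooses its own.
		Status: map[string]interface{}{"replicas": 3, "healthy": true},
	}
	out, err := encode(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The open-ended map type means callers cannot rely on any particular key being present and should treat the content as driver-specific.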
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
People have reported the following problem.
- docker run -ti --name=foo -v /dev/:/dev/ fedora bash
- docker cp foo:/bin/bash /tmp
Once the cp operation is complete, /dev/pts is unmounted on the host. /dev/pts
is a submount of /dev/. This is completely unexpected. The reason for this
behavior is as follows.
containerArchivePath() calls mountVolumes(), which goes through all the mount
points of a container and mounts them in the daemon mount namespace in the
/var/lib/docker/devicemapper/mnt/<containerid>/rootfs dir. Once we have
extracted the data required, these are unmounted using UnmountVolumes().
Mounts are done using recursive bind (rbind), and they are unmounted using
the lazy unmount option on the top level mount (detachMounted()). That means
if there are submounts under the top level mount and they were "shared"
mounts with the host, the unmount events will propagate and unmount the
submount on the host as well.
For example, try the following.
- Prepare a parent mount point.
$ mkdir /root/foo
$ mount --bind /root/foo /root/foo
$ mount --make-rshared /root/foo
- Prepare a child mount
$ mkdir /root/foo/foo1
$ mount --bind /root/foo/foo1 /root/foo/foo1
- Bind mount foo at bar
$ mkdir /root/bar
$ mount --rbind /root/foo /root/bar
- Now lazy unmount /root/bar and it will unmount /root/foo/foo1 as well.
$ umount -l /root/bar
This is not intended. We just wanted to unmount /root/bar and anything
underneath it, and had no intention of unmounting anything on the source.
So far this was not a problem, as the docker daemon was running in a separate
mount namespace where all propagation was "slave". That means any unmounts
in the docker daemon namespace did not propagate to the host namespace.
But now we are running the docker daemon in the host namespace so that it is
possible to mount some volumes "shared" with the container, so that if the
container mounts something it propagates to the host namespace as well.
Given that mountVolumes() seems to be doing only temporary mounts to read some
data, there does not seem to be a need to mount these shared/slave. Just
mount these private so that on unmount nothing propagates and there are no
unintended consequences.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Running on kernel versions older than 3.10 has not been
supported for a while (as it's known to be unstable).
With the containerd integration, this has become more
apparent, because kernels < 3.4 don't support PR_SET_CHILD_SUBREAPER,
which is required for containerd-shim to run.
Change the previous "warning" to a "fatal" error, so
that we refuse to start.
There's still an escape-hatch for users by setting
"DOCKER_NOWARN_KERNEL_VERSION=1" so that they can
run "at their own risk".
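A minimal sketch of such a check, assuming the kernel release string has already been obtained (for example from uname). `Version`, `parseVersion` and `checkKernel` are hypothetical names, and treating any non-empty value of the variable as "set" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// Version is a hypothetical major.minor kernel version pair.
type Version struct{ Major, Minor int }

// less reports whether v is older than o.
func (v Version) less(o Version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// parseVersion reads the leading "major.minor" from a release string
// such as "3.10.0-327.el7.x86_64".
func parseVersion(release string) (Version, error) {
	var v Version
	if _, err := fmt.Sscanf(release, "%d.%d", &v.Major, &v.Minor); err != nil {
		return v, fmt.Errorf("cannot parse kernel release %q: %v", release, err)
	}
	return v, nil
}

// checkKernel returns an error for kernels older than 3.10 unless the
// caller asked to override the check.
func checkKernel(release string, override bool) error {
	v, err := parseVersion(release)
	if err != nil {
		return err
	}
	if v.less(Version{3, 10}) && !override {
		return fmt.Errorf("kernel %s is older than 3.10 and is not supported", release)
	}
	return nil
}

func main() {
	// Assumption: any non-empty value enables the escape hatch.
	override := os.Getenv("DOCKER_NOWARN_KERNEL_VERSION") != ""
	for _, r := range []string{"3.8.13", "4.4.0-21-generic"} {
		fmt.Println(r, checkKernel(r, override))
	}
}
```

Returning an error rather than logging lets the caller decide to abort startup, which matches the move from a warning to a fatal error.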
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Use the new methods from engine-api, which make it clearer which element is
required when consuming the API.
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
If a container fails to start because of (say) "command not found", the
container never actually started, so we shouldn't log start and die events
for it, because they didn't actually happen.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This change allows filtering events that happened in the past
without waiting for future events. Example:
docker events --since -1h --until -30m
Signed-off-by: David Calavera <david.calavera@gmail.com>
In TP5, Hyper-V containers need all image files ACLed so that the virtual
machine process can access them. This was fixed post-TP5 in Windows, but
for TP5 we need to explicitly add these ACLs.
Signed-off-by: John Starks <jostarks@microsoft.com>
Implements a `CachedPath` function on the volume plugin adapter that we
call from the volume list function instead of `Path`.
If a driver does not implement `CachedPath` it will just call `Path`.
Also makes sure we store the path on Mount and remove the path on
Unmount.
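A rough sketch of the fallback, using an optional interface and a type assertion. Apart from the `CachedPath` and `Path` method names, everything here (`Volume`, `pathCacher`, `listPath`, and the two example types) is hypothetical and simplified.

```go
package main

import "fmt"

// Volume is the minimal interface assumed for every volume.
type Volume interface {
	Path() string
}

// pathCacher is an optional interface for volumes that remember
// their path.
type pathCacher interface {
	CachedPath() string
}

// listPath is what a list operation would call: it prefers CachedPath
// and falls back to Path when the volume does not implement it.
func listPath(v Volume) string {
	if c, ok := v.(pathCacher); ok {
		return c.CachedPath()
	}
	return v.Path()
}

type plainVolume struct{ path string }

func (p plainVolume) Path() string { return p.path }

type cachingVolume struct {
	path   string // what the (possibly expensive) driver call returns
	cached string // stored on Mount, removed on Unmount
	calls  int    // number of Path calls, to show the saving
}

func (c *cachingVolume) Path() string {
	c.calls++
	return c.path
}
func (c *cachingVolume) CachedPath() string { return c.cached }
func (c *cachingVolume) Mount()             { c.cached = c.Path() }
func (c *cachingVolume) Unmount()           { c.cached = "" }

func main() {
	v := &cachingVolume{path: "/mnt/data"}
	v.Mount()
	fmt.Println(listPath(v), v.calls)
	fmt.Println(listPath(plainVolume{path: "/mnt/plain"}))
}
```

Listing many volumes then costs no extra driver round trips for volumes that cache their path.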
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This patch will allow users to specify namespace specific "kernel parameters"
for running inside of a container.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Newly created containers which have not been started yet should not be
listed when the "exited=0" filter is used with "ps -a".
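A sketch of why the extra condition is needed: a container that never ran still carries the zero value as its exit code. The `Container` fields and `matchExited` are hypothetical and much simpler than the real container state.

```go
package main

import "fmt"

// Container is a hypothetical, minimal view of container state.
type Container struct {
	Name     string
	Started  bool // false for a container that was created but never started
	Running  bool
	ExitCode int
}

// matchExited reports whether c should be listed for "exited=<code>".
// Checking the exit code alone would also match never-started
// containers, whose ExitCode is still the zero value.
func matchExited(c Container, code int) bool {
	return c.Started && !c.Running && c.ExitCode == code
}

func main() {
	cs := []Container{
		{Name: "fresh"},                // created only
		{Name: "done", Started: true},  // ran and exited with 0
		{Name: "failed", Started: true, ExitCode: 1},
	}
	for _, c := range cs {
		if matchExited(c, 0) {
			fmt.Println(c.Name)
		}
	}
}
```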
Signed-off-by: Boynux <boynux@gmail.com>
If aufs is already modprobe'd but we are in a user namespace, the
aufs driver will happily load but then get EPERM when it actually tries
to do something. So detect that condition.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Currently, if you restart the docker daemon, all containers with the restart
policy `on-failure` will be started regardless of their `RestartCount`,
which makes the daemon take extra time to restart.
This commit stops these containers from being started unnecessarily on
daemon restart.
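A minimal sketch of the kind of check this implies. It assumes, without confirmation from this message, that an `on-failure` container is skipped once its `RestartCount` has reached the policy's maximum retry count. `RestartPolicy`, its fields, and `shouldStartOnBoot` are hypothetical names, and exit codes are ignored.

```go
package main

import "fmt"

// RestartPolicy is a hypothetical, simplified restart policy.
type RestartPolicy struct {
	Name              string
	MaximumRetryCount int // 0 means "no limit" in this sketch
}

// shouldStartOnBoot decides whether the daemon starts the container
// again when the daemon itself restarts.
func shouldStartOnBoot(p RestartPolicy, restartCount int) bool {
	switch p.Name {
	case "always":
		return true
	case "on-failure":
		// Assumption: retries that are already used up are not renewed.
		return p.MaximumRetryCount == 0 || restartCount < p.MaximumRetryCount
	default:
		return false
	}
}

func main() {
	p := RestartPolicy{Name: "on-failure", MaximumRetryCount: 3}
	fmt.Println(shouldStartOnBoot(p, 1)) // retries left
	fmt.Println(shouldStartOnBoot(p, 3)) // retries used up
}
```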
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This fix tries to fix the issue in #21848 where `docker stats` will not correctly
display the container stats when the container reuses another container's
network stack.
The issue is that when `stats` is performed, the daemon checks the container
network settings' `SandboxID`. Unfortunately, for containers that reuse another
container's network stack (`NetworkMode.IsConnected()`), SandboxID is not assigned.
Therefore, the daemon thinks the id is invalid and the remote API never returns.
This fix tries to resolve the SandboxID by iterating through connected containers
and identifying the appropriate SandboxID.
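The lookup can be pictured as following the "container:<id>" network mode until a container with its own sandbox is found. This is only a sketch of that idea: the `Container` struct, the in-memory map, and `resolveSandboxID` are hypothetical and do not reflect the daemon's real types.

```go
package main

import (
	"fmt"
	"strings"
)

// Container is a hypothetical, minimal view of a container's network
// settings.
type Container struct {
	ID          string
	NetworkMode string // e.g. "bridge" or "container:<id>"
	SandboxID   string // empty when another container's stack is reused
}

// resolveSandboxID follows "container:<id>" references until it finds
// a container that owns a sandbox.
func resolveSandboxID(all map[string]*Container, id string) (string, error) {
	seen := map[string]bool{}
	for {
		c, ok := all[id]
		if !ok {
			return "", fmt.Errorf("no such container: %s", id)
		}
		if c.SandboxID != "" {
			return c.SandboxID, nil
		}
		// Stop on reference loops and on modes that point nowhere.
		if seen[id] || !strings.HasPrefix(c.NetworkMode, "container:") {
			return "", fmt.Errorf("container %s has no sandbox", id)
		}
		seen[id] = true
		id = strings.TrimPrefix(c.NetworkMode, "container:")
	}
}

func main() {
	all := map[string]*Container{
		"aaa": {ID: "aaa", NetworkMode: "bridge", SandboxID: "sb-1"},
		"bbb": {ID: "bbb", NetworkMode: "container:aaa"},
	}
	id, err := resolveSandboxID(all, "bbb")
	fmt.Println(id, err)
}
```

Returning an error instead of looping is what lets a caller such as a stats endpoint give up promptly rather than hang.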
A test case for `stats` remote API has been added to check if `stats` will return
within the timeout.
This fix fixes #21848.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
When a user tries to restart a restarting container, the docker client reports
the error "container is already active", and the container is stopped
instead of being restarted, which is seriously wrong.
What's more critical is that when the user tries to start this container
again, it will always fail.
This error can also be reproduced with a `docker stop`+`docker start`.
This commit fixes the bug.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This fix tries to fix the discrepancy between API and CLI on hostname
validation. Previously, the hostname validation was handled at the
CLI interface in runconfig/opts/parse.go, which returned an error if the
hostname was invalid. However, if an end user uses the remote API to
pass the hostname, the error will not be returned immediately.
Instead, the error will only be thrown when the container creation
fails. This creates a behavior discrepancy between API and CLI.
In this fix, the hostname validation was moved to
verifyContainerSettings so the behavior will be the same for API and
CLI.
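A sketch of a single validation function that both entry points could share. The regular expression and the 63-character limit are assumptions chosen for illustration and are not taken from verifyContainerSettings; `validateHostname` is a hypothetical name. Only the error prefix comes from this message.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostnameRE is an illustrative pattern: dot-separated labels made of
// letters, digits and inner hyphens.
var hostnameRE = regexp.MustCompile(
	`^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)*$`)

// validateHostname is called from one place, so the API and the CLI
// report the same error for the same input.
func validateHostname(h string) error {
	if h == "" {
		return nil // no hostname requested
	}
	// Assumption: 63 characters is the limit used in this sketch.
	if len(h) > 63 || !hostnameRE.MatchString(h) {
		return fmt.Errorf("invalid hostname format: %s", h)
	}
	return nil
}

func main() {
	for _, h := range []string{"web-1", "-bad", "db.example"} {
		fmt.Printf("%q: %v\n", h, validateHostname(h))
	}
}
```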
After the change, since the CLI does not handle the hostname validation
any more, the previous unit tests about hostname validation on the CLI
in runconfig/opts/parse_test.go have to be updated as well because
there is no validation at this stage. All those unit tests are moved
to the integration test TestRunTooLongHostname so that the hostname
validation is still properly covered as before.
Note: Since the hostname validation moved to API, the error message
changes from `invalid hostname format for --hostname:` to
`invalid hostname format:` as well because `--hostname` is passed
to CLI only.
This fix fixes #21595.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to add an additional syslog-format of `rfc5424micro`, which follows
the same format as rfc5424 except that it uses microsecond resolution for the
timestamp. The purpose is to solve the issue raised in #21793 where log events
might lose their ordering if they happen within the same second.
The timestamp field in rfc5424 is derived from rfc3339, though the maximum
resolution is limited to "TIME-SECFRAC" which is 6 (microsecond resolution).
The appropriate documentation (`docs/admin/logging/overview.md`) has been updated
to reflect the change in this fix.
This fix adds a unit test to cover the newly introduced format.
This fix fixes #21793.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
When a container is automatically restarted based on its restart policy,
docker events only gets the "die" event and not the "start" event, which is
not consistent with previous behavior. This commit adds the "start"
event back.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This adds support to the Windows graph driver for ApplyDiff on a base
layer. It also adds support for hard links, which are needed because the
Windows base layers double in size without hard link support.
Signed-off-by: John Starks <jostarks@microsoft.com>
Previously, Windows only supported running with an OS-managed base image.
With this change, Windows supports normal, Linux-like layered images, too.
Signed-off-by: John Starks <jostarks@microsoft.com>
Fixes an issue that prevents nano server images from loading properly. Also updates logic for custom image loading to avoid preventing daemon start because an image failed to load.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
Overlay tests were failing with a misleading message when /var/tmp was an overlay mount.
Now overlay tests will be skipped when an attempt is made to run them on overlay.
Tests will now use the TMPDIR environment variable instead of only /var/tmp.
Fixes #21686
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
On aufs, auplink is run before the Unmount. Irrespective of the
result, we proceed to issue an Unmount syscall. Given that,
demote errors on auplink to warnings.
Signed-off-by: Anusha Ragunathan <anusha@docker.com>
Since the layer store was introduced, the level above the graphdriver
now differentiates between read/write and read-only layers. This
distinction is useful for graphdrivers that need to take special steps
when creating a layer based on whether it is read-only or not.
Adding this parameter allows the graphdrivers to differentiate, which
in the case of the Windows graphdriver, removes our dependence on parsing
the id of the parent for "-init" in order to infer this information.
This will also set the stage for unblocking some of the layer store
unit tests in the next preview build of Windows.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
This else case was lost in the migration from native execdriver to OCI
implementation via runc. There is no need to have external setkey when
--net=host.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
The kernel has no limit for memory reservation, but the default
behavior differs between kernel versions.
On kernel 3.13,
docker run --rm --memory-reservation 1k busybox cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes
the output would be 4096, but on kernel 4.1, the output is 0.
Since we have a minimum limit for memory and kernel memory, we
can have this limit for memory reservation as well, to make
the behavior consistent.
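A sketch of such a lower bound. The 4 MiB value is an assumption used for illustration, since the message does not state the limit, and `validateReservation` is a hypothetical name.

```go
package main

import "fmt"

// minReservation is an assumed minimum of 4 MiB, for illustration only.
const minReservation = 4 * 1024 * 1024

// validateReservation rejects values that are set but below the
// minimum. Zero means "not set" and is accepted.
func validateReservation(bytes int64) error {
	if bytes > 0 && bytes < minReservation {
		return fmt.Errorf("minimum memory reservation allowed is %d bytes, got %d",
			minReservation, bytes)
	}
	return nil
}

func main() {
	fmt.Println(validateReservation(1024))             // 1k, too small
	fmt.Println(validateReservation(64 * 1024 * 1024)) // 64 MiB, accepted
}
```

Rejecting small values up front means the result no longer depends on how a particular kernel version rounds the soft limit.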
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This vendors in new spec/runc that supports
setting readonly and masked paths in the
configuration. Using this allows us to make an
exception for `--privileged`.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
These fields are needed to specify the exact version of Windows that an
image can run on. They may be useful for other platforms in the future.
This also changes image.store.Create to validate that the loaded image is
supported on the current machine. This change affects Linux as well, since
it now validates the architecture and OS fields.
Signed-off-by: John Starks <jostarks@microsoft.com>
btrfs-progs-4.5 introduces device delete by devid.
For this reason btrfs_ioctl_vol_args_v2's name was encapsulated
in a union.
This patch sets btrfs_ioctl_vol_args_v2's name
using a C function in order to preserve compatibility
with all btrfs-progs versions.
Signed-off-by: Julio Montes <imc.coder@gmail.com>