- Stop serializing JSONMessage in favor of events.Message.
- Keep backwards compatibility with JSONMessage for container events.
Signed-off-by: David Calavera <david.calavera@gmail.com>
This can be allowed because it should only restrict more per the seccomp docs, and multiple apps use it today.
Signed-off-by: Jessica Frazelle <acidburn@docker.com>
Block kcmp, procees_vm_readv, process_vm_writev.
All these require CAP_PTRACE, and are only used for ptrace related
actions, so are not useful as we block ptrace.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The bpf syscall can load code into the kernel which may
persist beyond container lifecycle. Requires CAP_SYS_ADMIN
already.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
These provide an in kernel virtual machine for x86 real mode on x86
used by one very early DOS emulator. Not required for any normal use.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The stime syscall is a legacy syscall on some architectures
to set the clock, should be blocked as time is not namespaced.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
clock_adjtime is the new posix style version of adjtime allowing
a specific clock to be specified. Time is not namespaced, so do
not allow.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
This is a new version of init_module that takes a file descriptor
rather than a file name.
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
The set_robust_list syscall sets the list of futexes which are
cleaned up on thread exit, and are needed to avoid mutexes
being held forever on thread exit.
See for example in Musl libc mutex handling:
http://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_mutex_trylock.c#n22
Signed-off-by: Justin Cormack <justin.cormack@unikernel.com>
Support restoreCustomImage for windows with a new interface to extract
the graph driver from the LayerStore.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
This provides a best effort on daemon restarts to restart containers
which have linked containers that are not up yet instead of failing.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
It's used for updating properties of one or more containers, we only
support resource configs for now. It can be extended in the future.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
When a container is created it is registered before the mount is created. This can lead to mount does not exist errors when inspecting between create and mount.
Fixes#18753
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
RWLayer will now have more operations and be protected through a referenced type rather than always looked up by string in the layer store.
Separates creation of RWLayer (write capture layer) from mounting of the layer.
This allows mount labels to be applied after creation and allowing RWLayer objects to have the same lifespan as a container without performance regressions from requiring mount.
Signed-off-by: Derek McGowan <derek@mcgstyle.net> (github: dmcgowan)
Add filter support for `network ls` to hide predefined network,
then user can use "docker network rm `docker network ls -f type=custom`"
to delete a bundle of userdefined networks.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Don't involve code waiting for blocking channel in locked critical
section because it has potential risk of hanging forever.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
We keep only one logic to test event related behavior that will help us
diagnose flacky event errors.
Signed-off-by: David Calavera <david.calavera@gmail.com>
- Make the API client library completely standalone.
- Move windows partition isolation detection to the client, so the
driver doesn't use external types.
Signed-off-by: David Calavera <david.calavera@gmail.com>
This is a very docker concept that nobody elses need.
We only maintain it to keep the API backwards compatible.
Signed-off-by: David Calavera <david.calavera@gmail.com>
1. It's a cgroup api, fit the general defination that we take
cgroup options as kind of resource options.
2. It's common usage and very helpful as explained here:
https://github.com/docker/docker/pull/18270#issuecomment-160561316
3. It's already in `Resource` struct in
daemon/execdriver/driver_unix.go
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
The loopback logic is not technically exclusive to the devicemapper
driver. This reorganizes the code such that the loopback code is usable
outside of the devicemapper package and driver.
Signed-off-by: Vincent Batts <vbatts@redhat.com>
These filters are only use to interchange data between clients and daemons.
They don't belong to the parsers package.
Signed-off-by: David Calavera <david.calavera@gmail.com>
Really fixing 2 things:
1. Panic when any error is detected while walking the btrfs graph dir on
removal due to no error check.
2. Nested subvolumes weren't actually being removed due to passing in
the wrong path
On point 2, for a path detected as a nested subvolume, we were calling
`subvolDelete("/path/to/subvol", "subvol")`, where the last part of the
path was duplicated due to a logic error, and as such actually causing
point #1 since `subvolDelete` joins the two arguemtns, and
`/path/to/subvol/subvol` (the joined version) doesn't exist.
Also adds a test for nested subvol delete.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
- Move time json marshaling to the jsonlog package: this is a docker
internal hack that we should not promote as a library.
- Move Timestamp encoding/decoding functions to the API types: This is
only used there. It could be a standalone library but I don't this
it's worth having a separated repo for this. It could introduce more
complexity than it solves.
Signed-off-by: David Calavera <david.calavera@gmail.com>
This restores the behavior that existed prior to #16235 for setting
OOMKilled, while retaining the additional benefits it introduced around
emitting the oom event.
This also adds a test for the most obvious OOM cases which would have
caught this regression.
Fixes#18510
Signed-off-by: Euan <euank@amazon.com>
This was causing the host /dev/mqueue to be remapped to the daemon's
user namespace range root user and group. Given the perms are open on
the mqueue path, there is no need to chown this path at all.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
After the very first init of the graph `docker info` correctly shows the
base fs type under `Backing Filesystem`. This information isn't stored
anywhere. After a restart (w/o erasing `/var/lib/docker`) `docker info`
shows an empty string under `Backing Filesystem`.
This patch records the base fs type after the first run in the metadata
or, to fix old devices that don't have this info in the metadata, just
probe the fs type of the base device at graph startup.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
After addition of multi-host networking in Docker 1.9, Docker Remote
API is still returning only the network specified during creation
of the container in the “List Containers” (`/containers/json`) endpoint:
...
"HostConfig": {
"NetworkMode": "default"
},
The list of networks containers are attached to is only available at
Get Container (`/containers/<id>/json`) endpoint.
This does not allow applications utilizing multi-host networking to
be built on top of Docker Remote API.
Therefore I added a simple `"NetworkSettings"` section to the
`/containers/json` endpoint. This is not identical to the NetworkSettings
returned in Get Container (`/containers/<id>/json`) endpoint. It only
contains a single field `"Networks"`, which is essentially the same
value shown in inspect output of a container.
This change adds the following section to the `/containers/json`:
"NetworkSettings": {
"Networks": {
"bridge": {
"EndpointID": "2cdc4edb1ded3631c81f57966563e...",
"Gateway": "172.17.0.1",
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": "02:42:ac:11:00:02"
}
}
}
This is of type `SummaryNetworkSettings` type, a minimal version of
`api/types#NetworkSettings`.
Actually all I need is the network name and the IPAddress fields. If folks
find this addition too big, I can create a `SummaryEndpointSettings` field
as well, containing just the IPAddress field.
Signed-off-by: Ahmet Alp Balkan <ahmetalpbalkan@gmail.com>
`adaptContainerSettings` is growing up, new it's only called
when create. It'll be a problem that old containers will never
have chance to adapt the latest rule. `HostConfig` of these
containers will be obsoleted.
Add this calling to start to avoid problems like #18550 and
avoid such backward compatability in the future.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Whether a shared/slave volume propagation will work or not also depends on
where source directory is mounted on and what are the propagation properties
of that mount point. For example, for shared volume mount to work, source
mount point should be shared. For slave volume mount to work, source mount
point should be either shared/slave.
This patch determines the mount point on which directory is mounted and
checks for desired minimum propagation properties of that mount point. It
errors out of configuration does not seem right.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Allow passing mount propagation option shared, slave, or private as volume
property.
For example.
docker run -ti -v /root/mnt-source:/root/mnt-dest:slave fedora bash
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
To make docker inspect return a consistent result of networksettings
for created container and stopped container, it's bettew to update
the network settings on container creating.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
registry.ResolveAuthConfig() only needs the AuthConfigs from the ConfigFile, so
this change passed just the AuthConfigs.
Signed-off-by: Daniel Nephin <dnephin@gmail.com>
Container needs to be locked when updating the fields, and
this PR also remove the redundant `parseSecurityOpt` since
it'll be done in `setHostConfig`.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This commit adds a transfer manager which deduplicates and schedules
transfers, and also an upload manager and download manager that build on
top of the transfer manager to provide high-level interfaces for uploads
and downloads. The push and pull code is modified to use these building
blocks.
Some benefits of the changes:
- Simplification of push/pull code
- Pushes can upload layers concurrently
- Failed downloads and uploads are retried after backoff delays
- Cancellation is supported, but individual transfers will only be
cancelled if all pushes or pulls using them are cancelled.
- The distribution code is decoupled from Docker Engine packages and API
conventions (i.e. streamformatter), which will make it easier to split
out.
This commit also includes unit tests for the new distribution/xfer
package. The tests cover 87.8% of the statements in the package.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
All underlay dirs need proper remapped ownership. This bug was masked by the
fact that the setupInitLayer code was chown'ing the dirs at startup
time. Since that bug is now fixed, it revealed this permissions issue.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Each plug-in operates as a separate service, and registers with Docker
through general (plug-ins API)
[https://blog.docker.com/2015/06/extending-docker-with-plugins/]. No
Docker daemon recompilation is required in order to add / remove an
authentication plug-in. Each plug-in is notified twice for each
operation: 1) before the operation is performed and, 2) before the
response is returned to the client. The plug-ins can modify the response
that is returned to the client.
The authorization depends on the authorization effort that takes place
in parallel [https://github.com/docker/docker/issues/13697].
This is the official issue of the authorization effort:
https://github.com/docker/docker/issues/14674
(Here)[https://github.com/rhatdan/docker-rbac] you can find an open
document that discusses a default RBAC plug-in for Docker.
Signed-off-by: Liron Levin <liron@twistlock.com>
Added container create flow test and extended the verification for ps
Ubuntu 14.04 LTS is on apparmor 2.8.95.
This enables `ps` inside a container without causing
audit log entries on the host.
Signed-off-by: Joel Hansson <joel.hansson@ecraft.com>
Use better error message when user want to connect container with same
name to one network, this can help avoid confusion.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This solves a bug where /etc may have pre-existing permissions from
build time, but init layer setup (reworked for user namespaces) was
assuming root ownership. Adds a test as well to catch this situation in
the future.
Minor fix to wrong ordering of chown/close on files created during the
same initlayer setup.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
So other packages don't need to import the daemon package when they
want to use this struct.
Signed-off-by: David Calavera <david.calavera@gmail.com>
Signed-off-by: Tibor Vass <tibor@docker.com>
A TopicFunc is an interface to let the pubisher decide whether it needs
to send a message to a subscriber or not. It returns true if the
publisher must send the message and false otherwise.
Users of the pubsub package can create a subscriber with a topic
function by calling `pubsub.SubscribeTopic`.
Message delivery has also been modified to use concurrent channels per
subscriber. That way, topic verification and message delivery is not
o(N+M) anymore, based on the number of subscribers and topic verification
complexity.
Using pubsub topics, the API stops controlling the message delivery,
delegating that function to a topic generated with the filtering
provided by the user. The publisher sends every message to the
subscriber if there is no filter, but the api doesn't have to select
messages to return anymore.
Signed-off-by: David Calavera <david.calavera@gmail.com>
Improves the current filtering implementation complixity.
Currently, the best case is O(N) and worst case O(N^2) for key-value filtering.
In the new implementation, the best case is O(1) and worst case O(N), again for key-value filtering.
Signed-off-by: David Calavera <david.calavera@gmail.com>
It will Tar up contents of child directory onto tmpfs if mounted over
This patch will use the new PreMount and PostMount hooks to "tar"
up the contents of the base image on top of tmpfs mount points.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Closes#9798
@maintainers please note that this is a change to the UX. We no longer
require the -f flag on `docker tag` to move a tag from an existing image.
However, this does make us more consistent across our commands,
see https://github.com/docker/docker/issues/9798 for the history.
Signed-off-by: Doug Davis <dug@us.ibm.com>
- avoid empty Names in container list API when fails to remove
a container
- avoid dead containers when fails to create a container
Signed-off-by: Shijiang Wei <mountkin@gmail.com>
ext4 filesystem creation can take a long time on 100G thin device and
systemd might time out and kill docker service. Often user is left thinking
why docker is taking so long and logs don't give any hint. Log an info
message in journal for start and end of filesystem creation. That way
a user can look at logs and figure out that filesystem creation is
taking long time.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
libcontainer v0.0.4 introduces setting `/proc/self/oom_score_adj` to
better tune oom killing preferences for container process. This patch
simply integrates OomScoreAdj libcontainer's config option and adjust
the cli with this new option.
Signed-off-by: Antonio Murdaca <amurdaca@redhat.com>
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Currently, the resources associated with the io.Reader returned by
TarStream are only freed when it is read until EOF. This means that
partial uploads or exports (for example, in the case of a full disk or
severed connection) can leak a goroutine and open file. This commit
changes TarStream to return an io.ReadCloser. Resources are freed when
Close is called.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Docker daemon uses kv-store as the host-discovery backend.
Discovery module tracks the liveness of a node through a simple
keepalive mechanism. The keepalive mechanism depends on every
node performing heartbeat by registering itself with the discovery
module (via KV-Store Put operation). And for every Put operation,
the discovery module in all other nodes will receive a Watch
notification. That keeps the node alive.
Any node that fails to register itself within the TTL timer is
considered dead and removed from the discovery database.
The default timer (heartbeat = 20 seconds & ttl = 60 seconds)
works fine for small clusters. But for large clusters, these
default timers are extremely aggressive and that causes high CPU
& most of the processing is spent managing the node discovery
and that impacts normal daemon operation.
Hence we need a way to make the discovery ttl and heartbeat
configurable. As the cluster size grows, the user can change
these timers to make sure the daemon scales.
Signed-off-by: Madhu Venugopal <madhu@docker.com>
Add distribution package for managing pulls and pushes. This is based on
the old code in the graph package, with major changes to work with the
new image/layer model.
Add v1 migration code.
Update registry, api/*, and daemon packages to use the reference
package's types where applicable.
Update daemon package to use image/layer/tag stores instead of the graph
package
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Adjust the docker-default profile for when the docker daemon is running in
AppArmor confinement. To enable 'docker kill' we need to allow the container
to receive kill signals from the daemon.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
Remove double reference between containers and exec configurations by
keeping only the container id.
Signed-off-by: David Calavera <david.calavera@gmail.com>
This is a small configuration struct used in two scenarios:
1. To attach I/O pipes to a running containers.
2. To attach to execution processes inside running containers.
Although they are similar, keeping the struct in the same package
than exec and container can generate cycled dependencies if we
move any of them outside the daemon, like we want to do
with the container.
Signed-off-by: David Calavera <david.calavera@gmail.com>
* This commit will mark --before and --since as deprecated, but leave their behavior
unchanged until they are removed, then re-implement them as options for --filter.
* And update the related docs.
* Update the integration tests.
Fixes issue #17716
Signed-off-by: Wen Cheng Ma <wenchma@cn.ibm.com>
- Optional "--shm-size=" was added to the sub-command(run, create,and build).
- The size of /dev/shm in the container can be changed
when container is made.
- Being able to specify is a numerical value that applies number,
b, k, m, and g.
- The default value is 64MB, when this option is not set.
- It deals with both native and lxc drivers.
Signed-off-by: NIWA Hideyuki <niwa.hiedyuki@jp.fujitsu.com>
Volumes can have more mount points beneath them and unmount will fail. This
is the case when a bind mounted directory on host already had a mount point
underneath it. So use lazy unmount instead.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Our implementation of systemd cgroups is mixture of systemd api and
plain filesystem api. It's hard to keep it up to date with systemd and
it already contains some nasty bugs with new versions. Ideally it should
be replaced with some daemon flag which will allow to set parent systemd
slice.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
By removing deprecated volume structures, now that windows mount volumes we don't need a initializer per platform.
Signed-off-by: David Calavera <david.calavera@gmail.com>