From 42d9c55a1eef0d049f7b4b51d4993d8e8436cbf1 Mon Sep 17 00:00:00 2001 From: yallop Date: Fri, 5 May 2017 19:49:00 +0100 Subject: [PATCH] WIP: Osxfs caching updates (#2883) * Update the osxfs documentation with recent details about caching. Signed-off-by: Jeremy Yallop * Add a note about 'cached' to the volumes tutorial. Signed-off-by: Jeremy Yallop * Add a document giving a detailed specification of osxfs caching. Signed-off-by: Jeremy Yallop * Americanize spelling in the osxfs-caching document. Signed-off-by: Jeremy Yallop * More osxfs benchmark details for 'go list'. Signed-off-by: Jeremy Yallop * Remove mentions of the blog post. Signed-off-by: Jeremy Yallop * editorial, topic structure changes, x-refs Signed-off-by: Victoria Bialas * linked to new topic also where bind mounts are discussed in Namespaces Signed-off-by: Victoria Bialas * added more tags Signed-off-by: Victoria Bialas * Decapitalize descriptions Signed-off-by: Jeremy Yallop * "releease" ~> "release" Signed-off-by: Jeremy Yallop * Use @avsm's suggested rewording for the osxfs-caching introduction. Signed-off-by: Jeremy Yallop * Add an example showing how to use `cached`, `consistent`, etc. Signed-off-by: Jeremy Yallop * Add a link to the user-guided caching blog post. 
Signed-off-by: Jeremy Yallop * added examples heading, more x-refs to blog post, docker run Signed-off-by: Victoria Bialas * escaped second dash in --volume long version of command Signed-off-by: Victoria Bialas * fixed double dashes on volume option to render properly Signed-off-by: Victoria Bialas --- _data/toc.yaml | 2 + docker-for-mac/osxfs-caching.md | 226 ++++++++++++++++++++++++++++++ docker-for-mac/osxfs.md | 127 ++++++++--------- engine/tutorials/dockervolumes.md | 15 ++ 4 files changed, 304 insertions(+), 66 deletions(-) create mode 100644 docker-for-mac/osxfs-caching.md diff --git a/_data/toc.yaml b/_data/toc.yaml index a97b292efe..32c200f04b 100644 --- a/_data/toc.yaml +++ b/_data/toc.yaml @@ -2208,6 +2208,8 @@ manuals: title: Networking - path: /docker-for-mac/osxfs/ title: File system sharing + - path: /docker-for-mac/osxfs-caching/ + title: Performance tuning for volume mounts (shared filesystems) - path: /docker-for-mac/troubleshoot/ title: Logs and troubleshooting - path: /docker-for-mac/faqs/ diff --git a/docker-for-mac/osxfs-caching.md b/docker-for-mac/osxfs-caching.md new file mode 100644 index 0000000000..2eef9e49d3 --- /dev/null +++ b/docker-for-mac/osxfs-caching.md @@ -0,0 +1,226 @@ +--- +description: Osxfs caching +keywords: mac, osxfs, volume mounts, docker run -v, performance +title: Performance tuning for volume mounts (shared filesystems) +toc_max: 4 +toc_min: 2 +--- + +[Docker 17.04 CE +Edge](https://docs.docker.com/edge/#docker-ce-edge-new-features) adds support +for two new flags to the [docker run `-v`, +`--volume`](https://docs.docker.com/engine/reference/run/#volume-shared-filesystems) +option, `cached` and `delegated`, that can significantly improve the performance +of mounted volume access on Docker for Mac. These options begin to solve some of +the challenges discussed in [Performance issues, solutions, and +roadmap](/docker-for-mac/osxfs.md#performance-issues-solutions-and-roadmap). 
+ +> **Tip:** Release notes for Docker CE Edge 17.04 are [here](https://github.com/moby/moby/releases/tag/v17.04.0-ce), and the associated pull request for the additional `docker run -v` flags is [here](https://github.com/moby/moby/pull/31047). + +## Performance implications of host-container file system consistency + +With Docker distributions now available for an increasing number of +platforms, including macOS and Windows, generalizing mount semantics +during container run is a necessity to enable workload optimizations. + +The current implementations of mounts on Linux provide a consistent +view of a host directory tree inside a container: reads and writes +performed either on the host or in the container are immediately +reflected in the other environment, and file system events (`inotify`, +`FSEvents`) are consistently propagated in both directions. + +On Linux, these guarantees carry no overhead, since the underlying VFS is +shared directly between host and container. However, on macOS (and +other non-Linux platforms) there are significant overheads to +guaranteeing perfect consistency, since messages describing file system +actions must be passed synchronously between container and host. The +current implementation is sufficiently efficient for most tasks, but +with certain types of workloads the overhead of maintaining perfect +consistency can result in significantly worse performance than a +native (non-Docker) environment. For example, + + * running `go list ./...` in the bind-mounted `docker/docker` source tree + takes around 26 seconds + + * writing 100MB in 1k blocks into a bind-mounted directory takes + around 23 seconds + + * running `ember build` on a freshly created (i.e. empty) application + involves around 70000 sequential syscalls, each of which translates + into a request and response passed between container and host. 
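The "100MB in 1k blocks" figure above can be approximated with a small `dd` run. This is a minimal sketch and not the benchmark the authors used. `write_blocks` and the `ddtest` file name are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the osxfs docs): write COUNT blocks of
# 1 KiB each into DIR/ddtest, mimicking the "1k blocks" write workload.
write_blocks() {
  dir=$1
  count=$2
  dd if=/dev/zero of="$dir/ddtest" bs=1024 count="$count" 2>/dev/null
}

# 102400 blocks x 1 KiB = 100 MiB. Inside a container, point DIR at a
# bind-mounted path (for example /project) and wrap the call in `time`,
# then repeat against a non-mounted directory such as /tmp to compare:
# write_blocks /project 102400
```

Timing the same call against a bind-mounted and a non-mounted directory gives a rough view of the overhead of the mount. Absolute numbers will vary with hardware and Docker version.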
+ +Optimizations to reduce latency throughout the stack have brought +significant improvements to these workloads, and a few further +optimization opportunities remain. However, even when latency is +minimized, the constraints of maintaining consistency mean that these +workloads remain unacceptably slow for some use cases. + +## Tuning with consistent, cached, and delegated configurations + +**_Fortunately, in many cases where the performance degradation is most +severe, perfect consistency between container and host is unnecessary._** +In particular, in many cases there is no need for writes performed in a +container to be immediately reflected on the host. For example, while +interactive development requires that writes to a bind-mounted directory +on the host immediately generate file system events within a container, +there is no need for writes to build artifacts within the container to +be immediately reflected on the host file system. Distinguishing between +these two cases makes it possible to significantly improve performance. + +There are three broad scenarios to consider, and you can choose the level of consistency that each one needs. In each case, the container +has an internally consistent view of bind-mounted directories, but in +two cases temporary discrepancies are allowed between container and host. + + * `consistent`: perfect consistency + (host and container have an identical view of the mount at all times) + + * `cached`: the host's view is authoritative + (permit delays before updates on the host appear in the container) + + * `delegated`: the container's view is authoritative + (permit delays before updates in the container appear on the host) + +## Examples + +Each of these configurations (`consistent`, `cached`, `delegated`) can be specified as a suffix to the [`-v`](https://docs.docker.com/engine/reference/run/#volume-shared-filesystems) +option of [`docker run`](https://docs.docker.com/engine/reference/run.md). 
+For example, to bind-mount `/Users/yallop/project` in a container under +the path `/project`, you might run the following command: + +``` +docker run -v /Users/yallop/project:/project:cached alpine command +``` + +The caching configuration can be varied independently for each bind mount, +so you can mount each directory in a different mode: + +``` +docker run -v /Users/yallop/project:/project:cached \ + -v /host/another-path:/mount/another-point:consistent \ + alpine command +``` + +## Semantics + +The semantics of each configuration are described as a set of guarantees +relating to the observable effects of file system operations. In this +specification, "host" refers to the file system of the user's Docker +client. + +### delegated + +The `delegated` configuration provides the weakest set of guarantees. +For directories mounted with `delegated` the container's view of the +file system is authoritative, and writes performed by containers may not +be immediately reflected on the host file system. As with (e.g.) NFS +asynchronous mode, if a running container with a `delegated` bind mount +crashes, then writes may be lost. + +However, by relinquishing consistency, `delegated` mounts can offer +significantly better performance than the other configurations. Where +the data written is ephemeral or readily reproducible (e.g. scratch +space or build artifacts) `delegated` may be optimal for a user's +workload. + +A `delegated` mount offers the following guarantees, which are presented +as constraints on the container run-time: + +1. If the implementation offers file system events, the container state +as it relates to a specific event **_must_** reflect the host file system +state at the time the event was generated if no container modifications +pertain to related file system state. + +2. If flush or sync operations are performed, relevant data **_must_** be +written back to the host file system. 
Between flush or sync +operations containers **_may_** cache data written, metadata modifications, +and directory structure changes. + +3. All containers hosted by the same runtime **_must_** share a consistent +cache of the mount. + +4. When any container sharing a `delegated` mount terminates, changes +to the mount **_must_** be written back to the host file system. If this +writeback fails, the container's execution **_must_** fail via exit code +and/or Docker event channels. + +5. If a `delegated` mount is shared with a `cached` or a `consistent` +mount, those portions that overlap **_must_** obey `cached` or `consistent` +mount semantics, respectively. + + Besides these constraints, the `delegated` configuration offers the +container runtime a degree of flexibility: + +6. Containers **_may_** retain file data and metadata (including directory +structure, existence of nodes, etc) indefinitely and this cache **_may_** +desynchronize from the file system state of the host. Implementors are +encouraged to expire caches when host file system changes occur but, +due to platform limitations, may be unable to do this in any specific +timeframe. + +7. If changes to the mount source directory are present on the host +file system, those changes **_may_** be lost when the `delegated` mount +synchronizes with the host source directory. + +8. Behaviors 6-7 **do not** apply to the file types of socket, pipe, or device. + +### cached + +The `cached` configuration provides all the guarantees of the `delegated` +configuration, and some additional guarantees around the visibility of writes +performed by containers. As such, `cached` typically improves the performance +of read-heavy workloads, at the cost of some temporary inconsistency between the +host and the container. 
+ +For directories mounted with `cached`, the host's view of +the file system is authoritative; writes performed by containers are immediately +visible to the host, but there may be a delay before writes performed on the +host are visible within containers. + +>**Tip:** To learn more about `cached`, see the article on +[User-guided caching in Docker for Mac](https://blog.docker.com/2017/05/user-guided-caching-in-docker-for-mac/). + +1. Implementations **_must_** obey `delegated` Semantics 1-5. + +2. If the implementation offers file system events, the container state +as it relates to a specific event **_must_** reflect the host file system +state at the time the event was generated. + +3. Container mounts **_must_** perform metadata modifications, directory +structure changes, and data writes consistently with the host file +system, and **_must not_** cache data written, metadata modifications, or +directory structure changes. + +4. If a `cached` mount is shared with a `consistent` mount, those portions +that overlap **_must_** obey `consistent` mount semantics. + + Some of the flexibility of the `delegated` configuration is retained, +namely: + +5. Implementations **_may_** permit `delegated` Semantics 6. + +### consistent + +The `consistent` configuration places the most severe restrictions on +the container run-time. For directories mounted with `consistent` the +container and host views are always synchronized: writes performed +within the container are immediately visible on the host, and writes +performed on the host are immediately visible within the container. + +The `consistent` configuration most closely reflects the behavior of +bind mounts on Linux. However, the overheads of providing strong +consistency guarantees make it unsuitable for a few use cases, where +performance is a priority and maintaining perfect consistency has low +priority. + +1. Implementations **_must_** obey `cached` Semantics 1-4. + +2. 
Container mounts **_must_** reflect metadata modifications, directory +structure changes, and data writes on the host file system immediately. + +### default + +The `default` configuration is identical to the `consistent` +configuration except for its name. Crucially, this means that `cached` +Semantics 4 and `delegated` Semantics 5 that require strengthening +overlapping directories do not apply to `default` mounts. This is the +default configuration if no `state` flags are supplied. diff --git a/docker-for-mac/osxfs.md b/docker-for-mac/osxfs.md index 0dac0a5f04..e641163dba 100644 --- a/docker-for-mac/osxfs.md +++ b/docker-for-mac/osxfs.md @@ -1,5 +1,5 @@ --- -description: OSXFS +description: Osxfs keywords: mac, osxfs redirect_from: - /mackit/osxfs/ @@ -27,8 +27,8 @@ Mac software dubiously relies on case-insensitivity to function. ### Access control `osxfs`, and therefore Docker, can access only those file system resources that -the Docker for Mac user has access to. `osxfs` does not run as `root`. If the OS -X user is an administrator, `osxfs` inherits those administrator privileges. We +the Docker for Mac user has access to. `osxfs` does not run as `root`. If the macOS +user is an administrator, `osxfs` inherits those administrator privileges. We are still evaluating which privileges to drop in the file system process to balance security and ease-of-use. `osxfs` performs no additional permissions checks and enforces no extra access control on accesses made through it. All @@ -69,6 +69,8 @@ VM, an attempt to bind mount it will fail rather than create it in the VM. Paths that already exist in the VM and contain files are reserved by Docker and cannot be exported from macOS. +>Please see **[Performance tuning for volume mounts (shared filesystems)](/docker-for-mac/osxfs-caching.md)** to learn about new configuration options available with the Docker 17.04 CE Edge release. 
+ ### Ownership Initially, any containerized process that requests ownership metadata of @@ -149,17 +151,18 @@ between macOS userspace processes and the macOS kernel. ### Performance issues, solutions, and roadmap +>Please see **[Performance tuning for volume mounts (shared filesystems)](/docker-for-mac/osxfs-caching.md)** to learn about new configuration options available with the Docker 17.04 CE Edge release. + With regard to reported performance issues ([GitHub issue 77: File access in mounted volumes extremely slow](https://github.com/docker/for-mac/issues/77)), and a similar thread on [Docker for Mac forums on topic: File access in mounted volumes extremely slow](https://forums.docker.com/t/file-access-in-mounted-volumes-extremely-slow-cpu-bound/), -this topic provides an explanation of the issues, what we are doing to -address them, how the community can help us, and what you can expect in the -future. This explanation is a slightly re-worked version of an [understanding -performance -post](https://forums.docker.com/t/file-access-in-mounted-volumes-extremely-slow-cpu-bound/8076/158?u=orangesnap) -from David Sheets (@dsheets) on the [Docker development +this topic provides an explanation of the issues, recent progress in addressing +them, how the community can help us, and what you can expect in the +future. This explanation derives from a [post about understanding +performance](https://forums.docker.com/t/file-access-in-mounted-volumes-extremely-slow-cpu-bound/8076/158?u=orangesnap) +by David Sheets (@dsheets) on the [Docker development team](https://forums.docker.com/groups/Docker) to the forum topic just mentioned. We want to surface it in the documentation for wider reach. @@ -172,7 +175,7 @@ file system server in Docker for Mac. File system APIs are very wide (20-40 message types) with many intricate semantics involving on-disk state, in-memory cache state, and concurrent access by multiple processes. 
Additionally, `osxfs` integrates a mapping between macOS's FSEvents API and Linux's `inotify` API -which is implemented inside of the file system itself complicating matters +which is implemented inside of the file system itself, complicating matters further (cache behavior in particular). At the highest level, there are two dimensions to file system performance: @@ -186,65 +189,64 @@ Latency is the time it takes for a file system call to complete. For instance, the time between a thread issuing write in a container and resuming with the number of bytes written. With a classical block-based file system, this latency is typically under 10μs (microseconds). With `osxfs`, latency is presently -around 200μs for most operations or 20x slower. For workloads which demand many -sequential roundtrips, this results in significant observable slowdown. To -reduce the latency, we need to shorten the data path from a Linux system call to +around 130μs for most operations or 13× slower. For workloads which demand many +sequential roundtrips, this results in significant observable slowdown. +Reducing the latency requires shortening the data path from a Linux system call to macOS and back again. This requires tuning each component in the data path in turn -- some of which require significant engineering effort. Even if we achieve -a huge latency reduction of 100μs/roundtrip, we will still "only" see a doubling +a huge latency reduction of 65μs/roundtrip, we will still "only" see a doubling of performance. This is typical of performance engineering, which requires significant effort to analyze slowdowns and develop optimized components. We -know how we can likely halve the roundtrip time but we haven't implemented those -improvements yet (more on this below in +know a number of approaches that will probably reduce the roundtrip time but we +haven't implemented all those improvements yet (more on this below in [What you can do](osxfs.md#what-you-can-do)). 
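To make the latency figures concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every operation costs exactly one roundtrip at the quoted latency. The count of 70000 sequential calls is the `ember build` figure from the caching document, and `roundtrip_cost_ms` is a hypothetical helper, not part of any Docker tooling.

```shell
#!/bin/sh
# Hypothetical helper: total time in milliseconds spent on latency alone
# for a number of sequential roundtrips.
# Usage: roundtrip_cost_ms <roundtrips> <latency_in_microseconds>
roundtrip_cost_ms() {
  echo $(( $1 * $2 / 1000 ))
}

roundtrip_cost_ms 70000 130   # osxfs at ~130μs per roundtrip: prints 9100
roundtrip_cost_ms 70000 65    # after a 65μs reduction:        prints 4550
roundtrip_cost_ms 70000 10    # classical block-based FS:      prints 700
```

Under these assumptions, halving the per-roundtrip latency halves the total (about 9.1 s down to about 4.6 s), which is the "doubling of performance" described above. It still leaves the workload several times slower than the roughly 0.7 s of the block-based case.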
-There is hope for significant performance improvement in the near term despite -these fundamental communication channel properties, which are difficult to -overcome (latency in particular). This hope comes in the form of increased -caching (storing "recent" values closer to their use to prevent roundtrips -completely). The Linux kernel's VFS layer contains a number of caches which can -be used to greatly improve performance by reducing the required communication -with the file system. Using this caching comes with a number of trade-offs: +A second approach to improving performance is to reduce the number of +roundtrips by caching data. Recent versions of Docker for Mac (17.04 onwards) +include caching support that brings significant (2-4×) improvements to many +applications. Much of the overhead of osxfs arises from the requirement to +keep the container's and the host's view of the file system consistent, but +full consistency is not necessary for all applications and relaxing the +constraint opens up a number of opportunities for improved performance. -* It requires understanding the cache behavior in detail in order to write -correct, stateful functionality on top of those caches. - -* It harms the coherence or consistency of the file system as observed -from Linux containers and the macOS file system interfaces. +At present there is support for read caching, with which the container's view +of the file system can temporarily drift apart from the authoritative view on +the host. Further caching developments, including support for write caching, +are planned. +A [detailed description of the behavior in various caching configurations](osxfs-caching) +is available. #### What we are doing -We are actively working on both increasing caching while mitigating the -associated issues and on reducing the file system data path latency. 
This -requires significant analysis of file system traces and speculative development -of system improvements to try to address specific performance issues. Perhaps -surprisingly, application workload can have a huge effect on performance. As an -example, here are two different use cases contributed on the [forum topic](https://forums.docker.com/t/file-access-in-mounted-volumes-extremely-slow-cpu-bound/) +We continue to actively work on increasing caching and on reducing the +file system data path latency. This requires significant analysis of file +system traces and speculative development of system improvements to try to +address specific performance issues. Perhaps surprisingly, application +workload can have a huge effect on performance. As an example, here are two +different use cases contributed on the +[forum topic](https://forums.docker.com/t/file-access-in-mounted-volumes-extremely-slow-cpu-bound/) and how their performance differs and suffers due to latency, caching, and coherence: 1. A rake example (see below) appears to attempt to access 37000+ -different files that don't exist on the shared volume. We can work very hard to -speed up all use cases by 2x via latency reduction but this use case will still -seem "slow". The ultimate solution for rake is to use a "negative dcache" that +different files that don't exist on the shared volume. Even with a 2× speedup +via latency reduction this use case will still seem "slow". +With caching enabled the performance increases around 3.5×, as described in +the [user-guided caching post](link-TODO). +We expect to see further performance improvements for rake with a "negative dcache" that keeps track of, in the Linux kernel itself, the files that do not exist. -Unfortunately, even this is not sufficient for the first time rake is run on a +However, even this is not sufficient for the first time rake is run on a shared directory. 
To handle that case, we actually need to develop a Linux kernel patch which negatively caches all directory entries not in a -specified set -- and this cache must be kept up-to-date in real-time with the OS -X file system state even in the presence of missing macOS FSEvents messages and +specified set -- and this cache must be kept up-to-date in real-time with the macOS +file system state even in the presence of missing macOS FSEvents messages and so must be invalidated if macOS ever reports an event delivery failure. -2. Running ember build in a shared file system results in ember creating many +2. Running `ember build` in a shared file system results in ember creating many different temporary directories and performing lots of intermediate activity within them. An empty ember project is over 300MB. This usage pattern does not -require coherence between Linux and macOS but, because we cannot distinguish this -fact at run-time, we maintain coherence during its hundreds of thousands of file -system accesses to manipulate temporary state. There is no "correct" solution in -this case. Either ember needs to change, the volume mount needs to have -coherence properties specified on it somehow, some heuristic needs to be -introduced to detect this access pattern and compensate, or the behavior needs -to be indicated via, e.g., extended attributes in the macOS file system. +require coherence between Linux and macOS, and will be significantly improved by +write caching. These two examples come from performance use cases contributed by users and they are incredibly helpful in prioritizing aspects of file system performance to @@ -254,24 +256,17 @@ work on next. Under development, we have: -1. A Linux kernel patch to reduce data path latency by 2/7 copies and 2/5 -context switches - -2. Increased macOS integration to reduce the latency between the hypervisor and -the file system server - -3. A server-side directory read cache to speed up traversal of large directories - -4. 
User-facing file system tracing capabilities so that you can send us -recordings of slow workloads for analysis - -5. A growing performance test suite of real world use cases (more on this below +1. A growing performance test suite of real world use cases (more on this below in What you can do) -6. Experimental support for using Linux's inode, writeback, and page caches +2. Further caching improvements, including negative, structural, and write-back +caching, and lazy cache invalidation. -7. End-user controls to configure the coherence of subsets of cross-OS bind -mounts without exposing all of the underlying complexity +3. A Linux kernel patch to reduce data path latency by 2/7 copies and 2/5 +context switches + +4. Increased macOS integration to reduce the latency between the hypervisor and +the file system server #### What you can do @@ -310,16 +305,16 @@ can be easily tracked. #### What you can expect We will continue to work toward an optimized shared file system implementation -on the Beta channel of Docker for Mac. +on the Edge channel of Docker for Mac. You can expect some of the performance improvement work mentioned above to reach -the Beta channel in the coming release cycles. +the Edge channel in the coming release cycles. In due course, we will open source all of our shared file system components. At that time, we would be very happy to collaborate with you on improving the implementation of `osxfs` and related software. -We still have on the slate to write up and publish details of shared file system +We also plan to write up and publish further details of shared file system performance analysis and improvement on the Docker blog. Look for or nudge @dsheets about those articles, which should serve as a jumping off point for understanding the system, measuring it, or contributing to it. 
diff --git a/engine/tutorials/dockervolumes.md b/engine/tutorials/dockervolumes.md index c2c77bee09..1a553cec9a 100644 --- a/engine/tutorials/dockervolumes.md +++ b/engine/tutorials/dockervolumes.md @@ -151,6 +151,21 @@ $ docker run -d -P --name web -v /src/webapp:/webapp:ro training/webapp python a Here you've mounted the same `/src/webapp` directory but you've added the `ro` option to specify that the mount should be read-only. +You can also relax the consistency requirements of a mounted directory +to improve performance by adding the `cached` option: + +```bash +$ docker run -d -P --name web -v /src/webapp:/webapp:cached training/webapp python app.py +``` + +The `cached` option typically improves the performance of read-heavy workloads +on Docker for Mac, at the cost of some temporary inconsistency between the host +and the container. On other platforms, `cached` currently has no effect. The +article [User-guided caching in Docker for +Mac](https://blog.docker.com/2017/05/user-guided-caching-in-docker-for-mac/) +gives more details about the behavior of `cached` on macOS. + + >**Note**: The host directory is, by its nature, host-dependent. For this >reason, you can't mount a host directory from `Dockerfile`, the `VOLUME` instruction does not support passing a `host-dir`, because built images