WIP: TLS Docs for Swarm

Struct edit pass thru conceptual material
Updating with comments from Mike
Tweaking menu layout
Updating for Nigel
Updating with local images, formatting fixes
Updating with the comments from review

Signed-off-by: Mary Anthony <mary@docker.com>
Mary Anthony 2016-01-27 13:17:35 -08:00
parent 342611313e
commit f93d787e3b
24 changed files with 1887 additions and 1 deletions

594
docs/configure-tls.md Normal file

@ -0,0 +1,594 @@
<!--[metadata]>
+++
title = "Configure Docker Swarm for TLS"
description = "Swarm and transport layer security"
keywords = ["docker, swarm, TLS, discovery, security, certificates"]
[menu.main]
parent="workw_swarm"
weight=55
+++
<![end-metadata]-->
# Configure Docker Swarm for TLS
In this procedure you create a two-node Swarm cluster, a Docker Engine CLI, a
Swarm Manager, and a Certificate Authority as shown below. All the Docker Engine
hosts (`client`, `swarm`, `node1`, and `node2`) have a copy of the
CA's certificate as well as their own key-pair signed by the CA.
![](images/tls-1.jpg)
You will complete the following steps in this procedure:
- [Step 1: Set up the prerequisites](#step-1-set-up-the-prerequisites)
- [Step 2: Create a Certificate Authority (CA) server](#step-2-create-a-certificate-authority-ca-server)
- [Step 3: Create and sign keys](#step-3-create-and-sign-keys)
- [Step 4: Install the keys](#step-4-install-the-keys)
- [Step 5: Configure the Engine daemon for TLS](#step-5-configure-the-engine-daemon-for-tls)
- [Step 6: Create a Swarm cluster](#step-6-create-a-swarm-cluster)
- [Step 7: Create the Swarm Manager using TLS](#step-7-create-the-swarm-manager-using-tls)
- [Step 8: Test the Swarm manager configuration](#step-8-test-the-swarm-manager-configuration)
- [Step 9: Configure the Engine CLI to use TLS](#step-9-configure-the-engine-cli-to-use-tls)
### Before you begin
The article includes steps to create your own CA using OpenSSL. This is similar
to operating your own internal corporate CA and PKI. However, it must **not**
be used as a guide to building a production-worthy internal CA and PKI. These
steps are included for demonstration purposes only, so that readers without
access to an existing CA and set of certificates can follow along and configure
Docker Swarm to use TLS.
## Step 1: Set up the prerequisites
To complete this procedure you must stand up five Linux servers. These servers
can be any mix of physical and virtual servers; they may be on premises or in
the public cloud. The following table lists each server name and its purpose.
| Server name | Description |
|-------------|------------------------------------------------|
| `ca` | Acts as the Certificate Authority (CA) server. |
| `swarm` | Acts as the Swarm Manager. |
| `node1`     | Acts as a Swarm node.                           |
| `node2`     | Acts as a Swarm node.                           |
| `client`    | Acts as a remote Docker Engine client.          |
Make sure that you have SSH access to all 5 servers and that they can communicate with each other using DNS name resolution. In particular:
- Open TCP port 2376 between the Swarm Manager and Swarm nodes
- Open TCP port 3376 between the Docker Engine client and the Swarm Manager
You can choose different ports if these are already in use; however, this
example assumes you use the ports above.
Each server must run an operating system compatible with Docker Engine. For
simplicity, the steps that follow assume all servers are running Ubuntu 14.04
LTS.
## Step 2: Create a Certificate Authority (CA) server
>**Note**: If you already have access to a CA and certificates, and are comfortable working with them, you should skip this step and go to the next.
In this step, you configure a Linux server as a CA. You use this CA to create
and sign keys. This step is included so that readers without access to an
existing CA (external or corporate) and certificates can follow along and
complete the later steps that require installing and using certificates. It is
**not** intended as a model for how to deploy a production-worthy CA.
1. Log on to the terminal of your CA server and elevate to root.
$ sudo su
2. Create a private key called `ca-priv-key.pem` for the CA:
$ sudo openssl genrsa -out ca-priv-key.pem 2048
Generating RSA private key, 2048 bit long modulus
...........................................................+++
.....+++
e is 65537 (0x10001)
3. Create a public key called `ca.pem` for the CA.
The public key is based on the private key created in the previous step.
$ sudo openssl req -config /usr/lib/ssl/openssl.cnf -new -key ca-priv-key.pem -x509 -days 1825 -out ca.pem
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
<output truncated>
You have now configured a CA server with a public and private keypair. You can inspect the contents of each key. To inspect the private key:
```
$ sudo openssl rsa -in ca-priv-key.pem -noout -text
```
To inspect the public key (cert):
```
$ sudo openssl x509 -in ca.pem -noout -text
```
The following command shows the partial contents of the CA's public key.
$ sudo openssl x509 -in ca.pem -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 17432010264024107661 (0xf1eaf0f9f41eca8d)
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, ST=CA, L=Sanfrancisco, O=Docker Inc
Validity
Not Before: Jan 16 18:28:12 2016 GMT
Not After : Jan 13 18:28:12 2026 GMT
Subject: C=US, ST=CA, L=San Francisco, O=Docker Inc
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
00:d1:fe:6e:55:d4:93:fc:c9:8a:04:07:2d:ba:f0:
55:97:c5:2c:f5:d7:1d:6a:9b:f0:f0:55:6c:5d:90:
<output truncated>
Later, you'll use this certificate to sign keys for the other servers in your
infrastructure.
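If you want to confirm that the two files form a matching key pair, one quick
check is to compare the modulus of each (a sketch using standard OpenSSL
commands; the file names are the ones created above):
```
$ sudo openssl rsa -in ca-priv-key.pem -noout -modulus | openssl md5
$ sudo openssl x509 -in ca.pem -noout -modulus | openssl md5
```
The two digests should be identical if the private key and certificate belong
together.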
## Step 3: Create and sign keys
Now that you have a working CA, you need to create key pairs for the Swarm
Manager, Swarm nodes, and remote Docker Engine client. The commands and process
to create key pairs are identical for all servers. You'll create the following keys:
<table>
<tr>
<th>File</th>
<th>Description</th>
</tr>
<tr>
<td><code>ca-priv-key.pem</code></td>
<td>The CA's private key. It must be kept secure, as it is used later to sign new keys for the other nodes in the environment. Together with the <code>ca.pem</code> file, this makes up the CA's key pair.</td>
</tr>
<tr>
<td><code>ca.pem</code></td>
<td>The CA's public key (also called certificate). This is installed on all nodes in the environment so that all nodes trust certificates signed by the CA. Together with the <code>ca-priv-key.pem</code> file, this makes up the CA's key pair.</td>
</tr>
<tr>
<td><code><i>node</i>.csr</code></td>
<td>A certificate signing request (CSR). A CSR is effectively an application to the CA for a new certificate for a particular node. The CA takes the information provided in the CSR and uses it to generate and sign the node's certificate.</td>
</tr>
<tr>
<td><code><i>node</i>-priv-key.pem</code></td>
<td>The node's private key. The node uses this key to authenticate itself with remote Docker Engines. Together with the <code><i>node</i>-cert.pem</code> file, this makes up a node's key pair.</td>
</tr>
<tr>
<td><code><i>node</i>-cert.pem</code></td>
<td>A certificate signed by the CA. The node presents this certificate to remote Docker Engines to identify itself. Together with the <code><i>node</i>-priv-key.pem</code> file, this makes up a node's key pair.</td>
</tr>
</table>
The commands below show how to create keys for all of your nodes. You perform this procedure in a working directory located on your CA server.
1. Log on to the terminal of your CA server and elevate to root.
$ sudo su
2. Create a private key `swarm-priv-key.pem` for your Swarm Manager.
$ sudo openssl genrsa -out swarm-priv-key.pem 2048
Generating RSA private key, 2048 bit long modulus
............................................................+++
........+++
e is 65537 (0x10001)
3. Generate a certificate signing request (CSR) `swarm.csr` using the private key you created in the previous step.
$ sudo openssl req -subj "/CN=swarm" -new -key swarm-priv-key.pem -out swarm.csr
Remember, this is only for demonstration purposes. The process to create a
CSR will be slightly different in real-world production environments.
4. Create the certificate `swarm-cert.pem` based on the CSR created in the previous step.
$ sudo openssl x509 -req -days 1825 -in swarm.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out swarm-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
<snip>
$ sudo openssl rsa -in swarm-priv-key.pem -out swarm-priv-key.pem
You now have a keypair for the Swarm Manager.
5. Repeat the steps above for the remaining nodes in your infrastructure (`node1`, `node2`, and `client`).
Remember to replace the `swarm`-specific values with the values relevant to the node you are creating the key pair for (a scripted version of this repetition appears at the end of this section).
<table>
<tr>
<th>Server name</th>
<th>Private key</th>
<th>CSR</th>
<th>Certificate</th>
</tr>
<tr>
<td><code>node1 </code></td>
<td><code>node1-priv-key.pem</code></td>
<td><code>node1.csr</code></td>
<td><code>node1-cert.pem</code></td>
</tr>
<tr>
<td><code>node2</code></td>
<td><code>node2-priv-key.pem</code></td>
<td><code>node2.csr</code></td>
<td><code>node2-cert.pem</code></td>
</tr>
<tr>
<td><code>client</code></td>
<td><code>client-priv-key.pem</code></td>
<td><code>client.csr</code></td>
<td><code>client-cert.pem</code></td>
</tr>
</table>
6. Verify that your working directory contains the following files:
# ls -l
total 64
-rw-r--r-- 1 root root 1679 Jan 16 18:27 ca-priv-key.pem
-rw-r--r-- 1 root root 1229 Jan 16 18:28 ca.pem
-rw-r--r-- 1 root root 17 Jan 18 09:56 ca.srl
-rw-r--r-- 1 root root 1086 Jan 18 09:56 client-cert.pem
-rw-r--r-- 1 root root 887 Jan 18 09:55 client.csr
-rw-r--r-- 1 root root 1679 Jan 18 09:56 client-priv-key.pem
-rw-r--r-- 1 root root 1082 Jan 18 09:44 node1-cert.pem
-rw-r--r-- 1 root root 887 Jan 18 09:43 node1.csr
-rw-r--r-- 1 root root 1675 Jan 18 09:44 node1-priv-key.pem
-rw-r--r-- 1 root root 1082 Jan 18 09:49 node2-cert.pem
-rw-r--r-- 1 root root 887 Jan 18 09:49 node2.csr
-rw-r--r-- 1 root root 1675 Jan 18 09:49 node2-priv-key.pem
-rw-r--r-- 1 root root 1082 Jan 18 09:42 swarm-cert.pem
-rw-r--r-- 1 root root 887 Jan 18 09:41 swarm.csr
-rw-r--r-- 1 root root 1679 Jan 18 09:42 swarm-priv-key.pem
You can inspect the contents of each of the keys. To inspect a private key:
```
openssl rsa -in <key-name> -noout -text
```
To inspect a public key (cert):
```
openssl x509 -in <key-name> -noout -text
```
The following command shows the partial contents of the Swarm Manager's public
key (`swarm-cert.pem`).
```
$ sudo openssl x509 -in swarm-cert.pem -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 9590646456311914051 (0x8518d2237ad49e43)
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, ST=CA, L=Sanfrancisco, O=Docker Inc
Validity
Not Before: Jan 18 09:42:16 2016 GMT
Not After : Jan 15 09:42:16 2026 GMT
Subject: CN=swarm
<output truncated>
```
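If you prefer to script the repetition described in step 5 above, a loop like
the following produces the same files (a sketch only; run it from the same
working directory on the CA server):
```
$ for node in node1 node2 client; do
    sudo openssl genrsa -out ${node}-priv-key.pem 2048
    sudo openssl req -subj "/CN=${node}" -new -key ${node}-priv-key.pem -out ${node}.csr
    sudo openssl x509 -req -days 1825 -in ${node}.csr -CA ca.pem -CAkey ca-priv-key.pem \
      -CAcreateserial -out ${node}-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
  done
```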
## Step 4: Install the keys
In this step, you install the keys on the relevant servers in the
infrastructure. Each server needs three files:
- A copy of the Certificate Authority's public key (`ca.pem`)
- Its own private key
- Its own public key (cert)
The procedure below shows you how to copy these files from the CA server to each
server using `scp`. As part of the copy procedure, you'll rename each file as
follows on each node:
| Original name | Copied name |
|-------------------------|-------------|
| `ca.pem` | `ca.pem` |
| `<server>-cert.pem` | `cert.pem` |
| `<server>-priv-key.pem` | `key.pem` |
1. Log on to the terminal of your CA server and elevate to root.
$ sudo su
2. Create a `~/.certs` directory on the Swarm manager.
$ ssh ubuntu@swarm 'mkdir -p /home/ubuntu/.certs'
3. Copy the keys from the CA to the Swarm Manager server.
$ scp ./ca.pem ubuntu@swarm:/home/ubuntu/.certs/ca.pem
$ scp ./swarm-cert.pem ubuntu@swarm:/home/ubuntu/.certs/cert.pem
$ scp ./swarm-priv-key.pem ubuntu@swarm:/home/ubuntu/.certs/key.pem
>**Note**: You may need to provide authentication for the `scp` commands to work. For example, AWS EC2 instances use certificate-based authentication. To copy the files to an EC2 instance associated with a public key called `nigel.pem`, modify the `scp` command as follows: `scp -i /path/to/nigel.pem ./ca.pem ubuntu@swarm:/home/ubuntu/.certs/ca.pem`.
4. Repeat steps 2 and 3 for each remaining server in the infrastructure (a scripted version of this repetition appears after these steps).
* `node1`
* `node2`
* `client`
5. Verify your work.
When the copying is complete, each machine should have the following keys.
![](images/tls-2.jpeg)
Each node in your infrastructure should have the following files in the
`/home/ubuntu/.certs/` directory:
# ls -l /home/ubuntu/.certs/
total 16
-rw-r--r-- 1 ubuntu ubuntu 1229 Jan 18 10:03 ca.pem
-rw-r--r-- 1 ubuntu ubuntu 1082 Jan 18 10:06 cert.pem
-rw-r--r-- 1 ubuntu ubuntu 1679 Jan 18 10:06 key.pem
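If you would rather script the copy described in step 4, a loop like the
following does the same work (a sketch only; it assumes the same `ubuntu` user
and that SSH access to each server is already in place):
```
$ for server in node1 node2 client; do
    ssh ubuntu@${server} 'mkdir -p /home/ubuntu/.certs'
    scp ./ca.pem ubuntu@${server}:/home/ubuntu/.certs/ca.pem
    scp ./${server}-cert.pem ubuntu@${server}:/home/ubuntu/.certs/cert.pem
    scp ./${server}-priv-key.pem ubuntu@${server}:/home/ubuntu/.certs/key.pem
  done
```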
## Step 5: Configure the Engine daemon for TLS
In the last step, you created and installed the necessary keys on each of your
Swarm nodes. In this step, you configure the Docker Engine daemons on those
nodes to listen on the network and accept only connections that use TLS. Once
you complete this step, your Swarm nodes listen on TCP port 2376 and accept
only TLS connections.
On `node1` and `node2` (your Swarm nodes), do the following:
1. Open a terminal on `node1` and elevate to root.
$ sudo su
2. Edit the Docker Engine configuration file.
If you are following along with these instructions and using Ubuntu 14.04
LTS, the configuration file is `/etc/default/docker`. The Docker Engine
configuration file may be different depending on the Linux distribution you
are using.
3. Add the following options to the `DOCKER_OPTS` line (the complete resulting line is shown in the example after these steps).
-H tcp://0.0.0.0:2376 --tlsverify --tlscacert=/home/ubuntu/.certs/ca.pem --tlscert=/home/ubuntu/.certs/cert.pem --tlskey=/home/ubuntu/.certs/key.pem
4. Restart the Docker Engine daemon.
$ service docker restart
5. Repeat the procedure on `node2`.
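For reference, on Ubuntu 14.04 LTS the resulting line in `/etc/default/docker`
might look like the following (a sketch only; adjust the paths if you stored
your keys somewhere other than `/home/ubuntu/.certs`):
```
DOCKER_OPTS="-H tcp://0.0.0.0:2376 --tlsverify --tlscacert=/home/ubuntu/.certs/ca.pem --tlscert=/home/ubuntu/.certs/cert.pem --tlskey=/home/ubuntu/.certs/key.pem"
```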
## Step 6: Create a Swarm cluster
Next create a Swarm cluster. In this procedure you create a two-node Swarm
cluster using the default *hosted discovery* backend. The default hosted
discovery backend uses Docker Hub and is not recommended for production use.
1. Log on to the terminal of your Swarm manager node.
2. Create the cluster and export its unique ID to the `TOKEN` environment variable.
$ export TOKEN=$(sudo docker run --rm swarm create)
Unable to find image 'swarm:latest' locally
latest: Pulling from library/swarm
d681c900c6e3: Pulling fs layer
<snip>
986340ab62f0: Pull complete
a9975e2cc0a3: Pull complete
Digest: sha256:c21fd414b0488637b1f05f13a59b032a3f9da5d818d31da1a4ca98a84c0c781b
Status: Downloaded newer image for swarm:latest
3. Join `node1` to the cluster.
Be sure to specify TCP port `2376` and not `2375`.
$ sudo docker run -d swarm join --addr=node1:2376 token://$TOKEN
7bacc98536ed6b4200825ff6f4004940eb2cec891e1df71c6bbf20157c5f9761
4. Join `node2` to the cluster.
$ sudo docker run -d swarm join --addr=node2:2376 token://$TOKEN
db3f49d397bad957202e91f0679ff84f526e74d6c5bf1b6734d834f5edcbca6c
## Step 7: Create the Swarm Manager using TLS
To configure and run a containerized Swarm Manager process using TLS, you
need to create a custom Swarm image that contains the Swarm Manager's keys and
the CA's trusted public key.
1. Log on to the terminal of your Swarm manager node.
2. Create a build directory and change into it.
$ mkdir build && cd build
3. Copy the Swarm manager's keys into the build directory.
$ cp /home/ubuntu/.certs/{ca,cert,key}.pem /home/ubuntu/build
4. Create a new `Dockerfile` file with the following contents:
FROM swarm
COPY ca.pem /etc/tlsfiles/ca.pem
COPY cert.pem /etc/tlsfiles/cert.pem
COPY key.pem /etc/tlsfiles/key.pem
This Dockerfile builds a new image, based on the official `swarm` image, that
contains copies of the required keys.
5. Build a new image from the `Dockerfile`.
$ sudo docker build -t nigel/swarm-tls:latest .
6. Launch a new container from your new `swarm-tls:latest` image.
The container runs the `swarm manage` command:
$ docker run -d -p 3376:2376 nigel/swarm-tls manage --tlsverify --tlscacert=/etc/tlsfiles/ca.pem --tlscert=/etc/tlsfiles/cert.pem --tlskey=/etc/tlsfiles/key.pem --host=0.0.0.0:2376 token://$TOKEN
The command above launches a new container based on the `swarm-tls:latest`
image. It also maps port `3376` on the server to port `2376` inside the
container. This mapping ensures that Docker Engine commands sent to the host
on port `3376` are passed on to port `2376` inside the container. The
container runs the Swarm `manage` process with the `--tlsverify`,
`--tlscacert`, `--tlscert` and `--tlskey` options specified. These options
force TLS verification and specify the location of the Swarm manager's TLS
keys.
7. Run a `docker ps` command to verify that your Swarm manager container is up
and running.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
035dbf57b26e nigel/swarm-tls "/swarm manage --tlsv" 7 seconds ago Up 7 seconds 2375/tcp, 0.0.0.0:3376->2376/tcp compassionate_lovelace
Your Swarm cluster is now configured to use TLS.
## Step 8: Test the Swarm manager configuration
Now that you have a Swarm cluster built and configured to use TLS, you'll test that it works with a Docker Engine CLI.
1. Open a terminal onto your `client` server.
2. Issue the `docker version` command.
When issuing the command, you must pass it the location of the client's certificates.
$ sudo docker --tlsverify --tlscacert=/home/ubuntu/.certs/ca.pem --tlscert=/home/ubuntu/.certs/cert.pem --tlskey=/home/ubuntu/.certs/key.pem -H swarm:3376 version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:12:04 UTC 2015
OS/Arch: linux/amd64
Server:
Version: swarm/1.0.1
API version: 1.21
Go version: go1.5.2
Git commit: 744e3a3
Built:
OS/Arch: linux/amd64
The output above shows the `Server` version as "swarm/1.0.1". This means
that the command was successfully issued against the Swarm manager.
3. Verify that the same command does not work without TLS.
This time, do not pass your certs to the Swarm manager.
$ sudo docker -H swarm:3376 version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:12:04 UTC 2015
OS/Arch: linux/amd64
Get http://swarm:3376/v1.21/version: malformed HTTP response "\x15\x03\x01\x00\x02\x02".
* Are you trying to connect to a TLS-enabled daemon without TLS?
The output above shows that the command was rejected by the server. This is
because the server (Swarm manager) is configured to only accept connections
from authenticated clients using TLS.
## Step 9: Configure the Engine CLI to use TLS
You can configure the Engine so that you don't have to pass the TLS options when
you issue a command. To do this, you'll configure the Docker Engine host and
TLS settings as defaults on your Docker Engine client.
You do this by placing the client's keys in the `~/.docker` configuration folder. If you have other users on your system using the Engine command line, you'll need to configure their `~/.docker` folders as well. The procedure below shows how to do this for the `ubuntu` user on
your Docker Engine client.
1. Open a terminal onto your `client` server.
2. If it doesn't exist, create a `.docker` directory in the `ubuntu` user's home directory.
$ mkdir /home/ubuntu/.docker
3. Copy the Docker Engine client's keys from `/home/ubuntu/.certs` to
`/home/ubuntu/.docker`
$ cp /home/ubuntu/.certs/{ca,cert,key}.pem /home/ubuntu/.docker
4. Edit the account's `~/.bash_profile`.
5. Set the following variables:
<table>
<tr>
<th>Variable</th>
<th>Description</th>
</tr>
<tr>
<td><code>DOCKER_HOST</code></td>
<td>Sets the Docker host and TCP port to send all Engine commands to.</td>
</tr>
<tr>
<td><code>DOCKER_TLS_VERIFY</code></td>
<td>Tells the Engine to use TLS.</td>
</tr>
<tr>
<td><code>DOCKER_CERT_PATH</code></td>
<td>Specifies the location of TLS keys.</td>
</tr>
</table>
For example:
export DOCKER_HOST=tcp://swarm:3376
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH=/home/ubuntu/.docker/
6. Save and close the file.
7. Source the file to pick up the new variables.
$ source ~/.bash_profile
8. Verify that the procedure worked by issuing a `docker version` command.
$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:12:04 UTC 2015
OS/Arch: linux/amd64
Server:
Version: swarm/1.0.1
API version: 1.21
Go version: go1.5.2
Git commit: 744e3a3
Built:
OS/Arch: linux/amd64
The server portion of the output above shows that your Docker client is
issuing commands to the Swarm Manager and using TLS.
Congratulations! You have configured a Docker Swarm cluster to use TLS.
## Related Information
* [Secure Docker Swarm with TLS](secure-swarm-tls.md)
* [Docker security](https://docs.docker.com/engine/security/security/)

BIN docs/images/interlock.jpg Normal file
BIN docs/images/proxy-test.jpg Normal file
BIN docs/images/review-work.jpg Normal file
BIN docs/images/tls-1.jpg Normal file
BIN docs/images/tls-2.jpeg Normal file
(Additional binary image files added under docs/images/ are not shown.)

@ -5,6 +5,7 @@ description = "Swarm and container networks"
keywords = ["docker, swarm, clustering, networking"]
[menu.main]
parent="workw_swarm"
weight=3
+++
<![end-metadata]-->

353
docs/plan-for-production.md Normal file

@ -0,0 +1,353 @@
<!--[metadata]>
+++
title = "Plan for Swarm in production"
description = "Plan for Swarm in production"
keywords = ["docker, swarm, scale, voting, application, plan"]
[menu.main]
parent="workw_swarm"
weight=70
+++
<![end-metadata]-->
# Plan for Swarm in production
This article provides guidance to help you plan, deploy, and manage Docker
Swarm clusters in business critical production environments. The following high
level topics are covered:
- [Security](#security)
- [High Availability](#high-availability-ha)
- [Performance](#performance)
- [Cluster ownership](#ownership-of-swarm-clusters)
## Security
There are many aspects to securing a Docker Swarm cluster. This section covers:
- Authentication using TLS
- Network access control
These topics are not exhaustive. They form part of a wider security architecture
that includes: security patching, strong password policies, role based access
control, technologies such as SELinux and AppArmor, strict auditing, and more.
### Configure Swarm for TLS
All nodes in a Swarm cluster must bind their Docker Engine daemons to a network
port. This brings with it all of the usual network-related security
implications, such as man-in-the-middle attacks. These risks are compounded when
the network in question is untrusted, such as the internet. To mitigate these
risks, Swarm and the Engine support Transport Layer Security (TLS) for
authentication.
Engine daemons, including the Swarm manager, that are configured to use TLS
only accept commands from Docker Engine clients that sign their
communications. The Engine and Swarm support external third-party Certificate
Authorities (CAs) as well as internal corporate CAs.
The default Engine and Swarm ports for TLS are:
- Engine daemon: 2376/tcp
- Swarm manager: 3376/tcp
For more information on configuring Swarm for TLS, see
[Configure Docker Swarm for TLS](configure-tls.md).
### Network access control
Production networks are complex, and usually locked down so that only allowed
traffic can flow on the network. The list below shows the network ports that
the different components of a Swarm cluster listen on. You should use these to
configure your firewalls and other network access control lists (an example set
of firewall rules appears after this list).
- **Swarm manager.**
- **Inbound 80/tcp (HTTP)**. This allows `docker pull` commands to work. If you will be pulling from Docker Hub, you need to allow connections on port 80 from the internet.
- **Inbound 2375/tcp**. This allows Docker Engine CLI commands direct to the Engine daemon.
- **Inbound 3375/tcp**. This allows Engine CLI commands to the Swarm manager.
- **Inbound 22/tcp**. This allows remote management via SSH.
- **Service Discovery**:
- **Inbound 80/tcp (HTTP)**. This allows `docker pull` commands to work. If you will be pulling from Docker Hub, you need to allow connections on port 80 from the internet.
- **Inbound *Discovery service port***. This needs setting to the port that the backend discovery service listens on (consul, etcd, or zookeeper).
- **Inbound 22/tcp**. This allows remote management via SSH.
- **Swarm nodes**:
- **Inbound 80/tcp (HTTP)**. This allows `docker pull` commands to work. If you will be pulling from Docker Hub, you need to allow connections on port 80 from the internet.
- **Inbound 2375/tcp**. This allows Engine CLI commands direct to the Docker daemon.
- **Inbound 22/tcp**. This allows remote management via SSH.
- **Custom, cross-host container networks**:
- **Inbound 7946/tcp** Allows for discovering other container networks.
- **Inbound 7946/udp** Allows for discovering other container networks.
- **Inbound <store-port>/tcp** Network key-value store service port.
- **4789/udp** For the container overlay network.
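As an illustration only, the rules for a Swarm manager might look like the
following sketch using `ufw` on Ubuntu 14.04 (the tool and the decision to
allow traffic from any source are assumptions; apply your own firewall tooling
and source restrictions):
```
$ sudo ufw allow 80/tcp      # docker pull over HTTP
$ sudo ufw allow 2375/tcp    # Engine CLI to the Engine daemon (use 2376 with TLS)
$ sudo ufw allow 3375/tcp    # Engine CLI to the Swarm manager (use 3376 with TLS)
$ sudo ufw allow 22/tcp      # remote management over SSH
$ sudo ufw enable
```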
If your firewalls and other network devices are connection state aware, they
will allow responses to established TCP connections. If your devices are not
state aware, you will need to open up ephemeral ports from 32768-65535. For
added security you can configure the ephemeral port rules to only allow
connections from interfaces on known Swarm devices.
If your Swarm cluster is configured for TLS, replace `2375` with `2376`, and
`3375` with `3376`.
The ports listed above are just for Swarm cluster operations, such as cluster
creation, cluster management, and scheduling of containers against the cluster.
You may need to open additional network ports for application-related
communications.
It is possible for different components of a Swarm cluster to exist on separate
networks. For example, many organizations operate separate management and
production networks. Some Docker Engine clients may exist on a management
network, while Swarm managers, discovery service instances, and nodes might
exist on one or more production networks. To offset against network failures,
you can deploy Swarm managers, discovery services, and nodes across multiple
production networks. In all of these cases you can use the list of ports above
to assist the work of your network infrastructure teams to efficiently and
securely configure your network.
## High Availability (HA)
All production environments should be highly available, meaning they are
continuously operational over long periods of time. To achieve high
availability, an environment must the survive failures of its individual
component parts.
The following sections discuss some technologies and best practices that can
enable you to build resilient, highly available Swarm clusters. You can then use
these clusters to run your most demanding production applications and workloads.
### Swarm manager HA
The Swarm manager is responsible for accepting all commands coming in to a Swarm
cluster, and scheduling resources against the cluster. If the Swarm manager
becomes unavailable, some cluster operations cannot be performed until the Swarm
manager becomes available again. This is unacceptable in large-scale business
critical scenarios.
Swarm provides HA features to mitigate against possible failures of the Swarm
manager. You can use Swarm's HA feature to configure multiple Swarm managers for
a single cluster. These Swarm managers operate in an active/passive formation
with a single Swarm manager being the *primary*, and all others being
*secondaries*.
Swarm secondary managers operate as *warm standbys*, meaning they run in the
background of the primary Swarm manager. The secondary Swarm managers are online
and accept commands issued to the cluster, just like the primary Swarm manager.
However, any commands received by the secondaries are forwarded to the primary
where they are executed. Should the primary Swarm manager fail, a new primary is
elected from the surviving secondaries.
When creating HA Swarm managers, you should take care to distribute them over as
many *failure domains* as possible. A failure domain is a network section that
can be negatively affected if a critical device or service experiences problems.
For example, if your cluster is running in the Ireland Region of Amazon Web
Services (eu-west-1) and you configure three Swarm managers (1 x primary, 2 x
secondary), you should place one in each availability zone as shown below.
![](http://farm2.staticflickr.com/1657/24581727611_0a076b79de_b.jpg)
In this configuration, the Swarm cluster can survive the loss of any two
availability zones. For your applications to survive such failures, they must
also be architected across multiple failure domains.
For Swarm clusters serving high-demand, line-of-business applications, you
should have 3 or more Swarm managers. This configuration allows you to take one
manager down for maintenance, suffer an unexpected failure, and still continue
to manage and operate the cluster.
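As an illustration, each replicated manager can be started with the Swarm
`--replication` and `--advertise` options (a sketch only; the port, manager
address, and Consul address are placeholders for your own values):
```
$ docker run -d -p 4000:4000 swarm manage -H :4000 --replication \
    --advertise <manager-ip>:4000 consul://<consul-ip>:8500
```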
### Discovery service HA
The discovery service is a key component of a Swarm cluster. If the discovery
service becomes unavailable, this can prevent certain cluster operations. For
example, without a working discovery service, operations such as adding new
nodes to the cluster and making queries against the cluster configuration fail.
This is not acceptable in business critical production environments.
Swarm supports four backend discovery services:
- Hosted (not for production use)
- Consul
- etcd
- Zookeeper
Consul, etcd, and Zookeeper are all suitable for production, and should be
configured for high availability. You should use each service's existing tools
and best practices to configure these for HA.
For Swarm clusters serving high-demand, line-of-business applications, it is
recommended to have 5 or more discovery service instances. This is because the
replication/HA technologies they use (such as Paxos/Raft) require a strong
quorum. Having 5 instances allows you to take one down for maintenance, suffer
an unexpected failure, and still be able to achieve a strong quorum.
When creating a highly available Swarm discovery service, you should take care
to distribute each discovery service instance over as many failure domains as
possible. For example, if your cluster is running in the Ireland Region of
Amazon Web Services (eu-west-1) and you configure three discovery service
instances, you should place one in each availability zone.
The diagram below shows a Swarm cluster configured for HA. It has three Swarm
managers and three discovery service instances spread over three failure
domains (availability zones). It also has Swarm nodes balanced across all three
failure domains. The loss of two availability zones in the configuration shown
below does not cause the Swarm cluster to go down.
![](http://farm2.staticflickr.com/1675/24380252320_999687d2bb_b.jpg)
It is possible to share the same Consul, etcd, or Zookeeper containers between
the Swarm discovery and Engine container networks. However, for best
performance and availability you should deploy dedicated instances &ndash; a
discovery instance for Swarm and another for your container networks.
### Multiple clouds
You can architect and build Swarm clusters that stretch across multiple cloud
providers, and even across public cloud and on premises infrastructures. The
diagram below shows an example Swarm cluster stretched across AWS and Azure.
![](http://farm2.staticflickr.com/1493/24676269945_d19daf856c_b.jpg)
While such architectures may appear to provide the ultimate in availability,
there are several factors to consider. Network latency can be problematic, as
can partitioning. As such, you should seriously consider technologies that
provide reliable, high speed, low latency connections into these cloud
platforms &ndash; technologies such as AWS Direct Connect and Azure
ExpressRoute.
If you are considering a production deployment across multiple infrastructures
like this, make sure you have good test coverage over your entire system.
### Isolated production environments
It is possible to run multiple environments, such as development, staging, and
production, on a single Swarm cluster. You accomplish this by tagging Swarm
nodes and using constraints to filter containers onto nodes tagged as
`production` or `staging` etc. However, this is not recommended. The recommended
approach is to air-gap production environments, especially high performance
business critical production environments.
For example, many companies not only deploy dedicated, isolated infrastructures
for production &ndash; such as networks, storage, compute, and other systems &ndash;
they also deploy separate management systems and policies. This results in
things like users having separate accounts for logging on to production systems.
In these types of environments, it is mandatory to deploy dedicated
production Swarm clusters that operate on the production hardware infrastructure
and follow thorough production management, monitoring, audit and other policies.
### Operating system selection
You should give careful consideration to the operating system that your Swarm
infrastructure relies on. This consideration is vital for production
environments.
It is not unusual for a company to use one operating system in development
environments, and a different one in production. A common example of this is to
use CentOS in development environments, but then to use Red Hat Enterprise Linux
(RHEL) in production. This decision is often a balance between cost and support.
CentOS Linux can be downloaded and used for free, but commercial support options
are few and far between. Whereas RHEL has an associated support and license
cost, but comes with world class commercial support from Red Hat.
When choosing the production operating system to use with your Swarm clusters,
you should choose one that closely matches what you have used in development and
staging environments. Although containers abstract much of the underlying OS,
some things are mandatory. For example, Docker container networks require Linux
kernel 3.16 or higher. Operating a 4.x kernel in development and staging and
then 3.14 in production will certainly cause issues.
You should also consider procedures and channels for deploying and potentially
patching your production operating systems.
## Performance
Performance is critical in environments that support business critical line of
business applications. The following sections discuss some technologies and
best practices that can help you build high performance Swarm clusters.
### Container networks
Docker Engine container networks are overlay networks and can be created across
multiple Engine hosts. For this reason, a container network requires a key-value
(KV) store to maintain network configuration and state. This KV store can be
shared in common with the one used by the Swarm cluster discovery service.
However, for best performance and fault isolation, you should deploy individual
KV store instances for container networks and Swarm discovery. This is
especially so in demanding business critical production environments.
Engine container networks also require version 3.16 or higher of the Linux
kernel. Higher kernel versions are usually preferred, but carry an increased
risk of instability because of the newness of the kernel. Where possible, you
should use a kernel version that is already approved for use in your production
environment. If you do not have a 3.16 or higher Linux kernel version approved
for production, you should begin the process of getting one as early as
possible.
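For illustration only (a sketch; the KV store address and network name are
placeholders), each Engine daemon points at the dedicated KV store, and the
overlay network is then created once from any host in the cluster:
```
# Added to each Engine daemon's startup options (e.g. DOCKER_OPTS on Ubuntu)
--cluster-store=consul://<kv-store-ip>:8500 --cluster-advertise=eth0:2375

# Run once, from any Engine attached to the same KV store
$ docker network create --driver overlay app-network
```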
### Scheduling strategies
<!-- NIGEL: This reads like an explanation of specific scheduling strategies rather than guidance on which strategy to pick for production or with consideration of a production architecture choice. For example, is spread a problem in a multiple clouds or random not good for XXX type application for YYY reason?
Or perhaps there is nothing to consider when it comes to scheduling strategy and network / HA architecture, application, os choice etc. that good?
-->
Scheduling strategies are how Swarm decides which nodes on a cluster to start
containers on. Swarm supports the following strategies:
- spread
- binpack
- random (not for production use)
You can also write your own.
**Spread** is the default strategy. It attempts to balance the number of
containers evenly across all nodes in the cluster. This is a good choice for
high performance clusters, as it spreads container workload across all
resources in the cluster. These resources include CPU, RAM, storage, and
network bandwidth.
If your Swarm nodes are balanced across multiple failure domains, the spread
strategy evenly balances containers across those failure domains. However,
spread on its own is not aware of the roles of any of those containers, so it
has no intelligence to spread multiple instances of the same service across
failure domains. To achieve this, you should use tags and constraints; a brief
example follows.
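A sketch of both ideas follows (the image name, node label, and Consul address
are placeholders): the scheduling strategy is chosen when the Swarm manager
starts, while constraints rely on labels that you set on each Engine daemon.
```
# Choose the strategy when starting the Swarm manager (spread is the default)
$ docker run -d -p 3375:2375 swarm manage --strategy spread consul://<consul-ip>:8500/

# Schedule a container only onto nodes whose Engine was started with --label zone=a
$ docker run -d -e constraint:zone==a my-web-image
```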
The **binpack** strategy runs as many containers as possible on a node,
effectively filling it up, before scheduling containers on the next node.
This means that binpack does not use all cluster resources until the cluster
fills up. As a result, applications running on Swarm clusters that operate the
binpack strategy might not perform as well as those that operate the spread
strategy. However, binpack is a good choice for minimizing infrastructure
requirements and cost. For example, imagine you have a 10-node cluster where
each node has 16 CPUs and 128GB of RAM. However, your container workload across
the entire cluster is only using the equivalent of 6 CPUs and 64GB RAM. The
spread strategy would balance containers across all nodes in the cluster.
However, the binpack strategy would fit all containers on a single node,
potentially allowing you to turn off the additional nodes and save on cost.
## Ownership of Swarm clusters
The question of ownership is vital in production environments. You should
therefore consider and agree on all of the following when planning,
documenting, and deploying your production Swarm clusters.
- Whose budget does the production Swarm infrastructure come out of?
- Who owns the accounts that can administer and manage the production Swarm
cluster?
- Who is responsible for monitoring the production Swarm infrastructure?
- Who is responsible for patching and upgrading the production Swarm
infrastructure?
- Who has on-call responsibility, and what are the escalation procedures?
The above is not a complete list, and the answers to the questions will vary
depending on how your organization and teams are structured. Some companies
are a long way down the DevOps route, while others are not. Whatever situation
your company is in, it is important that you factor all of the above into the
planning, deployment, and ongoing management of your production Swarm clusters.
## Related information
* [Try Swarm at scale](swarm_at_scale.md)
* [Swarm and container networks](networking.md)
* [High availability in Docker Swarm](multi-manager-setup.md)
* [Universal Control plane](https://www.docker.com/products/docker-universal-control-plane)


@ -6,7 +6,7 @@ keywords = ["docker, swarm, clustering, scheduling"]
[menu.main]
identifier="swarm_sched"
parent="workw_swarm"
weight=80
weight=5
+++
<![end-metadata]-->

167
docs/secure-swarm-tls.md Normal file

@ -0,0 +1,167 @@
<!--[metadata]>
+++
title = "Overview Docker Swarm with TLS"
description = "Swarm and transport layer security"
keywords = ["docker, swarm, TLS, discovery, security, certificates"]
[menu.main]
parent="workw_swarm"
weight=50
+++
<![end-metadata]-->
# Overview Swarm with TLS
All nodes in a Swarm cluster must bind their Docker daemons to a network port.
This has obvious security implications. These implications are compounded when
the network in question is untrusted such as the internet. To mitigate these
risks, Docker Swarm and the Docker Engine daemon support Transport Layer Security
(TLS).
> **Note**: TLS is the successor to SSL (Secure Sockets Layer) and the two
> terms are often used interchangeably. Docker uses TLS, and that is the term
> used throughout this article.
## Learn the TLS concepts
Before going further, it is important to understand the basic concepts of TLS
and public key infrastructure (PKI).
Public key infrastructure is a combination of security-related technologies,
policies, and procedures that are used to create and manage digital
certificates. These certificates and infrastructure secure digital
communication using mechanisms such as authentication and encryption.
The following analogy may be useful. It is common practice that passports are
used to verify an individual's identity. Passports usually contain a photograph
and biometric information that identify the owner. A passport also lists the
country that issued it, as well as *valid from* and *valid to* dates. Digital
certificates are very similar. The text below is an extract from a digital
certificate:
```
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 9590646456311914051 (0x8518d2237ad49e43)
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, ST=CA, L=Sanfrancisco, O=Docker Inc
Validity
Not Before: Jan 18 09:42:16 2016 GMT
Not After : Jan 15 09:42:16 2026 GMT
Subject: CN=swarm
```
This certificate identifies a computer called **swarm**. The certificate is valid between January 2016 and January 2026 and was issued by Docker Inc based in the state of California in the US.
Just as passports authenticate individuals as they board flights and clear
customs, digital certificates authenticate computers on a network.
Public key infrastructure (PKI) is the combination of technologies, policies,
and procedures that work behind the scenes to enable digital certificates. Some
of the technologies, policies and procedures provided by PKI include:
- Services to securely request certificates
- Procedures to authenticate the entity requesting the certificate
- Procedures to determine the entity's eligibility for the certificate
- Technologies and processes to issue certificates
- Technologies and processes to revoke certificates
## How does Docker Engine authenticate using TLS
In this section, you'll learn how Docker Engine and Swarm use PKI and
certificates to increase security.
<!--[metadata]>Need to know about encryption too<![end-metadata]-->
You can configure both the Docker Engine CLI and the Engine daemon to require
TLS for authentication. Configuring TLS means that all communications between
the Engine CLI and the Engine daemon must be accompanied by, and signed with, a
trusted digital certificate. The Engine CLI must provide its digital certificate
before the Engine daemon will accept incoming commands from it.
The Engine daemon must also trust the certificate that the Engine CLI uses.
This trust is usually established by way of a trusted third party. The Engine
CLI and daemon in the diagram below are configured to require TLS
authentication.
![](images/trust-diagram.jpg)
The trusted third party in this diagram is the Certificate Authority (CA)
server. Like the country in the passport example, a CA creates, signs, issues,
and revokes certificates. Trust is established by installing the CA's root
certificate on the host running the Engine daemon. The Engine CLI then requests
its own certificate from the CA server, which the CA server signs and issues to
the client.
The Engine CLI sends its certificate to the Engine daemon before issuing
commands. The daemon inspects the certificate, and because the daemon trusts the CA,
the daemon automatically trusts any certificates signed by the CA. Assuming the
certificate is in order (the certificate has not expired or been revoked etc.)
the Engine daemon accepts commands from this trusted Engine CLI.
The Docker Engine CLI is simply a client that uses the Docker Remote API to
communicate with the Engine daemon. Any client that uses this Docker Remote API can use
TLS. For example, other Engine clients such as Docker Universal Control Plane
(UCP) have TLS support built in. Other third-party products that use Docker's
Remote API can also be configured this way.
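As a quick illustration of what this looks like from the client side (a sketch
only; the host name and certificate paths are placeholders), a TLS-enabled CLI
call presents its certificate and key with every command:
```
$ docker --tlsverify \
    --tlscacert /path/to/ca.pem \
    --tlscert /path/to/cert.pem \
    --tlskey /path/to/key.pem \
    -H tcp://<daemon-host>:2376 info
```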
## TLS modes with Docker and Swarm
Now that you know how certificates are used by Docker Engine for authentication,
it's important to be aware of the three TLS configurations possible with Docker
Engine and its clients:
- External 3rd party CA
- Internal corporate CA
- Self-signed certificates
These configurations are differentiated by the type of entity acting as the Certificate Authority (CA).
### External 3rd party CA
An external CA is a trusted 3rd party company that provides a means of creating,
issuing, revoking, and otherwise managing certificates. They are *trusted* in
the sense that they have to fulfill specific conditions and maintain high levels
of security and business practices to win your business. You also have to
install the external CA's root certificates for your computers and services to
*trust* them.
When you use an external 3rd party CA, they create, sign, issue, revoke and
otherwise manage your certificates. They normally charge a fee for these
services, but are considered an enterprise-class scalable solution that
provides a high degree of trust.
### Internal corporate CA
Many organizations choose to implement their own Certificate Authorities and
PKI. Common examples are using OpenSSL and Microsoft Active Directory. In this
case, your company is its own Certificate Authority with all the work it
entails. The benefit is, as your own CA, you have more control over your PKI.
Running your own CA and PKI requires you to provide all of the services offered
by external 3rd party CAs. These include creating, issuing, revoking, and
otherwise managing certificates. Doing all of this yourself has its own costs
and overheads. However, for a large corporation, it still may reduce costs in
comparison to using an external 3rd party service.
Assuming you operate and manage your own internal CAs and PKI properly, an
internal, corporate CA can be a highly scalable and highly secure option.
### Self-signed certificates
As the name suggests, self-signed certificates are certificates that are signed
with their own private key rather than a trusted CA. This is a low cost and
simple to use option. If you implement and manage self-signed certificates
correctly, they can be better than using no certificates.
Because self-signed certificates lack a full-blown PKI, they do not scale
well and lack many of the advantages offered by the other options. One of their
disadvantages is you cannot revoke self-signed certificates. Due to this, and
other limitations, self-signed certificates are considered the least secure of
the three options. Self-signed certificates are not recommended for public
facing production workloads exposed to untrusted networks.
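For completeness, a self-signed certificate and key can be produced with a
single OpenSSL command (a sketch only; the key size, validity period, and
subject name are placeholder choices):
```
$ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout key.pem -out cert.pem -subj "/CN=myhost"
```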
## Related information
* [Configure Docker Swarm for TLS](configure-tls.md)
* [Docker security](https://docs.docker.com/engine/security/security/)

771
docs/swarm_at_scale.md Normal file

@ -0,0 +1,771 @@
<!--[metadata]>
+++
title = "Try Swarm at scale"
description = "Try Swarm at scale"
keywords = ["docker, swarm, scale, voting, application, certificates"]
[menu.main]
parent="workw_swarm"
weight=75
+++
<![end-metadata]-->
# Try Swarm at scale
Using this example, you'll deploy a voting application on a Swarm cluster. The
example walks you through creating a Swarm cluster and deploying the application
against the cluster. This walk through is intended to illustrate one example of
a typical development process.
After building and manually deploying the voting application, you'll construct a
Docker Compose file. You (or others) can use the file to deploy and scale the
application further. The article also provides a troubleshooting section you can
use while developing or deploying the voting application.
## About the example
Your company is a pet food company that has bought a commercial during the
Super Bowl. The commercial drives viewers to a web survey that asks users to vote &ndash; cats or dogs. You are developing the web survey. Your survey must ensure that
millions of people can vote concurrently without your website becoming
unavailable. You don't need real-time results because a company press release
will announce the results. However, you do need confidence that every vote is counted.
The example assumes you are deploying the application to a Docker Swarm cluster
running on top of Amazon Web Services (AWS). AWS is an example only. There is
nothing about this application or deployment that requires it. You could deploy
the application to a Docker Swarm cluster running on a different cloud provider
such as Microsoft Azure, on premises in your own physical data center, or in a
development environment on your laptop.
The example requires you to perform the following high-level steps:
- [Deploy your infrastructure](#deploy-your-infrastructure)
- [Create the Swarm cluster](#create-the-swarm-cluster)
- [Overlay a container network on the cluster](#overlay-a-container-network-on-the-cluster)
- [Deploy the voting application](#deploy-the-voting-application)
- [Test the application](#test-the-application)
Before working through the sample, make sure you understand the application and Swarm cluster architecture.
### Application architecture
The voting application is a Dockerized microservice application. It uses a
parallel web frontend that sends jobs to asynchronous background workers. The
application's design can accommodate arbitrarily large scale. The diagram below
shows the high level architecture of the application.
![](images/app-architecture.jpg)
The application is fully Dockerized with all services running inside of
containers.
The frontend consists of an Interlock load balancer with *n* frontend web
servers and associated queues. The load balancer can handle an arbitrary number
of web containers behind it (`frontend01`- `frontendN`). The web containers run
a simple Python Flask application. Each container accepts votes and queues them
to a Redis container on the same node. Each web container and Redis queue pair
operates independently.
The load balancer together with the independent pairs allows the entire
application to scale to an arbitrary size as needed to meet demand.
Behind the frontend is a worker tier which runs on separate nodes. This tier:
* scans the Redis containers
* dequeues votes
* deduplicates votes to prevent double voting
* commits the results to a Postgres container running on a separate node
Just like the front end, the worker tier can also scale arbitrarily.
### Swarm Cluster Architecture
To support the application, the design calls for a Swarm cluster with a single Swarm manager and 4 nodes, as shown below.
![](images/swarm-cluster-arch.jpg)
All four nodes in the cluster are running the Docker daemon, as is the Swarm
manager and the Interlock load balancer. The Swarm manager exists on a Docker
host that is not part of the cluster and is considered out of band for the
application. The Interlock load balancer could be placed inside of the cluster,
but for this demonstration it is not.
The diagram below shows the application architecture overlayed on top of the
Swarm cluster architecture. After completing the example and deploying your
application, this is what your environment should look like.
![](images/final-result.jpg)
As the previous diagram shows, each node in the cluster runs the following containers:
- `frontend01`:
- Container: Python Flask web app (frontend01)
- Container: Redis (redis01)
- `frontend02`:
- Container: Python Flask web app (frontend02)
- Container: Redis (redis02)
- `worker01`: vote worker app (worker01)
- `store`:
- Container: Postgres (pg)
- Container: results app (results-app)
## Deploy your infrastructure
As previously stated, this article will walk you through deploying the
application to a Swarm cluster in an AWS Virtual Private Cloud (VPC). However,
you can reproduce the environment design on whatever platform you wish. For
example, you could place the application on another public cloud platform such
as DigitalOcean, on premises in your data center, or even in a test
environment on your laptop.
Deploying the AWS infrastructure requires that you first build the VPC and then
apply the [CloudFormation
template](https://github.com/docker/swarm-demo-voting-app/blob/master/AWS/cloudformation.json).
While you could create the entire VPC and all instances via a CloudFormation
template, splitting the deployment into two steps allows the CloudFormation
template to be easily used to build instances in *existing VPCs*.
The diagram below shows the VPC infrastructure required to run the
CloudFormation template.
![](images/cloud-formation-tmp.jpg)
The AWS configuration is a single VPC with a single public subnet. The VPC must
be in the `us-west-1` Region (N. California). This Region is required for this
particular CloudFormation template to work. The VPC network address space is
`192.168.0.0/16`, and a single 24-bit public subnet is carved out as
`192.168.33.0/24`. The subnet must be configured with a default route to the
internet via the VPC's internet gateway. All 6 EC2 instances are deployed into
this public subnet.
Once the VPC is created you can deploy the EC2 instances using the
CloudFormation template located
[here](https://github.com/docker/swarm-demo-voting-app/blob/master/AWS/cloudformation.json).
>**Note**: If you are not deploying to AWS, or are not using the CloudFormation template mentioned above, make sure your Docker hosts are running a 3.16 or higher kernel. This kernel is required by Docker's container networking feature.
### Step 1. Build and configure the VPC
This step assumes you know [how to configure a VPC](link here) either manually
or using the VPC wizard on Amazon. You can build the VPC manually or by using
the VPC Wizard. If you use the wizard, be sure to choose the **VPC with a
Single Public Subnet** option.
Configure your VPC with the following values:
- **Region**: N. California (us-west-1)
- **VPC Name**: Swarm-scale
- **VPC Network (CIDR)**: 192.168.0.0/16
- **DNS resolution**: Yes
- **Subnet name**: PublicSubnet
- **Subnet type**: Public (with route to the internet)
- **Subnet network (CIDR)**: 192.168.33.0/24
- **Auto-assign public IP**: Yes
- **Availability Zone**: Any
- **Router**: A single router with a route for *local* traffic and default route for traffic to the internet
- **Internet gateway**: A single internet gateway used as default route for the subnet's routing table
You'll configure the remaining AWS settings in the next section as part of the
CloudFormation template.
### Step 2. Apply the CloudFormation template
Before you can apply the CloudFormation template, you will need to have created
a VPC as per instructions in the previous section. You will also need access to
the private key of an EC2 KeyPair associated with your AWS account in the
`us-west-1` Region. Follow the steps below to build the remainder of the AWS
infrastructure using the CloudFormation template.
1. Choose **Create Stack** from the CloudFormation page in the AWS Console
2. Click the **Choose file** button under the **Choose a template** section
3. Select the **swarm-scale.json** CloudFormation template available from the [application's GitHub repo](https://github.com/docker/swarm-demo-voting-app/blob/master/AWS/cloudformation.json)
4. Click **Next**
5. Give the Stack a name. You can name the stack whatever you want, though it is recommended to use a meaningful name
6. Select a KeyPair from the dropdown list
7. Select the correct **Subnetid** (PublicSubnet) and **Vpcid** (SwarmCluster) from the dropdowns
8. Click **Next**
9. Click **Next** again
10. Review your settings and click **Create**
AWS displays the progress of your stack as it is created.
### Step 3. Check your deployment
When it completes, the CloudFormation template populates your VPC with the following six EC2 instances:
- `manager`: t2.micro / 192.168.33.11
- `interlock`: t2.micro / 192.168.33.12
- `frontend01`: t2.micro / 192.168.33.20
- `frontend02`: t2.micro / 192.168.33.21
- `worker01`: t2.micro / 192.168.33.200
- `store`: m3.medium / 192.168.33.250
Your AWS infrastructure should look like this.
![](images/aws-infrastructure.jpg)
All instances are based on the `ami-56f59e36` AMI. This is an Ubuntu 14.04
image with a 3.16 kernel and Docker Engine 1.9.1 installed. It also has
the following parameters added to the `DOCKER_OPTS` line in
`/etc/default/docker`:
```
--cluster-store=consul://192.168.33.11:8500 --cluster-advertise=eth0:2375 -H=tcp://0.0.0.0:2375 -H=unix:///var/run/docker.sock
```
Once your stack is created successfully you are ready to progress to the next
step and build the Swarm cluster. From this point, the instructions refer to the
AWS EC2 instances as "nodes".
## Create the Swarm cluster
Now that your underlying network infrastructure is built, you are ready to build and configure the Swarm cluster.
### Step 1: Construct the cluster
The steps below construct a Swarm cluster by:

* using Consul as the discovery backend
* joining the `frontend`, `worker`, and `store` EC2 instances to the cluster
* using the `spread` scheduling strategy
Perform all of the following commands from the `manager` node.
1. Start a new Consul container that listens on TCP port 8500
$ sudo docker run --restart=unless-stopped -d -p 8500:8500 -h consul progrium/consul -server -bootstrap
This starts a Consul container for use as the Swarm discovery service. This
backend is also used as the K/V store for the container network that you
overlay on the Swarm cluster in a later step.
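    Before moving on, you can optionally confirm that Consul is up and has
    elected a leader by querying its HTTP API (this assumes `curl` is installed
    on the `manager` node):

        $ curl http://192.168.33.11:8500/v1/status/leader

    A non-empty, quoted address in the response means Consul is ready; an empty
    response means it is still starting up.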
2. Start a Swarm manager container.
    This command maps port 3375 on the `manager` node to port 2375 inside the
    Swarm manager container.
$ sudo docker run --restart=unless-stopped -d -p 3375:2375 swarm manage consul://192.168.33.11:8500/
This Swarm manager container is the heart of your Swarm cluster. It is
responsible for receiving all Docker commands sent to the cluster, and for
scheduling resources against the cluster. In a real-world production
deployment you would configure additional replica Swarm managers as
secondaries for high availability (HA).
3. Set the `DOCKER_HOST` environment variable.
    This ensures that, by default, Docker commands are sent to the Swarm manager listening on port 3375 of the `manager` node.
$ export DOCKER_HOST="tcp://192.168.33.11:3375"
4. While still on the `manager` node, join the nodes to the cluster.
    You can run these commands from the `manager` node because the `-H` flag
    sends the commands to the Docker daemons on the nodes. The command joins a
    node to the cluster and registers it with the Consul discovery service.
sudo docker -H=tcp://<node-private-ip>:2375 run -d swarm join --advertise=<node-private-ip>:2375 consul://192.168.33.11:8500/
    Substitute `<node-private-ip>` in the command with the private IP of the
    node you are adding. Repeat this step for every node you are adding to the
    cluster: `frontend01`, `frontend02`, `worker01`, and `store`.
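    If you prefer not to repeat the command by hand for each node, the
    following convenience sketch loops over the private IPs listed earlier in
    this article and runs the same join command against each one:

        for ip in 192.168.33.20 192.168.33.21 192.168.33.200 192.168.33.250; do
            sudo docker -H=tcp://${ip}:2375 run -d swarm join --advertise=${ip}:2375 consul://192.168.33.11:8500/
        done

    Afterwards, `sudo docker -H tcp://192.168.33.11:3375 info` should report
    all four nodes.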
### Step 2: Review your work
The diagram below shows the Swarm cluster that you created.
![](images/review-work.jpg)
The diagram shows the `manager` node is running two containers: `consul` and
`swarm`. The `consul` container is providing the Swarm discovery service. This
is where nodes and services register themselves and discover each other. The
`swarm` container is running the `swarm manage` process which makes it act as
the cluster manager. The manager is responsible for accepting Docker commands
issued against the cluster and scheduling resources on the cluster.
You mapped port 3375 on the `manager` node to port 2375 inside the `swarm`
container. As a result, Docker clients (for example the CLI) wishing to issue
commands against the cluster must send them to the `manager` node on port
3375. The `swarm` container then executes those commands against the relevant
node(s) in the cluster over port 2375.
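For example, with `DOCKER_HOST` still set to `tcp://192.168.33.11:3375` (or
with `-H` passed explicitly), ordinary Docker commands are answered by the
Swarm manager on behalf of the whole cluster:

```
$ sudo docker -H tcp://192.168.33.11:3375 info    # cluster-wide summary, including the list of nodes
$ sudo docker -H tcp://192.168.33.11:3375 ps      # containers running anywhere in the cluster
```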
Now that you have your Swarm cluster configured, you'll overlay the container
network that the application containers will be part of.
## Overlay a container network on the cluster
All containers that are part of the voting application belong to a container
network called `mynet`. This will be an overlay network that allows all
application containers to easily communicate irrespective of the underlying
network that each node is on.
### Step 1: Create the network
You can create the network and join the containers from any node in your VPC
that is running Docker Engine. However, best practice when using Docker Swarm is
to execute commands from the `manager` node, as this is where all management
tasks happen.
1. Open a terminal on your `manager` node.
2. Create the overlay network with the `docker network` command
$ sudo docker network create --driver overlay mynet
An overlay container network is visible to all Docker daemons that use the
same discovery backend. As all Swarm nodes in your environment are
configured to use the Consul discovery service at
`consul://192.168.33.11:8500`, they all should see the new overlay network.
Verify this with the next step.
3. Log onto each node in your Swarm cluster and verify the `mynet` network is running.
$ sudo docker network ls
NETWORK ID NAME DRIVER
72fa20d0663d mynet overlay
bd55c57854b8 host host
25e34427f6ff bridge bridge
8eee5d2130ab none null
You should see an entry for the `mynet` network using the `overlay` driver as shown above.
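    For more detail than `docker network ls` provides, you can also inspect the
    network from any node. Among other things, the output lists the containers
    attached to the network (none yet at this point):

        $ sudo docker network inspect mynet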
### Step 2: Review your work
The diagram below shows the complete cluster configuration, including the
overlay container network `mynet`. The `mynet` network is shown in red and is
available to all Docker hosts using the Consul discovery backend. Later in the
procedure you will connect containers to this network.
![](images/overlay-review.jpg)
> **Note**: The `swarm` and `consul` containers on the `manager` node are not attached to the `mynet` overlay network.
Your cluster is now built and you are ready to build and run your application on
it.
## Deploy the voting application
Now it's time to configure the application.
Some of the containers in the application are launched from custom images you
must build. Others are launched from existing images pulled directly from Docker
Hub. Deploying the application requires that you:
- Understand the custom images
- Build custom images
- Pull stock images from Docker Hub
- Launch application containers
### Step 1: Understand the custom images
The list below shows which containers use custom images and which do not:
- Web containers: custom built image
- Worker containers: custom built image
- Results containers: custom built image
- Load balancer container: stock image (`ehazlett/interlock`)
- Redis containers: stock image (official `redis` image)
- Postgres (PostgreSQL) containers: stock image (official `postgres` image)
All custom built images are built using Dockerfiles pulled from the [application's public GitHub repository](https://github.com/docker/swarm-demo-voting-app).
1. Log into the Swarm manager node.
2. Clone the [application's GitHub repo](https://github.com/docker/swarm-demo-voting-app)
$ sudo git clone https://github.com/docker/swarm-demo-voting-app
This command creates a new directory structure inside of your working
directory. The new directory contains all of the files and folders required
to build the voting application images.
The `AWS` directory contains the `cloudformation.json` file used to deploy
the EC2 instances. The `Vagrant` directory contains files and instructions
required to deploy the application using Vagrant. The `results-app`,
`vote-worker`, and `web-vote-app` directories contain the Dockerfiles and
other files required to build the custom images for those particular
components of the application.
3. Change directory into the `swarm-demo-voting-app/web-vote-app` directory and inspect the contents of the `Dockerfile`
$ cd swarm-demo-voting-app/web-vote-app/
$ cat Dockerfile
FROM python:2.7
WORKDIR /app
ADD requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt
ADD . /app
EXPOSE 80
CMD ["python", "app.py"]
    As you can see, the image is based on the official `python:2.7` image. It
    adds a requirements file into the `/app` directory, installs the
    requirements, copies the files from the build context into the container,
    exposes port `80`, and tells the container which command to run.
### Step 2. Build custom images
1. Log into the swarm manager node if you haven't already.
2. Change to the root of your swarm-demo-voting app clone.
3. Build the `web-vote-app` image on `frontend01` and `frontend02`
$ sudo docker -H tcp://192.168.33.20:2375 build -t web-vote-app ./web-vote-app
$ sudo docker -H tcp://192.168.33.21:2375 build -t web-vote-app ./web-vote-app
These commands build the `web-vote-app` image on the `frontend01` and
`frontend02` nodes. To accomplish the operation, each command copies the
contents of the `swarm-demo-voting-app/web-vote-app` sub-directory from the
`manager` node to each frontend node. The command then instructs the
Docker daemon on each frontend node to build the image and store it locally.
It may take a minute or so for each image to build. Wait for the builds to finish.
4. Build the `vote-worker` image on the `worker01` node
$ sudo docker -H tcp://192.168.33.200:2375 build -t vote-worker ./vote-worker
It may take a minute or so for the image to build. Wait for the build to finish.
5. Build the `results-app` image on the `store` node
$ sudo docker -H tcp://192.168.33.250:2375 build -t results-app ./results-app
Each of the *custom images* required by the application is now built and stored locally on the nodes that will use them.
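If you want to confirm that each image landed on the node you expect, you can
list it on that node, for example:

```
$ sudo docker -H tcp://192.168.33.20:2375 images web-vote-app
$ sudo docker -H tcp://192.168.33.21:2375 images web-vote-app
$ sudo docker -H tcp://192.168.33.200:2375 images vote-worker
$ sudo docker -H tcp://192.168.33.250:2375 images results-app
```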
### Step 3. Pull stock images from Docker Hub
For performance reasons, it is always better to pull any required Docker Hub images locally on each instance that needs them. This ensures that containers based on those images can start quickly.
1. Log into the Swarm `manager` node.
2. Pull the `redis` image to `frontend01` and `frontend02`
$ sudo docker -H tcp://192.168.33.20:2375 pull redis
$ sudo docker -H tcp://192.168.33.21:2375 pull redis
3. Pull the `postgres` image to the `store` node
$ sudo docker -H tcp://192.168.33.250:2375 pull postgres
4. Pull the `ehazlett/interlock` image to the `interlock` node
$ sudo docker -H tcp://192.168.33.12:2375 pull ehazlett/interlock
Each node in the cluster, as well as the `interlock` node, now has the required images stored locally as shown below.
![](images/interlock.jpg)
Now that all images are built, pulled, and stored locally, the next step is to start the application.
### Step 4. Start the voting application
The following steps guide you through starting the application:
* Start the `interlock` load balancer container on `interlock`
* Start the `redis` containers on `frontend01` and `frontend02`
* Start the `web-vote-app` containers on `frontend01` and `frontend02`
* Start the `postgres` container on `store`
* Start the `worker` container on `worker01`
* Start the `results-app` container on `store`
Do the following:
1. Log into the Swarm `manager` node.
2. Start the `interlock` container on the `interlock` node
$ sudo docker -H tcp://192.168.33.12:2375 run --restart=unless-stopped -p 80:80 --name interlock -d ehazlett/interlock --swarm-url tcp://192.168.33.11:3375 --plugin haproxy start
    This command is issued against the `interlock` instance and maps port 80 on the instance to port 80 inside the container. This allows the container to load balance connections coming in over port 80 (HTTP). The command also applies the `--restart=unless-stopped` policy to the container, telling Docker to restart the container if it exits unexpectedly.
3. Start a `redis` container on `frontend01` and `frontend02`
$ sudo docker run --restart=unless-stopped --env="constraint:node==frontend01" -p 6379:6379 --name redis01 --net mynet -d redis
$ sudo docker run --restart=unless-stopped --env="constraint:node==frontend02" -p 6379:6379 --name redis02 --net mynet -d redis
    These two commands are issued against the Swarm cluster. The commands specify *node constraints*, forcing Swarm to start the containers on `frontend01` and `frontend02`. Port 6379 on each instance is mapped to port 6379 inside of each container for debugging purposes. The command also applies the `--restart=unless-stopped` policy to the containers and attaches them to the `mynet` overlay network.
4. Start a `web-vote-app` container on `frontend01` and `frontend02`
$ sudo docker run --restart=unless-stopped --env="constraint:node==frontend01" -d -p 5000:80 -e WEB_VOTE_NUMBER='01' --name frontend01 --net mynet --hostname votingapp.local web-vote-app
$ sudo docker run --restart=unless-stopped --env="constraint:node==frontend02" -d -p 5000:80 -e WEB_VOTE_NUMBER='02' --name frontend02 --net mynet --hostname votingapp.local web-vote-app
    These two commands are issued against the Swarm cluster. The commands specify *node constraints*, forcing Swarm to start the containers on `frontend01` and `frontend02`. Port 5000 on each node is mapped to port 80 inside of each container. This allows connections to come in to each node on port 5000 and be forwarded to port 80 inside of each container. Both containers are attached to the `mynet` overlay network and both containers are given the `votingapp.local` hostname. The `--restart=unless-stopped` policy is also applied to these containers.
5. Start the `postgres` container on the `store` node
$ sudo docker run --restart=unless-stopped --env="constraint:node==store" --name pg -e POSTGRES_PASSWORD=pg8675309 --net mynet -p 5432:5432 -d postgres
This command is issued against the Swarm cluster and starts the container on `store`. It maps port 5432 on the `store` node to port 5432 inside the container and attaches the container to the `mynet` overlay network. It also inserts the database password into the container via the POSTGRES_PASSWORD environment variable and applies the `--restart=unless-stopped` policy to the container. Sharing passwords like this is not recommended for production use cases.
6. Start the `worker01` container on the `worker01` node
$ sudo docker run --restart=unless-stopped --env="constraint:node==worker01" -d -e WORKER_NUMBER='01' -e FROM_REDIS_HOST=1 -e TO_REDIS_HOST=2 --name worker01 --net mynet vote-worker
This command is issued against the Swarm manager and uses a constraint to start the container on the `worker01` node. It passes configuration data into the container via environment variables, telling the worker container to clear the queues on `frontend01` and `frontend02`. It adds the container to the `mynet` overlay network and applies the `--restart=unless-stopped` policy to the container.
7. Start the `results-app` container on the `store` node
$ sudo docker run --restart=unless-stopped --env="constraint:node==store" -p 80:80 -d --name results-app --net mynet results-app
This command starts the results-app container on the `store` node by means of a *node constraint*. It maps port 80 on the `store` node to port 80 inside the container. It adds the container to the `mynet` overlay network and applies the `--restart=unless-stopped` policy to the container.
The application is now fully deployed as shown in the diagram below.
![](images/fully-deployed.jpg)
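As a quick sanity check, you can list everything the Swarm manager has
scheduled. The `interlock` container was started directly against the
`interlock` node rather than through Swarm, so list it separately:

```
$ sudo docker -H tcp://192.168.33.11:3375 ps      # containers scheduled through the Swarm manager
$ sudo docker -H tcp://192.168.33.12:2375 ps      # the interlock container on the interlock node
```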
## Test the application
Now that the application is deployed and running, it's time to test it.
1. Configure a DNS mapping on your local machine for browsing.
You configure a DNS mapping on the machine where you are running your web browser. This maps the "votingapp.local" DNS name to the public IP address of the `interlock` node.
    - On Windows machines, add `<interlock-public-ip> votingapp.local` to the `C:\Windows\System32\Drivers\etc\hosts` file. Modifying this file requires administrator privileges. To open the file with administrator privileges, right-click `C:\Windows\System32\notepad.exe` and select `Run as administrator`. Once Notepad is open, click `File` > `Open`, open the file, and make the edit.
    - On OS X machines, add `<interlock-public-ip> votingapp.local` to `/private/etc/hosts`.
    - On most Linux machines, add `<interlock-public-ip> votingapp.local` to `/etc/hosts`.

    Be sure to replace `<interlock-public-ip>` with the public IP address of your `interlock` node. You can find the `interlock` node's public IP by selecting your `interlock` EC2 instance in the AWS EC2 console.
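    For example, if the public IP address of your `interlock` node were
    `203.0.113.10` (an illustrative address only), the hosts file entry would
    look like this:

        203.0.113.10    votingapp.local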
2. Verify that the mapping works with a ping command from the machine running your web browser
C:\Users\nigelpoulton>ping votingapp.local
Pinging votingapp.local [54.183.164.230] with 32 bytes of data:
Reply from 54.183.164.230: bytes=32 time=164ms TTL=42
Reply from 54.183.164.230: bytes=32 time=163ms TTL=42
Reply from 54.183.164.230: bytes=32 time=169ms TTL=42
3. Now that name resolution is configured and you have successfully pinged `votingapp.local`, point your web browser to [http://votingapp.local](http://votingapp.local)
![](images/vote-app-test.jpg)
Notice the text at the bottom of the web page. This shows which web
container serviced the request. In the diagram above, this is `frontend02`.
If you refresh your web browser you should see this change as the Interlock
load balancer shares incoming requests across both web containers.
To see more detailed load balancer data from the Interlock service, point your web browser to [http://stats:interlock@votingapp.local/haproxy?stats](http://stats:interlock@votingapp.local/haproxy?stats)
![](images/proxy-test.jpg)
4. Cast your vote. It is recommended to choose "Dogs" ;-)
5. To see the results of the poll, you can point your web browser at the public IP of the `store` node
![](images/poll-results.jpg)
Congratulations. You have successfully walked through manually deploying a microservice-based application to a Swarm cluster.
## Troubleshooting the application
It's a fact of life that things fail. With this in mind, it's important to
understand what happens when failures occur and how to mitigate them. The
following sections cover different failure scenarios:
- [Swarm manager failures](#swarm-manager-failures)
- [Consul (discovery backend) failures](#consul-discovery-backend-failures)
- [Interlock load balancer failures](#interlock-load-balancer-failures)
- [Web (web-vote-app) failures](#web-web-vote-app-failures)
- [Redis failures](#redis-failures)
- [Worker (vote-worker) failures](#worker-vote-worker-failures)
- [Postgres failures](#postgres-failures)
- [Results-app failures](#results-app-failures)
- [Infrastructure failures](#infrastructure-failures)
### Swarm manager failures
In it's current configuration, the Swarm cluster only has single manager
container running on a single node. If the container exits or the node fails,
you will not be able to administer the cluster until you either; fix it, or
replace it.
If the failure is the Swarm manager container unexpectedly exiting, Docker will
automatically attempt to restart it. This is because the container was started
with the `--restart=unless-stopped` switch.
While the Swarm manager is unavailable, the application will continue to work in
its current configuration. However, you will not be able to provision more nodes
or containers until you have a working Swarm manager.
Docker Swarm supports high availability for Swarm managers. This allows a single
Swarm cluster to have two or more managers. One manager is elected as the
primary manager and all others operate as secondaries. In the event that the
primary manager fails, one of the secondaries is elected as the new primary, and
cluster operations continue gracefully. If you are deploying multiple Swarm
managers for high availability, you should consider spreading them across
multiple failure domains within your infrastructure.
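As a rough sketch only (see [High availability in Docker Swarm](multi-manager-setup.md)
for the full procedure), each manager is started with the `--replication` flag
and pointed at the same discovery backend; the replica's address below is a
placeholder:

```
# On the existing manager node -- illustrative sketch only
$ sudo docker run --restart=unless-stopped -d -p 3375:2375 swarm manage --replication --advertise 192.168.33.11:3375 consul://192.168.33.11:8500/

# On an additional manager node (replace <replica-private-ip> with its private IP)
$ sudo docker run --restart=unless-stopped -d -p 3375:2375 swarm manage --replication --advertise <replica-private-ip>:3375 consul://192.168.33.11:8500/
```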
### Consul (discovery backend) failures
The Swarm cluster that you have deployed has a single Consul container on a
single node performing the cluster discovery service. In this setup, if the
Consul container exits or the node fails, the application will continue to
operate in its current configuration. However, certain cluster management
operations will fail. These include registering new containers in the cluster
and making lookups against the cluster configuration.
If the failure is the `consul` container unexpectedly exiting, Docker will
automatically attempt to restart it. This is because the container was started
with the `--restart=unless-stopped` switch.
The `Consul`, `etcd`, and `ZooKeeper` discovery service backends support various
options for high availability, including Raft or Paxos quorums. You should
follow existing best practices for deploying HA configurations of your chosen
discovery service backend. If you are deploying multiple discovery service
instances for high availability, you should consider spreading them across
multiple failure domains within your infrastructure.
If you operate your Swarm cluster with a single discovery backend service and
this service fails and is unrecoverable, you can start a new empty instance of
the discovery backend and the Swarm agents on each node in the cluster will
repopulate it.
#### Handling failures
There are many reasons why containers can fail. However, Swarm does not attempt
to restart failed containers.
One way to automatically restart failed containers is to explicitly start them
with the `--restart=unless-stopped` flag. This tells the local Docker daemon
to attempt to restart the container if it unexpectedly exits. This only
works in situations where the node hosting the container and its Docker daemon
are still up. It cannot restart a container if the node hosting it has failed,
or if the Docker daemon itself has failed.
Another way is to have an external tool (external to the cluster) monitor the
state of your application, and make sure that certain service levels are
maintained. These service levels can include things like "have at least 10 web
server containers running". In this scenario, if the number of web containers
drops below 10, the tool will attempt to start more.
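The sketch below illustrates the idea in its simplest form; it is not a
production monitoring tool. It checks, via the Swarm manager, that the two web
containers from this article are running and tries to start any that are not:

```
#!/bin/bash
# Illustrative sketch only: verify the named web containers are running and
# try to start any that are not.
SWARM="tcp://192.168.33.11:3375"

for name in frontend01 frontend02; do
    if ! docker -H "$SWARM" ps --filter "name=$name" --filter "status=running" -q | grep -q .; then
        echo "$name is not running; attempting to start it"
        docker -H "$SWARM" start "$name" || echo "$name could not be started; manual intervention required"
    fi
done
```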
In our simple voting-app example, the front end is scalable and serviced by a
load balancer. In the event that one of the two web containers fails (or the
AWS instance that is hosting it), the load balancer stops routing requests
to it and sends all requests to the surviving web container. This solution is
highly scalable, meaning you can have up to *n* web containers behind the load
balancer.
### Interlock load balancer failures
The environment that you have provisioned has a single
[interlock](https://github.com/ehazlett/interlock) load balancer container
running on a single node. In this setup, if the container exits or the node fails,
the application will no longer be able to service incoming requests and the
application will be unavailable.
If the failure is the `interlock` container unexpectedly exiting, Docker will
automatically attempt to restart it. This is because the container was started
with the `--restart=unless-stopped` switch.
It is possible to build an HA Interlock load balancer configuration. One such
way is to have multiple Interlock containers on multiple nodes. You can then use
DNS round robin, or other technologies, to load balance across each Interlock
container. That way, if one Interlock container or node goes down, the others
will continue to service requests.
If you deploy multiple interlock load balancers, you should consider spreading
them across multiple failure domains within your infrastructure.
### Web (web-vote-app) failures
The environment that you have configured has two web-vote-app containers running
on two separate nodes. They operate behind an Interlock load balancer that
distributes incoming connections across both.
In the event that one of the web containers or nodes fails, the load balancer
will start directing all incoming requests to the surviving instance. Once the
failed instance is back up, or a replacement is added, the load balancer will
add it to the configuration and start sending a portion of the incoming requests
to it.
For highest availability you should deploy the two frontend web services
(`frontend01` and `frontend02`) in different failure zones within your
infrastructure. You should also consider deploying more.
### Redis failures
If a `redis` container fails, its partnered `web-vote-app` container will
not function correctly. The best solution in this instance might be to configure
health monitoring that verifies the ability to write to each Redis instance. If
an unhealthy `redis` instance is encountered, remove the `web-vote-app` and
`redis` combination and attempt remedial actions.
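A very simple version of such a check, assuming `redis-cli` is available on the
machine running it and using the published port 6379 from earlier in this
article, might look like this:

```
# Illustrative health check sketch: write and read back a key on each Redis instance.
for ip in 192.168.33.20 192.168.33.21; do
    if redis-cli -h "$ip" -p 6379 set healthcheck ok > /dev/null && \
       [ "$(redis-cli -h "$ip" -p 6379 get healthcheck)" = "ok" ]; then
        echo "redis on $ip passed the write/read check"
    else
        echo "redis on $ip failed the write/read check"
    fi
done
```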
### Worker (vote-worker) failures
If the worker container exits, or the node hosting it fails, the redis
containers will queue votes until the worker container comes back up. This
situation can persist indefinitely, though a worker eventually needs to come
back online to process the queued votes.
If the failure is the `worker01` container unexpectedly exiting, Docker will
automatically attempt to restart it. This is because the container was started
with the `--restart=unless-stopped` switch.
### Postgres failures
This application does not implement any form of HA or replication for Postgres.
Therefore, losing the Postgres container would cause the application to fail and
potentially lose or corrupt data. A better solution would be to implement some
form of Postgres HA or replication.
### Results-app failures
If the results-app container exits, you will not be able to browse to the
results of the poll until the container is back up and running. Results will
continue to be collected and counted; you just won't be able to view them
until the container is back up and running.
The results-app container was started with the `--restart=unless-stopped` flag
meaning that the Docker daemon will automatically attempt to restart it unless
it was administratively stopped.
### Infrastructure failures
There are many ways in which the infrastructure underpinning your applications
can fail. However, there are a few best practices that can be followed to help
mitigate and offset these failures.
One of these is to deploy infrastructure components over as many failure domains
as possible. On a service such as AWS, this often translates into balancing
infrastructure and services across multiple AWS Availability Zones (AZ) within a
Region.
To increase the availability of our Swarm cluster you could:
* Configure the Swarm manager for HA and deploy HA nodes in different AZs
* Configure the Consul discovery service for HA and deploy HA nodes in different AZs
* Deploy all scalable components of the application across multiple AZs
This configuration is shown in the diagram below.
![](images/infrastructure-failures.jpg)
This will allow us to lose an entire AZ and still have our cluster and
application operate.
But it doesn't have to stop there. Some applications can be balanced across AWS
Regions. In our example, we might deploy parts of our cluster and application in
the `us-west-1` Region and the rest in `us-east-1`. It is even becoming possible
to deploy services across cloud providers, or to balance services across public
cloud providers and your on-premises data centers.
The diagram below shows parts of the application and infrastructure deployed
across AWS and Microsoft Azure. But you could just as easily replace one of
those cloud providers with your own on premises data center. In these scenarios,
network latency and reliability is key to a smooth and workable solution.
![](images/deployed-across.jpg)
## Related information
The application in this example could be deployed on Docker Universal Control Plane (UCP), which is currently in Beta release. To try the application on UCP in your environment, [request access to the UCP Beta release](https://www.docker.com/products/docker-universal-control-plane). Other useful documentation:
* [Plan for Swarm in production](plan-for-production.md)
* [Swarm and container networks](networking.md)
* [High availability in Docker Swarm](multi-manager-setup.md)