From 551b30f31982d92e93b6f4936c59bc32a213a973 Mon Sep 17 00:00:00 2001 From: Misty Stanley-Jones Date: Wed, 8 Feb 2017 18:53:55 -0800 Subject: [PATCH] Add info about backing up swarms A little reorganization and clean-up along the way --- engine/swarm/admin_guide.md | 137 ++++++++++++++++++++++++++---------- 1 file changed, 100 insertions(+), 37 deletions(-) diff --git a/engine/swarm/admin_guide.md b/engine/swarm/admin_guide.md index 73b2b0c0aa..8e6e7a38a2 100644 --- a/engine/swarm/admin_guide.md +++ b/engine/swarm/admin_guide.md @@ -11,24 +11,11 @@ for managing the swarm and storing the swarm state. It is important to understand some key features of manager nodes in order to properly deploy and maintain the swarm. -This article covers the following swarm administration tasks: - -* [Using a static IP for manager node advertise address](#use-a-static-ip-for-manager-node-advertise-address) -* [Adding manager nodes for fault tolerance](#add-manager-nodes-for-fault-tolerance) -* [Distributing manager nodes](#distribute-manager-nodes) -* [Running manager-only nodes](#run-manager-only-nodes) -* [Backing up the swarm state](#back-up-the-swarm-state) -* [Monitoring the swarm health](#monitor-swarm-health) -* [Troubleshooting a manager node](#troubleshoot-a-manager-node) -* [Forcefully removing a node](#force-remove-a-node) -* [Recovering from disaster](#recover-from-disaster) -* [Forcing the swarm to rebalance](#forcing-the-swarm-to-rebalance) - Refer to [How nodes work](/engine/swarm/how-swarm-mode-works/nodes.md) for a brief overview of Docker Swarm mode and the difference between manager and worker nodes. -## Operating manager nodes in a swarm +## Operate manager nodes in a swarm Swarm manager nodes use the [Raft Consensus Algorithm](/engine/swarm/raft.md) to manage the swarm state. You only need to understand some general concepts of Raft in @@ -45,7 +32,7 @@ Raft requires a majority of managers, also called the quorum, to agree on proposed updates to the swarm, such as node additions or removals. Membership operations are subject to the same constraints as state replication. -### Maintaining the quorum of managers +### Maintain the quorum of managers If the swarm loses the quorum of managers, the swarm cannot perform management tasks. If your swarm has multiple managers, always have more than two. In order @@ -63,7 +50,7 @@ updated. See [Recovering from losing the quorum](#recovering-from-losing-the-quorum) for troubleshooting steps if you do lose the quorum of managers. -## Use a static IP for manager node advertise address +## Configure the manager to advertise on a static IP address When initiating a swarm, you have to specify the `--advertise-addr` flag to advertise your address to other manager nodes in the swarm. For more @@ -107,7 +94,7 @@ While it is possible to scale a swarm down to a single manager node, it is impossible to demote the last manager node. This ensures you maintain access to the swarm and that the swarm can still process requests. Scaling down to a single manager is an unsafe operation and is not recommended. If -the last node leaves the swarm unexpetedly during the demote operation, the +the last node leaves the swarm unexpectedly during the demote operation, the swarm will become unavailable until you reboot the node or restart with `--force-new-cluster`. @@ -115,7 +102,7 @@ You manage swarm membership with the `docker swarm` and `docker node` subsystems. Refer to [Add nodes to a swarm](/engine/swarm/join-nodes.md) for more information on how to add worker nodes and promote a worker node to be a manager. -## Distribute manager nodes +### Distribute manager nodes In addition to maintaining an odd number of manager nodes, pay attention to datacenter topology when placing managers. For optimal fault-tolerance, distribute @@ -131,7 +118,7 @@ available to process requests and rebalance workloads. | 7 | 3-2-2 | | 9 | 3-3-3 | -## Run manager-only nodes +### Run manager-only nodes By default manager nodes also act as a worker nodes. This means the scheduler can assign tasks to a manager node. For small and non-critical swarms @@ -154,18 +141,15 @@ When you drain a node, the scheduler reassigns any tasks running on the node to other available worker nodes in the swarm. It also prevents the scheduler from assigning tasks to the node. -## Back up the swarm state +## Add worker nodes for load balancing -Docker manager nodes store the swarm state and manager logs in the following -directory: - -```bash -/var/lib/docker/swarm/raft -``` - -Back up the `raft` data directory often so that you can use it in case of -[disaster recovery](#recover-from-disaster). Then you can take the `raft` -directory of one of the manager nodes to restore to a new swarm. +[Add nodes to the swarm](/engine/swarm/join-nodes.md) to balance your swarm's +load. Replicated service tasks will be distributed across the swarm as evenly as +possible over time, as long as the worker nodes are matched to the requirements +of the services. When limiting a service to run on only specific types of nodes, +such as nodes with a specific number of CPUs or amount of memory, remember that +worker nodes that do not meet these requirements will not be able to run these +tasks. ## Monitor swarm health @@ -232,12 +216,14 @@ To cleanly re-join a manager node to a cluster: For more information on joining a manager node to a swarm, refer to [Join nodes to a swarm](/engine/swarm/join-nodes.md). -## Force remove a node +## Forcibly remove a node -In most cases, you should shut down a node before removing it from a swarm with the `docker node rm` command. If a node becomes unreachable, unresponsive, or compromised you can forcefully remove the node without shutting it down by passing the `--force` flag. For instance, if `node9` becomes compromised: +In most cases, you should shut down a node before removing it from a swarm with +the `docker node rm` command. If a node becomes unreachable, unresponsive, or +compromised you can forcefully remove the node without shutting it down by +passing the `--force` flag. For instance, if `node9` becomes compromised: - -``` +```none $ docker node rm node9 Error response from daemon: rpc error: code = 9 desc = node node9 is not down and can't be removed @@ -251,8 +237,87 @@ Before you forcefully remove a manager node, you must first demote it to the worker role. Make sure that you always have an odd number of manager nodes if you demote or remove a manager +## Back up the swarm + +Docker manager nodes store the swarm state and manager logs in the +`/var/lib/docker/swarm/` directory. In 1.13 and higher, this data includes the +keys used to encrypt the Raft logs. Without these keys, you will not be able +to restore the swarm. + +You can back up the swarm using any manager. Use the following procedure. + +1. If the swarm has auto-lock enabled, you will need the unlock key in order + to restore the swarm from backup. Retrieve the unlock key if necessary and + store it in a safe location. If you are unsure, read + [Lock your swarm to protect its encryption key](/engine/swarm/swarm_manager_locking.md). + +2. Stop Docker on the manager before backing up the data, so that no data is + being changed during the backup. It is possible to take a backup while the + manager is running (a "hot" backup), but this is not recommended and your + results will be less predictable when restoring. While the manager is down, + other nodes will continue generating swarm data that will not be part of + this backup. + + > **Note**: Be sure to maintain the quorum of swarm managers. During the + > time that a manager is shut down, your swarm is more vulnerable to + > losing the quorum if further nodes are lost. The number of managers you + > run is a trade-off. If you regularly take down managers to do backups, + > consider running a 5-manager swarm, so that you can lose an additional + > manager while the backup is running, without disrupting your services. + +3. Back up the entire `/var/lib/docker/swarm` directory. + +4. Restart the manager. + +To restore, see [Restore from a backup](#restore-from-a-backup). + ## Recover from disaster +### Restore from a backup + +After backing up the swarm as described in +[Backing up the swarm](#backing-up-the-swarm), use the following procedure to +restore the data to a new swarm. + +1. Shut down Docker on the target host machine where the swarm will be restored. + +3. Remove the contents of the `/var/lib/docker/swarm` directory on the new + swarm. + +4. Restore the `/var/lib/docker/swarm` directory with the contents of the + backup. + + > **Note**: The new node will use the same encryption key for on-disk + > storage as the old one. It is not possible to change the on-disk storage + > encryption keys at this time. + > + > In the case of a swarm with auto-lock enabled, the unlock key is also the + > same as on the the old swarm, and the unlock key will be needed to + > restore. + +5. Start Docker on the new node. Unlock the swarm if necessary. Re-initialize + the swarm using the following command, so that this node does not attempt + to connect to nodes that were part of the old swarm, and presumably no + longer exist. + + ```bash + $ docker swarm init --force-new-cluster + ``` + +6. Verify that the state of the swarm is as expected. This may include + application-specific tests or simply checking the output of + `docker service ls` to be sure that all expected services are present. + +7. If you use auto-lock, + [rotate the unlock key](/engine/swarm/swarm_manager_locking.md#rotate-the-unlock-key). + +8. Add manager and worker nodes to bring your new swarm up to operating + capacity. + +9. Reinstate your previous backup regimen on the new swarm. + +### Recover from losing the quorum + Swarm is resilient to failures and the swarm can recover from any number of temporary node failures (machine reboots or crash with restart) or other transient errors. However, a swarm cannot automatically recover if it loses a @@ -269,8 +334,6 @@ tolerate up to `(N-1)/2` permanent failures beyond which requests involving swarm management cannot be processed. These types of failures include data corruption or hardware failures. -### Recovering from losing the quorum - If you lose the quorum of managers, you cannot administer the swarm. If you have lost the quorum and you attempt to perform any management operation on the swarm, an error occurs: @@ -300,7 +363,7 @@ re-add manager nodes to achieve your previous task distribution and ensure that you have enough managers to maintain high availability and prevent losing the quorum. -## Forcing the swarm to rebalance +## Force the swarm to rebalance Generally, you do not need to force the swarm to rebalance its tasks. When you add a new node to a swarm, or a node reconnects to the swarm after a