mirror of https://github.com/docker/docs.git
Add to swarm admin docs

- Add details about maintaining quorum
- Add details about pending tasks

Relates to #26017
Relates to #25069

Signed-off-by: Misty Stanley-Jones <misty@docker.com>
parent 69b3a3d434
commit c4c047f5f2

@@ -40,10 +40,27 @@ fault-tolerant. However, additional manager nodes reduce write performance
because more nodes must acknowledge proposals to update the swarm state.
This means more network round-trip traffic.

Raft requires a majority of managers, also called the quorum, to agree on
proposed updates to the swarm, such as node additions or removals. Membership
operations are subject to the same constraints as state replication.

### Maintaining the quorum of managers

If the swarm loses the quorum of managers, the swarm cannot perform management
tasks. If your swarm has multiple managers, always have more than two, because
with exactly two managers, losing either one loses the quorum. To maintain the
quorum, a majority of managers must be available. An odd number of managers is
recommended, because the next even number does not make the quorum easier to
keep. For instance, whether you have 3 or 4 managers, you can still only lose 1
manager and maintain the quorum. If you have 5 or 6 managers, you can still
only lose two.

Even if a swarm loses the quorum of managers, swarm tasks on existing worker
nodes continue to run. However, swarm nodes cannot be added, updated, or
removed, and new or existing tasks cannot be started, stopped, moved, or
updated.

See [Recovering from losing the quorum](#recovering-from-losing-the-quorum) for
troubleshooting steps if you do lose the quorum of managers.

## Use a static IP for manager node advertise address

@@ -64,8 +81,8 @@ Dynamic IP addresses are OK for worker nodes.
You should maintain an odd number of managers in the swarm to support manager
node failures. Having an odd number of managers ensures that during a network
partition, there is a higher chance that the quorum remains available to process
requests if the network is partitioned into two sets. Keeping the quorum is not
guaranteed if you encounter more than two network partitions.

| Swarm Size | Majority | Fault Tolerance |

@@ -103,7 +120,7 @@ In addition to maintaining an odd number of manager nodes, pay attention to
datacenter topology when placing managers. For optimal fault-tolerance, distribute
manager nodes across a minimum of 3 availability-zones to support failures of an
entire set of machines or common maintenance scenarios. If you suffer a failure
in any of those zones, the swarm should maintain the quorum of manager nodes
available to process requests and rebalance workloads.

| Swarm manager nodes | Repartition (on 3 Availability zones) |

@@ -231,29 +248,51 @@ you demote or remove a manager
## Recover from disaster

Swarm is resilient to failures and the swarm can recover from any number
of temporary node failures (machine reboots or crash with restart) or other
transient errors. However, a swarm cannot automatically recover if it loses a
quorum. Tasks on existing worker nodes will continue to run, but administrative
tasks are not possible, including scaling or updating services and joining or
removing nodes from the swarm. The best way to recover is to bring the missing
manager nodes back online. If that is not possible, continue reading for some
options for recovering your swarm.

In a swarm of `N` managers, a quorum (a majority) of manager nodes must always
be available. For example, in a swarm with 5 managers, a minimum of 3 must be
operational and in communication with each other. In other words, the swarm can
tolerate up to `(N-1)/2` permanent failures beyond which requests involving
swarm management cannot be processed. These types of failures include data
corruption or hardware failures.

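You can check the arithmetic directly. A minimal shell sketch (the manager counts are illustrative):

```bash
# Quorum math: majority = floor(N/2) + 1, tolerated failures = floor((N-1)/2)
for N in 1 3 5 7; do
  echo "managers=$N majority=$(( N / 2 + 1 )) tolerated_failures=$(( (N - 1) / 2 ))"
done
```
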
### Recovering from losing the quorum

If you lose the quorum of managers, you cannot administer the swarm. If you have
lost the quorum and you attempt to perform any management operation on the swarm,
an error occurs:

```no-highlight
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
```

The best way to recover from losing the quorum is to bring the failed nodes back
online. If you can't do that, the only way to recover from this state is to use
the `--force-new-cluster` action from a manager node. This removes all managers
except the manager the command was run from. The quorum is achieved because
there is now only one manager. Promote nodes to be managers until you have the
desired number of managers.

```bash
# From the node to recover
docker swarm init --force-new-cluster --advertise-addr node01:2377
```

When you run the `docker swarm init` command with the `--force-new-cluster`
flag, the Docker Engine where you run the command becomes the manager node of a
single-node swarm which is capable of managing and running services. The manager
has all the previous information about services and tasks, worker nodes are
still part of the swarm, and services are still running. You will need to add or
re-add manager nodes to achieve your previous task distribution and ensure that
you have enough managers to maintain high availability and prevent losing the
quorum.

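For example, once the recovered manager is healthy, you might promote workers back to managers; `node02` and `node03` are illustrative node names:

```bash
# Run from the recovered manager node
docker node promote node02 node03
```
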
### Forcing the swarm to rebalance

@@ -267,11 +306,16 @@ balance across the swarm. When new tasks start, or when a node with running
tasks becomes unavailable, those tasks are given to less busy nodes. The goal
is eventual balance, with minimal disruption to the end user.

In Docker 1.13 and higher, you can use the `--force` or `-f` flag with the
`docker service update` command to force the service to redistribute its tasks
across the available worker nodes. This causes the service tasks to restart,
and client applications may be disrupted. If you have configured a
[rolling update](swarm-tutorial.md#rolling-update), the tasks restart according
to that policy.

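A minimal sketch of forcing redistribution, assuming Docker 1.13 or higher and a service named `helloworld`:

```bash
# Restarts the service's tasks and spreads them across available nodes
docker service update --force helloworld
```
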
If you use an earlier version and you want to achieve an even balance of load
across workers and don't mind disrupting running tasks, you can force your swarm
to re-balance by temporarily scaling the service upward. Use
`docker service inspect --pretty <servicename>` to see the configured scale
of a service. When you use `docker service scale`, the nodes with the lowest
number of tasks are targeted to receive the new workloads. There may be multiple
under-loaded nodes in your swarm. You may need to scale the service up by modest
@@ -283,4 +327,4 @@ balance of your service across nodes.
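The temporary scale-up approach might look like the following sketch; the service name and replica counts are illustrative, and you scale back down once the extra tasks have landed on the under-loaded nodes:

```bash
# Check the currently configured scale
docker service inspect --pretty helloworld

# Scale up so under-loaded nodes receive new tasks, then scale back down
docker service scale helloworld=7
docker service scale helloworld=5
```
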
See also
[`docker service scale`](../reference/commandline/service_scale.md) and
[`docker service ps`](../reference/commandline/service_ps.md).

@@ -68,6 +68,36 @@ schedules tasks to worker nodes.
![services diagram](images/services-diagram.png)

### Pending services

A service may be configured in such a way that no node currently in the
swarm can run its tasks. In this case, the service remains in state `pending`.
Here are a few examples of when a service might remain in state `pending`.

**Note**: If your only intention is to prevent a service from
being deployed, scale the service to 0 instead of trying to configure it in
such a way that it will remain in `pending`.

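Scaling to 0 might look like this; `myservice` is an illustrative name:

```bash
# The service definition is kept, but no tasks run
docker service scale myservice=0
```
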
- If all nodes are paused or drained, and you create a service, it will be
pending until a node becomes available. In reality, the first node to become
available will get all of the tasks, so this is not a good thing to do in a
production environment.

- You can reserve a specific amount of memory for a service. If no node in the
swarm has the required amount of memory, the service will remain in a pending
state until a node is available which can run its tasks. If you specify a very
large value, such as 500 GB, the task will be pending forever, unless you
really have a node which can satisfy it. See the sketch after this list.

- You can impose placement constraints on the service, and the constraints may
not be able to be honored at a given time.

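As a sketch of the memory-reservation case above (the service name, image, and size are illustrative):

```bash
# If no node has 500GB of memory free to reserve, the task stays pending
docker service create --name big --reserve-memory 500GB nginx

# The task remains in the pending state until a node can satisfy the reservation
docker service ps big
```
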
This behavior illustrates that the requirements and configuration of your tasks
are not tightly tied to the current state of the swarm. As the administrator of
a swarm, you declare the desired state of your swarm, and the manager works with
the nodes in the swarm to create that state. You do not need to micro-manage the
tasks on the swarm.

## Replicated and global services

There are two types of service deployments, replicated and global.

@@ -91,4 +121,4 @@ in gray.
## Learn More

* Read about how swarm mode [nodes](nodes.md) work.
* Learn how [PKI](pki.md) works in swarm mode.

@@ -87,9 +87,9 @@ Engine Version: 1.12.0-dev
You can modify node attributes as follows:

* [change node availability](#change-node-availability)
* [add or remove label metadata](#add-or-remove-label-metadata)
* [change a node role](#promote-or-demote-a-node)

### Change node availability

@@ -109,7 +109,7 @@ $ docker node update --availability drain node-1
node-1
```

See [list nodes](#list-nodes) for descriptions of the different availability
options.

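To make a drained node eligible for tasks again, set its availability back to `active` (a sketch reusing the node name from the example above):

```bash
docker node update --availability active node-1
```
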
### Add or remove label metadata

@@ -143,9 +143,9 @@ You can promote a worker node to the manager role. This is useful when a
manager node becomes unavailable or if you want to take a manager offline for
maintenance. Similarly, you can demote a manager node to the worker role.

>**Note: Maintaining a quorum** Regardless of your reason to promote or demote
a node, you must always maintain a quorum of manager nodes in the
swarm. For more information refer to the [Swarm administration guide](admin_guide.md).

To promote a node or set of nodes, run `docker node promote` from a manager
node:

@@ -209,4 +209,4 @@ node-2
* [Swarm administration guide](admin_guide.md)
* [Docker Engine command line reference](../reference/commandline/index.md)
* [Swarm mode tutorial](swarm-tutorial/index.md)

@@ -70,7 +70,19 @@ $ docker service create --name helloworld alpine ping docker.com
9uk4639qpg7npwf3fn2aasksr
```

## Configuring services

When you create a service, you can specify many different configuration options
and constraints. See the output of `docker service create --help` for a full
listing of them. Some common configuration options are described below.

Created services do not always run right away. A service can be in a pending
state if its image is unavailable, no node meets the requirements you configure
for the service, or other reasons. See
[Pending services](how-swarm-mode-works/services.md#pending-services) for more
information.

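To check whether a new service's tasks are running or still pending, you can list its tasks; `helloworld` is the service created earlier:

```bash
# Shows each task's desired and current state
docker service ps helloworld
```
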
### Configure the runtime environment

You can configure the following options for the runtime environment in the
container:

@@ -91,7 +103,7 @@ $ docker service create --name helloworld \
9uk4639qpg7npwf3fn2aasksr
```

### Control service scale and placement

Swarm mode has two types of services, replicated and global. For replicated
services, you specify the number of replica tasks for the swarm manager to
@@ -121,15 +133,22 @@ deploys a service to the node. You can apply constraints to the
service based upon node attributes and metadata or engine metadata. For more
information on constraints, refer to the `docker service create` [CLI reference](../reference/commandline/service_create.md).

### Reserving memory or number of CPUs for a service

To reserve a given amount of memory or number of CPUs for a service, use the
`--reserve-memory` or `--reserve-cpu` flags. If no available nodes can satisfy
the requirement (for instance, if you request 4 CPUs and no node in the swarm
has 4 CPUs), the service remains in a pending state until a node is available to
run its tasks.

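A sketch of creating a service with reservations; the name, amounts, and image are illustrative:

```bash
# Each task only schedules on a node with 2 CPUs and 512MB of memory to spare
docker service create --name redis_2 \
  --reserve-cpu 2 --reserve-memory 512MB \
  redis:3.0.6
```
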
### Configure service networking options

Swarm mode lets you network services in a couple of ways:

* publish ports externally to the swarm using ingress networking
* connect services and tasks within the swarm using overlay networks

#### Publish ports externally to the swarm

You publish service ports externally to the swarm using the `--publish <TARGET-PORT>:<SERVICE-PORT>`
flag. When you publish a service port, the swarm
@@ -178,7 +197,7 @@ Commercial support is available at
</html>
```

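A sketch of publishing a port when creating a service; the name, ports, and image are illustrative:

```bash
# Requests to port 8080 on any swarm node reach the service's port 80
docker service create --name my-web --publish 8080:80 --replicas 2 nginx
```
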
#### Add an overlay network

Use overlay networks to connect one or more services within the swarm.

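A sketch of creating an overlay network and attaching a service to it; `my-network` and the service details are illustrative:

```bash
docker network create --driver overlay my-network
docker service create --name my-service --network my-network nginx
```
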
@@ -213,7 +232,7 @@ For more information on overlay networking and service discovery, refer to
[Attach services to an overlay network](networking.md). See also
[Docker swarm mode overlay network security model](../userguide/networking/overlay-security-model.md).

### Configure update behavior

When you create a service, you can specify a rolling update behavior for how the
swarm should apply changes to the service when you run `docker service update`.

@@ -251,7 +270,7 @@ $ docker service create \
0u6a4s31ybk7yw2wyvtikmu50
```

### Configure mounts

You can create two types of mounts for services in a swarm, `volume` mounts or
`bind` mounts. You pass the `--mount` flag when you create a service. The