Enable High Availability for Orca

This makes the necessary changes in the Orca server to enable
HA deployments.
Daniel Hiltgen 2015-11-20 16:12:59 -08:00 committed by Joao Fernandes
parent c0912ad303
commit 7b1888c219
3 changed files with 63 additions and 2 deletions

high_availability.md (new file)

@@ -0,0 +1,63 @@
# Orca High Availability

This document outlines how Orca high availability works, and gives general
guidelines for deploying a highly available Orca in production.

When adding nodes to your cluster, you decide which nodes you want to
be replicas, and which nodes are simply additional engines for extra
capacity. If you are planning an HA deployment, you should have a
minimum of 3 nodes (primary + two replicas).

It is **highly** recommended that you deploy your initial 3 controller
nodes (primary + at least 2 replicas) **before** you start adding
non-replica nodes or running workloads on your cluster. If an error
occurs while adding the first replica, the cluster will become unusable.
## Architecture

* **Primary Controller** This is the first node you run the `install` against. It runs the following containers/services:
    * **orca-kv** This etcd container runs the replicated KV store
    * **orca-swarm-manager** This Swarm Manager uses the replicated KV store for leader election and cluster membership tracking
    * **orca-controller** This container runs the Orca server, using the replicated KV store for configuration state
    * **orca-swarm-join** Runs the swarm join command to periodically publish this node's existence to the KV store. If the node goes down, this publishing stops, the registration times out, and the node is automatically dropped from the cluster
    * **orca-proxy** Runs a local TLS proxy for the docker socket to enable secure access to the local docker daemon
    * **orca-swarm-ca[-proxy]** These **unreplicated** containers run the Swarm CA used for admin certificate bundles and for adding new nodes
    * **orca-ca[-proxy]** These **unreplicated** containers run the (optional) Orca CA used for signing user bundles
* **Replica Node** This is a node you `join` to the primary using the `--replica` flag; it contributes to the availability of the cluster
    * **orca-kv** This etcd container runs the replicated KV store
    * **orca-swarm-manager** This Swarm Manager uses the replicated KV store for leader election and cluster membership tracking
    * **orca-controller** This container runs the Orca server, using the replicated KV store for configuration state
    * **orca-swarm-join** Runs the swarm join command to periodically publish this node's existence to the KV store. If the node goes down, this publishing stops, the registration times out, and the node is automatically dropped from the cluster
    * **orca-proxy** Runs a local TLS proxy for the docker socket to enable secure access to the local docker daemon
* **Non-Replica Node** These nodes provide additional capacity, but do not enhance the availability of the Orca/Swarm infrastructure
    * **orca-swarm-join** Runs the swarm join command to periodically publish this node's existence to the KV store. If the node goes down, this publishing stops, the registration times out, and the node is automatically dropped from the cluster
    * **orca-proxy** Runs a local TLS proxy for the docker socket to enable secure access to the local docker daemon
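
As a rough sketch of how these roles are set up (the `orca-bootstrap` image
name and the socket mount below are illustrative assumptions; only `install`,
`join`, and `--replica` are taken from this document):

```sh
# Hypothetical bootstrapper invocations; the image name and any flags
# other than `install`, `join`, and `--replica` are assumptions.

# Node 1: install the primary controller.
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    orca-bootstrap install

# Nodes 2 and 3: join as replicas, before adding anything else.
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    orca-bootstrap join --replica

# Any further nodes: join without --replica for extra capacity only.
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    orca-bootstrap join
```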

Notes:

* At present, Orca does not include a load balancer. Users may provide one
  externally and balance load between the primary and replica nodes on port
  443, giving web access to the system via a single IP/hostname if desired
  (a sketch follows these notes). If no external load balancer is used,
  admins should note the IP/hostname of the primary and all replicas so they
  can access them when needed.
* Backups:
    * Users should always back up their volumes (see the other guides for a complete list of named volumes)
    * The CAs (Swarm and Orca) are not currently replicated:
        * Swarm CA:
            * Used for admin cert bundle generation
            * Used for adding hosts to the cluster
            * During an outage, no new admin cert bundles can be downloaded, but existing ones will still work
            * During an outage, no new nodes can be added to the cluster, but existing nodes will continue to operate
        * Orca CA:
            * Used for user bundle generation
            * Used to sign certs for new replica nodes
            * During an outage, no new user cert bundles can be downloaded, but existing ones will still work
            * During an outage, no new replica nodes can be joined to the cluster
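
For the load-balancer note above, a minimal sketch, assuming an external
HAProxy in TCP-passthrough mode (so TLS still terminates at the Orca
controllers); the container name, file path, and addresses are placeholders:

```sh
# Write a placeholder HAProxy config balancing port 443 across the
# primary and replica controllers (replace the addresses with yours).
cat > haproxy.cfg <<'EOF'
defaults
    mode tcp
    timeout connect 5s
    timeout client  50s
    timeout server  50s

frontend orca_https
    bind *:443
    default_backend orca_controllers

backend orca_controllers
    server primary  192.0.2.10:443 check
    server replica1 192.0.2.11:443 check
    server replica2 192.0.2.12:443 check
EOF

# Run the official HAProxy image with that config.
docker run -d --name orca-lb -p 443:443 \
    -v "$PWD/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro" \
    haproxy
```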

**WARNING** You should never run a cluster with only the primary
controller and a single replica. This results in an HA configuration
of "2 nodes" where quorum is also "2 nodes" (to prevent split-brain).
If either the primary or the single replica fails, the cluster will be
unusable until it is repaired, so you actually have a higher failure
probability than if you ran a non-HA setup with no replica at all. You
should have a minimum of 2 replicas (i.e., "3 nodes") so that you can
tolerate at least a single failure.
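
The arithmetic behind this warning is the standard Raft/etcd quorum rule: a
cluster of N controllers needs floor(N/2) + 1 members available to make
progress, so it tolerates N minus that many failures:

| Controllers (N) | Quorum | Failures tolerated |
|-----------------|--------|--------------------|
| 1 (primary only) | 1 | 0 |
| 2 (primary + 1 replica) | 2 | 0 |
| 3 (primary + 2 replicas) | 2 | 1 |
| 5 (primary + 4 replicas) | 3 | 2 |
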
**TODO** In the future this document should describe best practices for layout,
target number of nodes, etc. For now, that's an exercise for the reader
based on etcd/raft documentation.

@@ -133,7 +133,6 @@ If you choose this option, create your volumes prior to installing Orca. The vol
 | `orca-swarm-node-certs` | The Swarm certificates for the current node (repeated on every node in the cluster). |
 | `orca-swarm-kv-certs` | The Swarm KV client certificates for the current node (repeated on every node in the cluster). |
 | `orca-swarm-controller-certs` | The Orca Controller Swarm client certificates for the current node. |
-| `orca-config` | Orca server configuration settings (ID, locations of key services). |
 | `orca-kv` | Key value store persistence. |
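
If you pre-create these volumes before installing, a minimal sketch using the
names from the table above (on Docker versions of this era `docker volume
create` required the `--name` flag; newer versions take the name as a
positional argument):

```sh
# Pre-create the named volumes the Orca installer expects.
for v in orca-swarm-node-certs orca-swarm-kv-certs \
         orca-swarm-controller-certs orca-kv; do
    docker volume create --name "$v"
done
```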

@@ -55,7 +55,6 @@ can pre-create volumes prior to installing Orca.
 * **orca-swarm-node-certs** - The swarm certificates for the current node (repeated on every node in the cluster)
 * **orca-swarm-kv-certs** - The Swarm KV client certificates for the current node (repeated on every node in the cluster)
 * **orca-swarm-controller-certs** - The Orca Controller Swarm client certificates for the current node
-* **orca-config** - Orca server configuration settings (ID, locations of key services)
 * **orca-kv** - KV store persistence