Add docs for backup of etcd in a cluster

ref #1506
Jakob Jarosch 2017-01-17 20:56:54 +01:00
parent 186e23b5bf
commit 5e216ca57a
1 changed file with 51 additions and 0 deletions

docs/etcd_backup.md Normal file

@@ -0,0 +1,51 @@
# Backing up etcd
Kubernetes relies on etcd for state storage. More details about the usage
can be found [here](https://kubernetes.io/docs/admin/etcd/) and
[here](https://coreos.com/etcd/docs/2.3.7/index.html).
## Backup requirement
A Kubernetes cluster deployed with kops stores the etcd state in two different
AWS EBS volumes per master node. One volume is used to store the main
Kubernetes data, the other one for events. For an HA setup with three master
nodes this results in six volumes for etcd data (one main and one events
volume in each AZ). An EBS volume is designed to have a
[failure rate](https://aws.amazon.com/ebs/details/#AvailabilityandDurability)
of 0.1%-0.2% per year.
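The following sketch shows one way to list these volumes with boto3. The
cluster name `k8s.mycompany.tld` and the `eu-central-1` region are the
placeholders used throughout this document; adjust both for your environment.

```python
# List the etcd EBS volumes of a kops cluster (sketch; cluster name and
# region below are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:KubernetesCluster", "Values": ["k8s.mycompany.tld"]}]
)["Volumes"]

for volume in volumes:
    name = next((t["Value"] for t in volume.get("Tags", []) if t["Key"] == "Name"), "")
    # kops names the etcd volumes "<az>.etcd-main.<cluster>" and
    # "<az>.etcd-events.<cluster>".
    if ".etcd-" in name:
        print(volume["VolumeId"], name, volume["AvailabilityZone"])
```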
## Create volume backups
Kubernetes currently does not provide any option to create regular backups of
etcd out of the box.
Therefore you either have to back up the etcd volumes manually on a regular
basis or use other AWS services to do this in an automated, scheduled way. For
example, you can use CloudWatch to trigger an AWS Lambda function on a defined
schedule (e.g. once per hour). The Lambda function then creates a new snapshot
of all etcd volumes. A complete guide on how to set up automated snapshots can
be found [here](https://serverlesscode.com/post/lambda-schedule-ebs-snapshot-backups/).
Note: this is only one of many ways to do scheduled snapshots.
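A minimal sketch of such a Lambda handler is shown below. It assumes the same
placeholder cluster name and region as above and simply snapshots every etcd
volume of the cluster; scheduling is handled by the CloudWatch rule described
in the linked guide.

```python
# Sketch of a Lambda handler that snapshots all etcd volumes of a cluster.
# Cluster name and region are placeholders; adjust them for your setup.
import boto3

CLUSTER_NAME = "k8s.mycompany.tld"
REGION = "eu-central-1"


def handler(event, context):
    ec2 = boto3.client("ec2", region_name=REGION)
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "tag:KubernetesCluster", "Values": [CLUSTER_NAME]}]
    )["Volumes"]

    for volume in volumes:
        name = next((t["Value"] for t in volume.get("Tags", []) if t["Key"] == "Name"), "")
        if ".etcd-" not in name:
            continue  # skip non-etcd volumes of the cluster
        ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description="etcd backup of {}".format(name),
        )
```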
## Restore volume backups
In case the Kubernetes cluster fails in a way that too many master nodes can
no longer access their etcd volumes, it is impossible to get an etcd quorum.
In this case it is possible to restore the volumes from the snapshots created
earlier. Details about creating a volume from a snapshot can be found in the
[AWS documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html).
Kubernetes uses protokube to identify the right volumes for etcd. It is
therefore important to tag the EBS volumes with the correct tags after
restoring them from an EBS snapshot.
protokube will look for the following tags (see the sketch after this list):
* `KubernetesCluster` containing the cluster name (e.g. `k8s.mycompany.tld`)
* `Name` containing the volume name (e.g. `eu-central-1a.etcd-main.k8s.mycompany.tld`)
* `k8s.io/etcd/main` containing the availability zone of the volume (e.g. `eu-central-1a/eu-central-1a`)
* `k8s.io/role/master` with the value `1`
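The sketch below creates a volume from a snapshot and re-applies these tags.
The snapshot ID, cluster name, and availability zone are placeholders taken
from the examples above.

```python
# Sketch: restore an etcd volume from a snapshot and tag it so protokube can
# find it. All IDs, names, and zones below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # the snapshot to restore
    AvailabilityZone="eu-central-1a",     # must match the master's AZ
    VolumeType="gp2",
)

ec2.create_tags(
    Resources=[volume["VolumeId"]],
    Tags=[
        {"Key": "KubernetesCluster", "Value": "k8s.mycompany.tld"},
        {"Key": "Name", "Value": "eu-central-1a.etcd-main.k8s.mycompany.tld"},
        {"Key": "k8s.io/etcd/main", "Value": "eu-central-1a/eu-central-1a"},
        {"Key": "k8s.io/role/master", "Value": "1"},
    ],
)
```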
After fully restoring the volume, ensure that the old volume is no longer
present, or that you have removed the tags from the old volume. After
restarting the master node, Kubernetes should pick up the new volume and start
running again.