Merge pull request #14158 from olemarkus/fix-etcd-docs

Update and clean up etcdcli and etcd backup documentation
Kubernetes Prow Robot 2022-08-21 02:45:42 -07:00 committed by GitHub
commit dc79885536
3 changed files with 35 additions and 160 deletions


@@ -2,17 +2,10 @@
## etcd-manager
[etcd-manager](https://github.com/kubernetes-sigs/etcdadm/tree/master/etcd-manager) is a kubernetes-sigs project that kOps uses to manage etcd.
It handles graceful upgrades of etcd, TLS, and backups. If a Kubernetes cluster needs a more redundant control plane, it also takes care of resizing the etcd cluster (e.g. going from 1 to 3 control plane nodes).
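etcd-manager runs as static pods on the control plane nodes, one per etcd cluster (by default `main` and `events`). A quick way to find them and peek at their logs (the grep pattern and pod name below are illustrative, matching the default naming):
```bash
# List the etcd-manager pods
kubectl get pods -n kube-system | grep etcd-manager
# Tail the logs of one of them; substitute a pod name from the list above
kubectl logs -n kube-system etcd-manager-main-<control-plane-node-name> --tail=50
```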
## Backups
@@ -22,89 +15,37 @@ Backups and restores of etcd on kOps are covered in [etcd_backup_restore_encrypt
It's not typically necessary to view or manipulate the data inside of etcd directly with etcdctl, because all operations usually go through kubectl commands. However, it can be informative during troubleshooting, or just to understand kubernetes better. Here are the steps to accomplish that on kOps.
1\. Determine which version of etcd is running
```bash
kops get cluster --full -o yaml
```
Look at the `etcdCluster` configuration's `version` for the given cluster.
2\. Connect to an etcd-manager pod
```bash
CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER -- sh
```
3\. Run etcdctl
```bash
ETCD_VERSION=3.5.1
ETCDDIR=/opt/etcd-v$ETCD_VERSION-linux-amd64 # Replace with arm64 if you are running an arm control plane
# Alternatively, derive the directory from the running etcd process:
# ETCDDIR=$(ps -ef | grep --color=never /opt/etcd | head -n 1 | awk '{print $8}' | xargs dirname)
CERTDIR=/rootfs/srv/kubernetes/kube-apiserver
alias etcdctl="ETCDCTL_API=3 $ETCDDIR/etcdctl --cacert=$CERTDIR/etcd-ca.crt --cert=$CERTDIR/etcd-client.crt --key=$CERTDIR/etcd-client.key --endpoints=https://127.0.0.1:4001"
```
Test the client by running `etcdctl member list`.
You can dump the full contents of etcd with `etcdctl get --prefix / | tee output.txt`, and run any other etcdctl command the same way.
The contents of the etcd dump are often garbled. See the next section for a better way to view the results.
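As a sketch of other read-only queries that are often useful once the alias is in place (the key paths are only examples):
```bash
# Show each member's status, including which one is currently the leader
etcdctl endpoint status --cluster --write-out=table
# List keys under a prefix without printing the (mostly binary) values
etcdctl get --prefix --keys-only /registry/namespaces
# Fetch a single object; values are stored as protobuf, so expect binary output
etcdctl get /registry/namespaces/kube-system
```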
## Dump etcd contents in clear text
Openshift's etcdhelper is a good way of exporting the contents of etcd in a readable format. Here are the steps.
1\. SSH into a master node
You can view the IP addresses of the nodes
```
kubectl get nodes -o wide
```
and then
```
ssh admin@<IP-of-master-node>
```
2\. Install golang
in whatever manner you prefer. Here is one example:
```bash
cd /usr/local
sudo wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
sudo tar -xvf go1.13.3.linux-amd64.tar.gz
cat <<EOT >> $HOME/.profile
export GOROOT=/usr/local/go
export GOPATH=\$HOME/go
export PATH=\$GOPATH/bin:\$GOROOT/bin:\$PATH
EOT
source $HOME/.profile
which go
```
3\. Install etcdhelper
```bash
mkdir -p ~/go/src/github.com/
cd ~/go/src/github.com/
git clone https://github.com/openshift/origin openshift
cd openshift/tools/etcdhelper
go build .
sudo cp etcdhelper /usr/local/bin/etcdhelper
which etcdhelper
```
4\. Run etcdhelper
```
sudo etcdhelper -key /etc/kubernetes/pki/kube-apiserver/etcd-client.key -cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt -cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt -endpoint https://127.0.0.1:4001 dump | tee output.txt
```
The output of the command is now available in output.txt.
Other etcdhelper commands are possible, like "ls":
```
sudo etcdhelper -key /etc/kubernetes/pki/kube-apiserver/etcd-client.key -cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt -cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt -endpoint https://127.0.0.1:4001 ls
```
If successful, this should list the keys stored in etcd.
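Since the dump in output.txt is plain text, ordinary tools work on it; for example, to get a rough list of the keys that belong to the kube-system namespace (the exact dump layout may vary slightly between etcdhelper versions):
```bash
grep -o '/registry/[^"]*kube-system[^"]*' output.txt | sort -u | head -n 20
```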


@@ -15,8 +15,6 @@ result in six volumes for etcd data (one in each AZ). An EBS volume is designed
to have a [failure rate](https://aws.amazon.com/ebs/details/#AvailabilityandDurability)
of 0.1%-0.2% per year.
## Taking backups
Backups are done periodically and before cluster modifications using [etcd-manager](etcd_administration.md)
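With the default configuration, etcd-manager stores these backups in the kOps state store. Assuming an S3 state store, you can list them like this (the bucket and cluster name are placeholders):
```
aws s3 ls s3://<state-store-bucket>/<cluster-name>/backups/etcd/main/
aws s3 ls s3://<state-store-bucket>/<cluster-name>/backups/etcd/events/
```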
@@ -67,32 +65,30 @@ You can follow the progress by reading the etcd logs (`/var/log/etcd(-events).lo
on the master that is the leader of the cluster (you can find this out by checking the etcd logs on all masters).
Note that the leader might be different for the `main` and `events` clusters.
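One rough way to spot the current leader is to grep the etcd logs on each control plane node; the exact wording of the election messages differs between etcd versions, so treat this only as a starting point:
```
grep -i "leader" /var/log/etcd.log | tail -n 5
grep -i "leader" /var/log/etcd-events.log | tail -n 5
```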
## Verify master lease consistency
After the restore, you may see intermittent connections to the apiserver: the restore brings back the apiserver leases of the old control plane nodes, and [this bug](https://github.com/kubernetes/kubernetes/issues/86812) causes those old leases to get stuck. To recover, you need to remove the stale leases from etcd directly.
To verify whether you are affected, check the endpoints resource of the kubernetes apiserver, like this:
```
kubectl get endpoints/kubernetes -o yaml
```
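A rough way to compare the registered addresses with your actual control plane nodes (on older clusters the node label may be `node-role.kubernetes.io/master` instead):
```
# Addresses currently registered in the endpoints object
kubectl get endpoints -n default kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n'
# Control plane nodes that should be backing them
kubectl get nodes -l node-role.kubernetes.io/control-plane -o name
```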
If you see more addresses than control plane nodes, you will need to remove the stale ones manually inside the etcd cluster.
See [etcd administration](etcd_administration.md) for how to obtain access to the etcd cluster.
Once you have a working etcd client, check (this time inside etcd) whether you have more IPs than control plane nodes at the `/registry/masterleases/` path:
```
etcdctl get --prefix --keys-only /registry/masterleases
```
To restore stability within the cluster, delete the stale lease records and keep only the running ones. You can delete them one at a time with `etcdctl del /registry/masterleases/<OLD-IP>`, or delete all of the leases in one go:
```
etcdctl del --prefix /registry/masterleases/
```
NOTE: if you delete the leases individually, you will need to run the command once for each old IP, depending on the size of your control plane pool.
The remaining api servers will immediately recreate their own leases. Check the above-mentioned endpoints resource again to verify that the problem has been solved.
After the restore is complete, the api server should come back up and you should have a working cluster.
Note that the api server might be very busy for a while as it changes the cluster back to the state of the backup.
You might consider temporarily increasing the instance size of your control plane.
Because the state on each of the Nodes may differ from the state in etcd, it is also a good idea to do a rolling-update of the entire cluster:
@@ -100,42 +96,7 @@ Because the state on each of the Nodes may differ from the state in etcd, it is
```
kops rolling-update cluster --force --yes
```
For more information and troubleshooting, please check the [etcd-manager documentation](https://github.com/kubernetes-sigs/etcdadm/tree/master/etcd-manager).
## Backup and restore using legacy etcd
### Volume backups
If you are running your cluster in legacy etcd mode (without etcd-manager),
backups can be done through snapshots of the etcd volumes.
You can for example use CloudWatch to trigger an AWS Lambda with a defined schedule (e.g. once per
hour). The Lambda will then create a new snapshot of all etcd volumes. A complete
guide on how to set up automated snapshots can be found [here](https://serverlesscode.com/post/lambda-schedule-ebs-snapshot-backups/).
Note: this is one of many examples on how to do scheduled snapshots.
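As a minimal one-off sketch using the AWS CLI (the cluster name is a placeholder; the tag filters match the ones protokube expects, described in the next section):
```
CLUSTER=k8s.mycompany.tld
for vol in $(aws ec2 describe-volumes \
    --filters "Name=tag:KubernetesCluster,Values=$CLUSTER" "Name=tag-key,Values=k8s.io/etcd/main" \
    --query 'Volumes[].VolumeId' --output text); do
  aws ec2 create-snapshot --volume-id "$vol" --description "etcd main backup for $CLUSTER"
done
```
Repeat with the `k8s.io/etcd/events` tag to also snapshot the events cluster volumes.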
### Restore volume backups
If you're using legacy etcd (without etcd-manager), it is possible to restore the volume from a snapshot we created
earlier. Details about creating a volume from a snapshot can be found in the
[AWS documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html).
Kubernetes uses protokube to identify the right volumes for etcd. Therefore it
is important to tag the EBS volumes with the correct tags after restoring them
from an EBS snapshot.
protokube will look for the following tags:
* `KubernetesCluster` containing the cluster name (e.g. `k8s.mycompany.tld`)
* `Name` containing the volume name (e.g. `eu-central-1a.etcd-main.k8s.mycompany.tld`)
* `k8s.io/etcd/main` containing the availability zone of the volume (e.g. `eu-central-1a/eu-central-1a`)
* `k8s.io/role/master` with the value `1`
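For example, re-applying those tags to a freshly restored volume with the AWS CLI might look like this (the volume ID and values are placeholders):
```
aws ec2 create-tags --resources vol-0123456789abcdef0 --tags \
  Key=KubernetesCluster,Value=k8s.mycompany.tld \
  Key=Name,Value=eu-central-1a.etcd-main.k8s.mycompany.tld \
  Key=k8s.io/etcd/main,Value=eu-central-1a/eu-central-1a \
  Key=k8s.io/role/master,Value=1
```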
After fully restoring the volume, ensure that the old volume is no longer there, or that you've removed the tags from the old volume. After restarting the master node, Kubernetes should pick up the new volume and start running again.
## Etcd Volume Encryption


@@ -40,34 +40,7 @@ Often the issue is obvious such as passing incorrect CLI flags.
After resizing an etcd cluster or restoring a backup, the kubernetes API can contain too many endpoints.
You can confirm this by running `kubectl get endpoints -n default kubernetes`. This command should list exactly as many IPs as you have control plane nodes.
[This bug](https://github.com/kubernetes/kubernetes/issues/86812) causes old apiserver leases to get stuck. In order to recover from this you need to remove the leases from etcd directly:
```
CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER -- sh
```
etcd and etcdctl are installed into directories in /opt - look for the latest version, e.g. 3.5.1:
```
DIRNAME=/opt/etcd-v3.5.1-linux-amd64
export ETCDCTL_API=3
alias etcdctl='$DIRNAME/etcdctl --cacert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --cert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --key=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key --endpoints=https://127.0.0.1:4001'
```
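To confirm the client works before touching any data, run a couple of harmless checks first:
```
etcdctl member list
etcdctl endpoint health
```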
You can get a list of the leases, e.g.:
```
etcdctl get --prefix /registry/masterleases
```
And delete with:
```
etcdctl del /registry/masterleases/$IP_ADDRESS
```
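To decide which lease keys to keep, you can list the internal IPs of the current control plane nodes from outside the container (on older clusters the label may be `node-role.kubernetes.io/master`); any lease whose IP is not in this list is stale:
```
kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```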
Also, you can delete all of the leases in one go:
```
etcdctl del --prefix /registry/masterleases/
```
The remaining api servers will immediately recreate their own leases.
Check the [backup and restore documentation](etcd_backup_restore_encryption.md) for more details about this problem.
## etcd