Merge pull request #14158 from olemarkus/fix-etcd-docs

Update and clean up etcdcli and etcd backup documentation
Kubernetes Prow Robot 2022-08-21 02:45:42 -07:00 committed by GitHub
commit dc79885536
3 changed files with 35 additions and 160 deletions


@@ -2,17 +2,10 @@
## etcd-manager
[etcd-manager](https://github.com/kubernetes-sigs/etcdadm/tree/master/etcd-manager) is a kubernetes-sigs project that kOps uses to manage etcd.
It handles graceful upgrades of etcd, TLS, and backups. If a Kubernetes cluster needs a more redundant control plane, it also takes care of resizing the etcd cluster (e.g. going from 1 to 3 control plane nodes).
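etcd-manager runs as static pods on the control plane nodes, one per etcd cluster (by default `main` and `events`). A quick way to find them and peek at their logs (the grep pattern and pod name below are illustrative, matching the default naming):
```bash
# List the etcd-manager pods
kubectl get pods -n kube-system | grep etcd-manager
# Tail the logs of one of them; substitute a pod name from the list above
kubectl logs -n kube-system etcd-manager-main-<control-plane-node-name> --tail=50
```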
## Backups
@@ -22,89 +15,37 @@ Backups and restores of etcd on kOps are covered in [etcd_backup_restore_encrypt
It's not typically necessary to view or manipulate the data inside of etcd directly with etcdctl, because all operations usually go through kubectl commands. However, it can be informative during troubleshooting, or just to understand kubernetes better. Here are the steps to accomplish that on kOps.
1\. Determine which version of etcd is running
```bash
kops get cluster --full -o yaml
```
Look at the `etcdCluster` configuration's `version` for the given cluster.
2\. Connect to an etcd-manager pod
```bash
CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER -- sh
```
3\. Run etcdctl
```bash
ETCD_VERSION=3.5.1
ETCDDIR=/opt/etcd-v$ETCD_VERSION-linux-amd64 # Replace with arm64 if you are running an arm control plane
# Alternatively, derive the directory from the running etcd process:
# ETCDDIR=$(ps -ef | grep --color=never /opt/etcd | head -n 1 | awk '{print $8}' | xargs dirname)
CERTDIR=/rootfs/srv/kubernetes/kube-apiserver
alias etcdctl="ETCDCTL_API=3 $ETCDDIR/etcdctl --cacert=$CERTDIR/etcd-ca.crt --cert=$CERTDIR/etcd-client.crt --key=$CERTDIR/etcd-client.key --endpoints=https://127.0.0.1:4001"
```
Test the client by running `etcdctl member list`.
You can dump the full contents of etcd with `etcdctl get --prefix / | tee output.txt`, and run any other etcdctl command the same way.
The contents of the etcd dump are often garbled. See the next section for a better way to view the results.
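As a sketch of other read-only queries that are often useful once the alias is in place (the key paths are only examples):
```bash
# Show each member's status, including which one is currently the leader
etcdctl endpoint status --cluster --write-out=table
# List keys under a prefix without printing the (mostly binary) values
etcdctl get --prefix --keys-only /registry/namespaces
# Fetch a single object; values are stored as protobuf, so expect binary output
etcdctl get /registry/namespaces/kube-system
```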
## Dump etcd contents in clear text
Openshift's etcdhelper is a good way of exporting the contents of etcd in a readable format. Here are the steps.
1\. SSH into a master node
You can view the IP addresses of the nodes
```
kubectl get nodes -o wide
```
and then
```
ssh admin@<IP-of-master-node>
```
2\. Install golang
in whatever manner you prefer. Here is one example:
```bash
cd /usr/local
sudo wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
sudo tar -xvf go1.13.3.linux-amd64.tar.gz
cat <<EOT >> $HOME/.profile
export GOROOT=/usr/local/go
export GOPATH=\$HOME/go
export PATH=\$GOPATH/bin:\$GOROOT/bin:\$PATH
EOT
source $HOME/.profile
which go
```
3\. Install etcdhelper
```bash
mkdir -p ~/go/src/github.com/
cd ~/go/src/github.com/
git clone https://github.com/openshift/origin openshift
cd openshift/tools/etcdhelper
go build .
sudo cp etcdhelper /usr/local/bin/etcdhelper
which etcdhelper
```
4\. Run etcdhelper
```
sudo etcdhelper -key /etc/kubernetes/pki/kube-apiserver/etcd-client.key -cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt -cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt -endpoint https://127.0.0.1:4001 dump | tee output.txt
```
The output of the command is now available in output.txt.
Other etcdhelper commands are possible, like "ls":
```
sudo etcdhelper -key /etc/kubernetes/pki/kube-apiserver/etcd-client.key -cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt -cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt -endpoint https://127.0.0.1:4001 ls
```
If successful, this should list the keys stored in etcd.
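Since the dump in output.txt is plain text, ordinary tools work on it; for example, to get a rough list of the keys that belong to the kube-system namespace (the exact dump layout may vary slightly between etcdhelper versions):
```bash
grep -o '/registry/[^"]*kube-system[^"]*' output.txt | sort -u | head -n 20
```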


@@ -15,8 +15,6 @@ result in six volumes for etcd data (one in each AZ). An EBS volume is designed
to have a [failure rate](https://aws.amazon.com/ebs/details/#AvailabilityandDurability)
of 0.1%-0.2% per year.
## Taking backups
Backups are done periodically and before cluster modifications using [etcd-manager](etcd_administration.md)
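With the default configuration, etcd-manager stores these backups in the kOps state store. Assuming an S3 state store, you can list them like this (the bucket and cluster name are placeholders):
```
aws s3 ls s3://<state-store-bucket>/<cluster-name>/backups/etcd/main/
aws s3 ls s3://<state-store-bucket>/<cluster-name>/backups/etcd/events/
```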
@@ -67,32 +65,30 @@ You can follow the progress by reading the etcd logs (`/var/log/etcd(-events).lo
on the master that is the leader of the cluster (you can find this out by checking the etcd logs on all masters).
Note that the leader might be different for the `main` and `events` clusters.
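One rough way to spot the current leader is to grep the etcd logs on each control plane node; the exact wording of the election messages differs between etcd versions, so treat this only as a starting point:
```
grep -i "leader" /var/log/etcd.log | tail -n 5
grep -i "leader" /var/log/etcd-events.log | tail -n 5
```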
## Verify master lease consistency
After the restore, you may see intermittent connections to the apiserver: the restore brings back the apiserver leases of the old control plane nodes, and [this bug](https://github.com/kubernetes/kubernetes/issues/86812) causes those old leases to get stuck. To recover, you need to remove the stale leases from etcd directly.
To verify whether you are affected, check the endpoints resource of the kubernetes apiserver, like this:
```
kubectl get endpoints/kubernetes -o yaml
```
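A rough way to compare the registered addresses with your actual control plane nodes (on older clusters the node label may be `node-role.kubernetes.io/master` instead):
```
# Addresses currently registered in the endpoints object
kubectl get endpoints -n default kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n'
# Control plane nodes that should be backing them
kubectl get nodes -l node-role.kubernetes.io/control-plane -o name
```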
If you see more addresses than control plane nodes, you will need to remove the stale ones manually inside the etcd cluster.
See [etcd administration](etcd_administration.md) for how to obtain access to the etcd cluster.
Once you have a working etcd client, check (this time inside etcd) whether you have more IPs than control plane nodes at the `/registry/masterleases/` path:
```
etcdctl get --prefix --keys-only /registry/masterleases
```
To restore stability within the cluster, delete the stale lease records and keep only the running ones. You can delete them one at a time with `etcdctl del /registry/masterleases/<OLD-IP>`, or delete all of the leases in one go:
```
etcdctl del --prefix /registry/masterleases/
```
NOTE: if you delete the leases individually, you will need to run the command once for each old IP, depending on the size of your control plane pool.
The remaining api servers will immediately recreate their own leases. Check the above-mentioned endpoints resource again to verify that the problem has been solved.
After the restore is complete, the api server should come back up and you should have a working cluster.
Note that the api server might be very busy for a while as it changes the cluster back to the state of the backup.
You might consider temporarily increasing the instance size of your control plane.
Because the state on each of the Nodes may differ from the state in etcd, it is also a good idea to do a rolling-update of the entire cluster:
@@ -100,42 +96,7 @@ Because the state on each of the Nodes may differ from the state in etcd, it is
```
kops rolling-update cluster --force --yes
```
For more information and troubleshooting, please check the [etcd-manager documentation](https://github.com/kubernetes-sigs/etcdadm/tree/master/etcd-manager).
## Backup and restore using legacy etcd
### Volume backups
If you are running your cluster in legacy etcd mode (without etcd-manager),
backups can be done through snapshots of the etcd volumes.
You can for example use CloudWatch to trigger an AWS Lambda with a defined schedule (e.g. once per
hour). The Lambda will then create a new snapshot of all etcd volumes. A complete
guide on how to set up automated snapshots can be found [here](https://serverlesscode.com/post/lambda-schedule-ebs-snapshot-backups/).
Note: this is one of many examples on how to do scheduled snapshots.
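As a minimal one-off sketch using the AWS CLI (the cluster name is a placeholder; the tag filters match the ones protokube expects, described in the next section):
```
CLUSTER=k8s.mycompany.tld
for vol in $(aws ec2 describe-volumes \
    --filters "Name=tag:KubernetesCluster,Values=$CLUSTER" "Name=tag-key,Values=k8s.io/etcd/main" \
    --query 'Volumes[].VolumeId' --output text); do
  aws ec2 create-snapshot --volume-id "$vol" --description "etcd main backup for $CLUSTER"
done
```
Repeat with the `k8s.io/etcd/events` tag to also snapshot the events cluster volumes.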
### Restore volume backups
If you're using legacy etcd (without etcd-manager), it is possible to restore the volume from a snapshot we created
earlier. Details about creating a volume from a snapshot can be found in the
[AWS documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html).
Kubernetes uses protokube to identify the right volumes for etcd. Therefore it
is important to tag the EBS volumes with the correct tags after restoring them
from an EBS snapshot.
protokube will look for the following tags:
* `KubernetesCluster` containing the cluster name (e.g. `k8s.mycompany.tld`)
* `Name` containing the volume name (e.g. `eu-central-1a.etcd-main.k8s.mycompany.tld`)
* `k8s.io/etcd/main` containing the availability zone of the volume (e.g. `eu-central-1a/eu-central-1a`)
* `k8s.io/role/master` with the value `1`
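For example, re-applying those tags to a freshly restored volume with the AWS CLI might look like this (the volume ID and values are placeholders):
```
aws ec2 create-tags --resources vol-0123456789abcdef0 --tags \
  Key=KubernetesCluster,Value=k8s.mycompany.tld \
  Key=Name,Value=eu-central-1a.etcd-main.k8s.mycompany.tld \
  Key=k8s.io/etcd/main,Value=eu-central-1a/eu-central-1a \
  Key=k8s.io/role/master,Value=1
```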
After fully restoring the volume, ensure that the old volume is no longer there, or that you've removed the tags from the old volume. After restarting the master node, Kubernetes should pick up the new volume and start running again.
## Etcd Volume Encryption


@@ -40,34 +40,7 @@ Often the issue is obvious such as passing incorrect CLI flags.
After resizing an etcd cluster or restoring a backup, the kubernetes API can contain too many endpoints.
You can confirm this by running `kubectl get endpoints -n default kubernetes`. This command should list exactly as many IPs as you have control plane nodes.
[This bug](https://github.com/kubernetes/kubernetes/issues/86812) causes old apiserver leases to get stuck. In order to recover from this you need to remove the leases from etcd directly:
```
CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER -- sh
```
etcd and etcdctl are installed into directories in /opt - look for the latest version, e.g. 3.5.1:
```
DIRNAME=/opt/etcd-v3.5.1-linux-amd64
export ETCDCTL_API=3
alias etcdctl='$DIRNAME/etcdctl --cacert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --cert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --key=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key --endpoints=https://127.0.0.1:4001'
```
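To confirm the client works before touching any data, run a couple of harmless checks first:
```
etcdctl member list
etcdctl endpoint health
```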
You can get a list of the leases, e.g.:
```
etcdctl get --prefix /registry/masterleases
```
And delete with:
```
etcdctl del /registry/masterleases/$IP_ADDRESS
```
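To decide which lease keys to keep, you can list the internal IPs of the current control plane nodes from outside the container (on older clusters the label may be `node-role.kubernetes.io/master`); any lease whose IP is not in this list is stale:
```
kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```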
Also, you can delete all of the leases in one go:
```
etcdctl del --prefix /registry/masterleases/
```
The remaining api servers will immediately recreate their own leases.
Check the [backup and restore documentation](etcd_backup_restore_encryption.md) for more details about this problem.
## etcd