Include how to route away from broken etcd in etcd maintenance docs (#35882)

* Include how to route away from broken etcd in etcd maintenance docs

* Apply suggestions from code review

Apply suggestions and use 1. for all numbering (markdown will set the numbering automatically this way)

Co-authored-by: Han Kang <hankang@google.com>
Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>

* Update content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md

Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>

Co-authored-by: Han Kang <hankang@google.com>
Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>
2022-08-15 22:27:07 -04:00 · 2022-08-15 22:27:07 -04:00 · 263fc03201
parent 59cd910ec5
commit 263fc03201
1 changed file with 27 additions and 9 deletions
--- a/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md
+++ b/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md
@@ -2,6 +2,7 @@
 reviewers:
 - mml
 - wojtek-t
+- jpbetz
 title: Operating etcd clusters for Kubernetes
 content_type: task
 ---
@@ -187,7 +188,21 @@ replace it with `member4=http://10.0.0.4`.
   fd422379fda50e48, started, member3, http://10.0.0.3:2380, http://10.0.0.3:2379
   ```

-2. Remove the failed member:
+1. Do either of the following:
+
+   1. If each Kubernetes API server is configured to communicate with all etcd
+      members, remove the failed member from the `--etcd-servers` flag, then
+      restart each Kubernetes API server.
+   1. If each Kubernetes API server communicates with a single etcd member,
+      then stop the Kubernetes API server that communicates with the failed
+      etcd.
+
+1. Stop the etcd server on the broken node. It is possible that clients
+   other than the Kubernetes API server are sending traffic to etcd, and
+   it is desirable to stop all traffic to prevent writes to the data
+   directory.
+
+1. Remove the failed member:

   ```shell
   etcdctl member remove 8211f1d0f64f3269
@@ -199,7 +214,7 @@ replace it with `member4=http://10.0.0.4`.
   Removed member 8211f1d0f64f3269 from cluster
   ```

-3. Add the new member:
+1. Add the new member:

   ```shell
   etcdctl member add member4 --peer-urls=http://10.0.0.4:2380
@@ -211,7 +226,7 @@ replace it with `member4=http://10.0.0.4`.
   Member 2be1eb8f84b7f63e added to cluster ef37ad9dc622a7c4
   ```

-4. Start the newly added member on a machine with the IP `10.0.0.4`:
+1. Start the newly added member on a machine with the IP `10.0.0.4`:

   ```shell
   export ETCD_NAME="member4"
@@ -220,13 +235,16 @@ replace it with `member4=http://10.0.0.4`.
   etcd [flags]
   ```

-5. Do either of the following:
+1. Do either of the following:

-   1. Update the `--etcd-servers` flag for the Kubernetes API servers to make
-      Kubernetes aware of the configuration changes, then restart the
-      Kubernetes API servers.
-   2. Update the load balancer configuration if a load balancer is used in the
-      deployment.
+   1. If each Kubernetes API server is configured to communicate with all etcd
+      members, add the newly added member to the `--etcd-servers` flag, then
+      restart each Kubernetes API server.
+   1. If each Kubernetes API server communicates with a single etcd member,
+      start the Kubernetes API server that was stopped in step 2. Then
+      configure Kubernetes API server clients to again route requests to the
+      Kubernetes API server that was stopped. This can often be done by
+      configuring a load balancer.

 For more information on cluster reconfiguration, see
 [etcd reconfiguration documentation](https://etcd.io/docs/current/op-guide/runtime-configuration/#remove-a-member).
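
The `--etcd-servers` edits described in this change (drop the failed member before removing it, add the replacement once it has started) can be sketched as plain string manipulation. This is a minimal illustration only, not part of the commit: `remove_endpoint` and `add_endpoint` are hypothetical helper names, the initial three-member list is assumed from the IP pattern in the examples, and the client URL `http://10.0.0.4:2379` for the new member is an assumption (the docs only give its peer URL). How the resulting value reaches each Kubernetes API server, and how the server is restarted, depends on the deployment and is not shown.

```shell
#!/bin/sh
# Hypothetical helpers for computing a new --etcd-servers value.
# They only transform a comma-separated list; they do not touch a cluster.

# remove_endpoint LIST ENDPOINT: print LIST without any exact match of ENDPOINT.
remove_endpoint() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd, -
}

# add_endpoint LIST ENDPOINT: print LIST with ENDPOINT appended.
add_endpoint() {
  printf '%s,%s\n' "$1" "$2"
}

# Assumed starting value: one client URL per etcd member.
servers="http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379"

# Before removing the failed member: stop routing API server traffic to it.
servers=$(remove_endpoint "$servers" "http://10.0.0.1:2379")
echo "--etcd-servers=$servers"

# After the replacement member has started: route to it again.
# The client URL below is an assumption, not taken from the docs.
servers=$(add_endpoint "$servers" "http://10.0.0.4:2379")
echo "--etcd-servers=$servers"
```

Matching on the whole endpoint (`grep -x -F`) rather than a substring avoids accidentally dropping an address such as `10.0.0.10` when removing `10.0.0.1`.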