From 263fc03201352488ff5efee2bbbad3bcc6cd1e6b Mon Sep 17 00:00:00 2001
From: Joe Betz <jpbetz@google.com>
Date: Mon, 15 Aug 2022 22:27:07 -0400
Subject: [PATCH] Include how to route away from broken etcd in etcd
 maintenance docs (#35882)

* Include how to route away from broken etcd in etcd maintenance docs

* Apply suggestions from code review

Apply suggestions and use 1. for all numbering (markdown will set the numbering automatically this way)

Co-authored-by: Han Kang <hankang@google.com>
Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>

* Update content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md

Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>

Co-authored-by: Han Kang <hankang@google.com>
Co-authored-by: Jihoon Seo <46767780+jihoon-seo@users.noreply.github.com>
---
 .../configure-upgrade-etcd.md                 | 36 ++++++++++++++-----
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md b/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md
index be77074dc1..3a5771e36e 100644
--- a/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md
+++ b/content/en/docs/tasks/administer-cluster/configure-upgrade-etcd.md
@@ -2,6 +2,7 @@
 reviewers:
 - mml
 - wojtek-t
+- jpbetz
 title: Operating etcd clusters for Kubernetes
 content_type: task
 ---
@@ -187,7 +188,21 @@ replace it with `member4=http://10.0.0.4`.
    fd422379fda50e48, started, member3, http://10.0.0.3:2380, http://10.0.0.3:2379
    ```
 
-2. Remove the failed member:
+1. Do either of the following:
+
+   1. If each Kubernetes API server is configured to communicate with all etcd
+      members, remove the failed member from the `--etcd-servers` flag, then
+      restart each Kubernetes API server.
+   1. If each Kubernetes API server communicates with a single etcd member,
+      then stop the Kubernetes API server that communicates with the failed
+      etcd.
+
+1. Stop the etcd server on the broken node. It is possible that clients
+   other than the Kubernetes API server are sending traffic to etcd, and
+   it is desirable to stop all traffic to prevent writes to the data
+   directory.
+
+1. Remove the failed member:
 
    ```shell
    etcdctl member remove 8211f1d0f64f3269
@@ -199,7 +214,7 @@ replace it with `member4=http://10.0.0.4`.
    Removed member 8211f1d0f64f3269 from cluster
    ```
 
-3. Add the new member:
+1. Add the new member:
 
    ```shell
    etcdctl member add member4 --peer-urls=http://10.0.0.4:2380
@@ -211,7 +226,7 @@ replace it with `member4=http://10.0.0.4`.
    Member 2be1eb8f84b7f63e added to cluster ef37ad9dc622a7c4
    ```
 
-4. Start the newly added member on a machine with the IP `10.0.0.4`:
+1. Start the newly added member on a machine with the IP `10.0.0.4`:
 
    ```shell
    export ETCD_NAME="member4"
@@ -220,13 +235,16 @@ replace it with `member4=http://10.0.0.4`.
    etcd [flags]
    ```
 
-5. Do either of the following:
+1. Do either of the following:
 
-   1. Update the `--etcd-servers` flag for the Kubernetes API servers to make
-      Kubernetes aware of the configuration changes, then restart the
-      Kubernetes API servers.
-   2. Update the load balancer configuration if a load balancer is used in the
-      deployment.
+   1. If each Kubernetes API server is configured to communicate with all etcd
+      members, add the newly added member to the `--etcd-servers` flag, then
+      restart each Kubernetes API server.
+   1. If each Kubernetes API server communicates with a single etcd member,
+      start the Kubernetes API server that was stopped in step 2. Then
+      configure Kubernetes API server clients to again route requests to the
+      Kubernetes API server that was stopped. This can often be done by
+      configuring a load balancer.
 
 For more information on cluster reconfiguration, see
 [etcd reconfiguration documentation](https://etcd.io/docs/current/op-guide/runtime-configuration/#remove-a-member).