Reorg the monitoring task section (#32823)

* reorg the monitoring task section

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* reorg from review comments

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* review comments

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* review fixes

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
Paul Schweigert 2022-04-26 00:30:51 -04:00 committed by GitHub
parent 5521d32c12
commit f26e8eff23
20 changed files with 656 additions and 765 deletions

View File

@ -1,6 +0,0 @@
---
title: "Monitoring, Logging, and Debugging"
description: Set up monitoring and logging to troubleshoot a cluster, or debug a containerized application.
weight: 80
---

View File

@ -1,124 +0,0 @@
---
reviewers:
- davidopp
title: Troubleshoot Clusters
content_type: concept
---
<!-- overview -->
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the
problem you are experiencing. See
the [application troubleshooting guide](/docs/tasks/debug-application-cluster/debug-application) for tips on application debugging.
You may also visit [troubleshooting document](/docs/tasks/debug-application-cluster/troubleshooting/) for more information.
<!-- body -->
## Listing your cluster
The first thing to debug in your cluster is whether your nodes are all registered correctly.
Run
```shell
kubectl get nodes
```
Verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
To get detailed information about the overall health of your cluster, you can run:
```shell
kubectl cluster-info dump
```
## Looking at logs
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
of the relevant log files. (note that on systemd-based systems, you may need to use `journalctl` instead)
### Master
* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
* `/var/log/kube-controller-manager.log` - Controller that manages replication controllers
### Worker Nodes
* `/var/log/kubelet.log` - Kubelet, responsible for running containers on the node
* `/var/log/kube-proxy.log` - Kube Proxy, responsible for service load balancing
## A general overview of cluster failure modes
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
### Root causes:
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software
### Specific scenarios:
- Apiserver VM shutdown or apiserver crashing
- Results
- unable to stop, update, or start new pods, services, or replication controllers
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- Apiserver backing storage lost
- Results
- apiserver should fail to come up
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
- manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
- currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
- in future, these will be replicated as well and may not be co-located
- they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
- Results
- pods on that Node stop running
- Network partition
- Results
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Kubelet software fault
- Results
- crashing kubelet cannot start new pods on the node
- kubelet might delete the pods or not
- node marked unhealthy
- replication controllers start new pods elsewhere
- Cluster operator error
- Results
- loss of pods, services, etc
- loss of apiserver backing store
- users unable to read API
- etc.
### Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
- Mitigates: Apiserver VM shutdown or apiserver crashing
- Mitigates: Supporting services VM shutdown or crashes
- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
- Mitigates: Apiserver backing storage lost
- Action: Use [high-availability](/docs/setup/production-environment/tools/kubeadm/high-availability/) configuration
- Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
- Will tolerate one or more simultaneous node or component failures
- Mitigates: API server backing storage (i.e., etcd's data directory) lost
- Assumes HA (highly-available) etcd configuration
- Action: Snapshot apiserver PDs/EBS-volumes periodically
- Mitigates: Apiserver backing storage lost
- Mitigates: Some cases of operator error
- Mitigates: Some cases of Kubernetes software fault
- Action: use replication controllers and services in front of pods
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
- Action: applications (containers) designed to tolerate unexpected restarts
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault

View File

@ -1,107 +0,0 @@
---
reviewers:
- bprashanth
title: Debug Pods and ReplicationControllers
content_type: task
---
<!-- overview -->
This page shows how to debug Pods and ReplicationControllers.
## {{% heading "prerequisites" %}}
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
* You should be familiar with the basics of
{{< glossary_tooltip text="Pods" term_id="pod" >}} and with
Pods' [lifecycles](/docs/concepts/workloads/pods/pod-lifecycle/).
<!-- steps -->
## Debugging Pods
The first step in debugging a pod is taking a look at it. Check the current
state of the pod and recent events with the following command:
```shell
kubectl describe pods ${POD_NAME}
```
Look at the state of the containers in the pod. Are they all `Running`? Have
there been recent restarts?
Continue debugging depending on the state of the pods.
### My pod stays pending
If a pod is stuck in `Pending`, it means that it cannot be scheduled onto a
node. Generally this is because there are insufficient resources of one type or
another that prevent scheduling. Look at the output of the `kubectl describe
...` command above. There should be messages from the scheduler about why it
cannot schedule your pod. Reasons include:
#### Insufficient resources
You may have exhausted the supply of CPU or Memory in your cluster. In this
case you can try several things:
* Add more nodes to the cluster.
* [Terminate unneeded pods](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
to make room for pending pods.
* Check that the pod is not larger than your nodes. For example, if all
nodes have a capacity of `cpu:1`, then a pod with a request of `cpu: 1.1`
will never be scheduled.
You can check node capacities with the `kubectl get nodes -o <format>`
command. Here are some example command lines that extract the necessary
information:
```shell
kubectl get nodes -o yaml | egrep '\sname:|cpu:|memory:'
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, cap: .status.capacity}'
```
The [resource quota](/docs/concepts/policy/resource-quotas/)
feature can be configured to limit the total amount of
resources that can be consumed. If used in conjunction with namespaces, it can
prevent one team from hogging all the resources.
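As a hedged sketch of what such a quota might look like, you could cap the total compute a single namespace may request (the namespace, name, and values below are purely illustrative):
```shell
# Illustrative only: cap aggregate CPU/memory requests and limits for one namespace
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
EOF
```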
#### Using hostPort
When you bind a pod to a `hostPort` there are a limited number of places that
the pod can be scheduled. In most cases, `hostPort` is unnecessary; try using a
service object to expose your pod. If you do require `hostPort` then you can
only schedule as many pods as there are nodes in your container cluster.
### My pod stays waiting
If a pod is stuck in the `Waiting` state, then it has been scheduled to a
worker node, but it can't run on that machine. Again, the information from
`kubectl describe ...` should be informative. The most common cause of
`Waiting` pods is a failure to pull the image. There are three things to check:
* Make sure that you have the name of the image correct.
* Have you pushed the image to the repository?
* Try to manually pull the image to see if it can be pulled. For example, if you
  use Docker on your PC, run `docker pull <image>` (see the `crictl` sketch after
  this list for containerd-based nodes).
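If your nodes use containerd rather than Docker, a comparable manual check on the node itself could use `crictl` (a sketch; assumes `crictl` is installed and configured on that node):
```shell
# Run on the node: try pulling the image through the CRI runtime
crictl pull <image>
```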
### My pod is crashing or otherwise unhealthy
Once your pod has been scheduled, the methods described in [Debug Running Pods](
/docs/tasks/debug-application-cluster/debug-running-pod/) are available for debugging.
## Debugging ReplicationControllers
ReplicationControllers are fairly straightforward. They can either create pods
or they can't. If they can't create pods, then please refer to the
[instructions above](#debugging-pods) to debug your pods.
You can also use `kubectl describe rc ${CONTROLLER_NAME}` to inspect events
related to the replication controller.

View File

@ -1,333 +0,0 @@
---
reviewers:
- verb
- soltysh
title: Debug Running Pods
content_type: task
---
<!-- overview -->
This page explains how to debug Pods running (or crashing) on a Node.
## {{% heading "prerequisites" %}}
* Your {{< glossary_tooltip text="Pod" term_id="pod" >}} should already be
scheduled and running. If your Pod is not yet running, start with [Troubleshoot
Applications](/docs/tasks/debug-application-cluster/debug-application/).
* For some of the advanced debugging steps you need to know on which Node the
Pod is running and have shell access to run commands on that Node. You don't
need that access to run the standard debug steps that use `kubectl`.
<!-- steps -->
## Examining pod logs {#examine-pod-logs}
First, look at the logs of the affected container:
```shell
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
```
If your container has previously crashed, you can access the previous container's crash log with:
```shell
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
```
## Debugging with container exec {#container-exec}
If the {{< glossary_tooltip text="container image" term_id="image" >}} includes
debugging utilities, as is the case with images built from Linux and Windows OS
base images, you can run commands inside a specific container with
`kubectl exec`:
```shell
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
```
{{< note >}}
`-c ${CONTAINER_NAME}` is optional. You can omit it for Pods that only contain a single container.
{{< /note >}}
As an example, to look at the logs from a running Cassandra pod, you might run
```shell
kubectl exec cassandra -- cat /var/log/cassandra/system.log
```
You can run a shell that's connected to your terminal using the `-i` and `-t`
arguments to `kubectl exec`, for example:
```shell
kubectl exec -it cassandra -- sh
```
For more details, see [Get a Shell to a Running Container](
/docs/tasks/debug-application-cluster/get-shell-running-container/).
## Debugging with an ephemeral debug container {#ephemeral-container}
{{< feature-state state="beta" for_k8s_version="v1.23" >}}
{{< glossary_tooltip text="Ephemeral containers" term_id="ephemeral-container" >}}
are useful for interactive troubleshooting when `kubectl exec` is insufficient
because a container has crashed or a container image doesn't include debugging
utilities, such as with [distroless images](
https://github.com/GoogleContainerTools/distroless).
### Example debugging using ephemeral containers {#ephemeral-container-example}
You can use the `kubectl debug` command to add ephemeral containers to a
running Pod. First, create a pod for the example:
```shell
kubectl run ephemeral-demo --image=k8s.gcr.io/pause:3.1 --restart=Never
```
The examples in this section use the `pause` container image because it does not
contain debugging utilities, but this method works with all container
images.
If you attempt to use `kubectl exec` to create a shell you will see an error
because there is no shell in this container image.
```shell
kubectl exec -it ephemeral-demo -- sh
```
```
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
```
You can instead add a debugging container using `kubectl debug`. If you
specify the `-i`/`--interactive` argument, `kubectl` will automatically attach
to the console of the Ephemeral Container.
```shell
kubectl debug -it ephemeral-demo --image=busybox:1.28 --target=ephemeral-demo
```
```
Defaulting debug container name to debugger-8xzrl.
If you don't see a command prompt, try pressing enter.
/ #
```
This command adds a new busybox container and attaches to it. The `--target`
parameter targets the process namespace of another container. It's necessary
here because `kubectl run` does not enable [process namespace sharing](
/docs/tasks/configure-pod-container/share-process-namespace/) in the pod it
creates.
{{< note >}}
The `--target` parameter must be supported by the {{< glossary_tooltip
text="Container Runtime" term_id="container-runtime" >}}. When not supported,
the Ephemeral Container may not be started, or it may be started with an
isolated process namespace so that `ps` does not reveal processes in other
containers.
{{< /note >}}
You can view the state of the newly created ephemeral container using `kubectl describe`:
```shell
kubectl describe pod ephemeral-demo
```
```
...
Ephemeral Containers:
debugger-8xzrl:
Container ID: docker://b888f9adfd15bd5739fefaa39e1df4dd3c617b9902082b1cfdc29c4028ffb2eb
Image: busybox
Image ID: docker-pullable://busybox@sha256:1828edd60c5efd34b2bf5dd3282ec0cc04d47b2ff9caa0b6d4f07a21d1c08084
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 12 Feb 2020 14:25:42 +0100
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
...
```
Use `kubectl delete` to remove the Pod when you're finished:
```shell
kubectl delete pod ephemeral-demo
```
## Debugging using a copy of the Pod
Sometimes Pod configuration options make it difficult to troubleshoot in certain
situations. For example, you can't run `kubectl exec` to troubleshoot your
container if your container image does not include a shell or if your application
crashes on startup. In these situations you can use `kubectl debug` to create a
copy of the Pod with configuration values changed to aid debugging.
### Copying a Pod while adding a new container
Adding a new container can be useful when your application is running but not
behaving as you expect and you'd like to add additional troubleshooting
utilities to the Pod.
For example, maybe your application's container images are built on `busybox`
but you need debugging utilities not included in `busybox`. You can simulate
this scenario using `kubectl run`:
```shell
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
```
Run this command to create a copy of `myapp` named `myapp-debug` that adds a
new Ubuntu container for debugging:
```shell
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
```
```
Defaulting debug container name to debugger-w7xmf.
If you don't see a command prompt, try pressing enter.
root@myapp-debug:/#
```
{{< note >}}
* `kubectl debug` automatically generates a container name if you don't choose
one using the `--container` flag.
* The `-i` flag causes `kubectl debug` to attach to the new container by
default. You can prevent this by specifying `--attach=false`. If your session
becomes disconnected you can reattach using `kubectl attach`.
* The `--share-processes` flag allows the containers in this Pod to see processes
from the other containers in the Pod. For more information about how this
works, see [Share Process Namespace between Containers in a Pod](
/docs/tasks/configure-pod-container/share-process-namespace/).
{{< /note >}}
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
### Copying a Pod while changing its command
Sometimes it's useful to change the command for a container, for example to
add a debugging flag or because the application is crashing.
To simulate a crashing application, use `kubectl run` to create a container
that immediately exits:
```
kubectl run --image=busybox:1.28 myapp -- false
```
You can see using `kubectl describe pod myapp` that this container is crashing:
```
Containers:
myapp:
Image: busybox
...
Args:
false
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
```
You can use `kubectl debug` to create a copy of this Pod with the command
changed to an interactive shell:
```
kubectl debug myapp -it --copy-to=myapp-debug --container=myapp -- sh
```
```
If you don't see a command prompt, try pressing enter.
/ #
```
Now you have an interactive shell that you can use to perform tasks like
checking filesystem paths or running the container command manually.
{{< note >}}
* To change the command of a specific container you must
specify its name using `--container` or `kubectl debug` will instead
create a new container to run the command you specified.
* The `-i` flag causes `kubectl debug` to attach to the container by default.
You can prevent this by specifying `--attach=false`. If your session becomes
disconnected you can reattach using `kubectl attach`.
{{< /note >}}
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
### Copying a Pod while changing container images
In some situations you may want to change a misbehaving Pod from its normal
production container images to an image containing a debugging build or
additional utilities.
As an example, create a Pod using `kubectl run`:
```
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
```
Now use `kubectl debug` to make a copy and change its container image
to `ubuntu`:
```
kubectl debug myapp --copy-to=myapp-debug --set-image=*=ubuntu
```
The syntax of `--set-image` uses the same `container_name=image` syntax as
`kubectl set image`. `*=ubuntu` means change the image of all containers
to `ubuntu`.
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
## Debugging via a shell on the node {#node-shell-session}
If none of these approaches work, you can find the Node on which the Pod is
running and create a privileged Pod running in the host namespaces. To create
an interactive shell on a node using `kubectl debug`, run:
```shell
kubectl debug node/mynode -it --image=ubuntu
```
```
Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node mynode.
If you don't see a command prompt, try pressing enter.
root@ek8s:/#
```
When creating a debugging session on a node, keep in mind that:
* `kubectl debug` automatically generates the name of the new Pod based on
the name of the Node.
* The container runs in the host IPC, Network, and PID namespaces.
* The root filesystem of the Node will be mounted at `/host`.
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod node-debugger-mynode-pdx84
```

View File

@ -1,9 +1,12 @@
---
title: "Monitoring, Logging, and Debugging"
description: Set up monitoring and logging to troubleshoot a cluster, or debug a containerized application.
weight: 20
reviewers:
- brendandburns
- davidopp
content_type: concept
title: Troubleshooting
no_list: true
---
<!-- overview -->
@ -11,9 +14,9 @@ title: Troubleshooting
Sometimes things go wrong. This guide is aimed at making them right. It has
two sections:
* [Troubleshooting your application](/docs/tasks/debug-application-cluster/debug-application/) - Useful
* [Debugging your application](/docs/tasks/debug/debug-application/) - Useful
for users who are deploying code into Kubernetes and wondering why it is not working.
* [Troubleshooting your cluster](/docs/tasks/debug-application-cluster/debug-cluster/) - Useful
* [Debugging your cluster](/docs/tasks/debug/debug-cluster/) - Useful
for cluster administrators and people whose Kubernetes cluster is unhappy.
You should also check the known issues for the [release](https://github.com/kubernetes/kubernetes/releases)

View File

@ -0,0 +1,8 @@
---
title: "Troubleshooting Applications"
description: Debugging common containerized application issues.
weight: 20
---
This doc contains a set of resources for fixing issues with containerized applications. It covers common issues with Kubernetes resources (such as Pods, Services, or StatefulSets), advice on making sense of container termination messages, and ways to debug running containers.

View File

@ -9,6 +9,7 @@ reviewers:
- smarterclayton
title: Debug Init Containers
content_type: task
weight: 40
---
<!-- overview -->

View File

@ -2,15 +2,16 @@
reviewers:
- mikedanese
- thockin
title: Troubleshoot Applications
title: Debug Pods
content_type: concept
weight: 10
---
<!-- overview -->
This guide is to help users debug applications that are deployed into Kubernetes and not behaving correctly.
This is *not* a guide for people who want to debug their cluster. For that you should check out
[this guide](/docs/tasks/debug-application-cluster/debug-cluster).
[this guide](/docs/tasks/debug/debug-cluster).
<!-- body -->
@ -64,7 +65,7 @@ Again, the information from `kubectl describe ...` should be informative. The m
#### My pod is crashing or otherwise unhealthy
Once your pod has been scheduled, the methods described in [Debug Running Pods](
/docs/tasks/debug-application-cluster/debug-running-pod/) are available for debugging.
/docs/tasks/debug/debug-application/debug-running-pod/) are available for debugging.
#### My pod is running but not doing what I told it to do
@ -145,15 +146,15 @@ Verify that the pod's `containerPort` matches up with the Service's `targetPort`
#### Network traffic is not forwarded
Please see [debugging service](/docs/tasks/debug-application-cluster/debug-service/) for more information.
Please see [debugging service](/docs/tasks/debug/debug-application/debug-service/) for more information.
## {{% heading "whatsnext" %}}
If none of the above solves your problem, follow the instructions in
[Debugging Service document](/docs/tasks/debug-application-cluster/debug-service/)
[Debugging Service document](/docs/tasks/debug/debug-application/debug-service/)
to make sure that your `Service` is running, has `Endpoints`, and your `Pods` are
actually serving; you have DNS working, iptables rules installed, and kube-proxy
does not seem to be misbehaving.
You may also visit [troubleshooting document](/docs/tasks/debug-application-cluster/troubleshooting/) for more information.
You may also visit the [troubleshooting document](/docs/tasks/debug/overview/) for more information.

View File

@ -1,21 +1,25 @@
---
reviewers:
- janetkuo
- thockin
content_type: concept
title: Application Introspection and Debugging
- verb
- soltysh
title: Debug Running Pods
content_type: task
---
<!-- overview -->
Once your application is running, you'll inevitably need to debug problems with it.
Earlier we described how you can use `kubectl get pods` to retrieve simple status information about
your pods. But there are a number of ways to get even more information about your application.
This page explains how to debug Pods running (or crashing) on a Node.
## {{% heading "prerequisites" %}}
<!-- body -->
* Your {{< glossary_tooltip text="Pod" term_id="pod" >}} should already be
scheduled and running. If your Pod is not yet running, start with [Debugging
Pods](/docs/tasks/debug/debug-application/).
* For some of the advanced debugging steps you need to know on which Node the
Pod is running and have shell access to run commands on that Node. You don't
need that access to run the standard debug steps that use `kubectl`.
## Using `kubectl describe pod` to fetch details about pods
@ -125,6 +129,7 @@ Currently the only Condition associated with a Pod is the binary Ready condition
Lastly, you see a log of recent events related to your Pod. The system compresses multiple identical events by indicating the first and last time it was seen and the number of times it was seen. "From" indicates the component that is logging the event, "SubobjectPath" tells you which object (e.g. container within the pod) is being referred to, and "Reason" and "Message" tell you what happened.
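If you want to see recent events for the whole namespace rather than the per-object view from `kubectl describe`, one option is to list them sorted by time (a quick sketch):
```shell
kubectl get events --sort-by=.metadata.creationTimestamp
```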
## Example: debugging Pending Pods
A common scenario that you can detect using events is when you've created a Pod that won't fit on any node. For example, the Pod might request more resources than are free on any node, or it might specify a label selector that doesn't match any nodes. Let's say we created the previous Deployment with 5 replicas (instead of 2) and requesting 600 millicores instead of 500, on a four-node cluster where each (virtual) machine has 1 CPU. In that case one of the Pods will not be able to schedule. (Note that because of the cluster addon pods such as fluentd, skydns, etc., that run on each node, if we requested 1000 millicores then none of the Pods would be able to schedule.)
@ -326,197 +331,308 @@ status:
startTime: "2022-02-17T21:51:01Z"
```
## Example: debugging a down/unreachable node
## Examining pod logs {#examine-pod-logs}
Sometimes when debugging it can be useful to look at the status of a node -- for example, because you've noticed strange behavior of a Pod that's running on the node, or to find out why a Pod won't schedule onto the node. As with Pods, you can use `kubectl describe node` and `kubectl get node -o yaml` to retrieve detailed information about nodes. For example, here's what you'll see if a node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).
First, look at the logs of the affected container:
```shell
kubectl get nodes
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
```
```none
NAME STATUS ROLES AGE VERSION
kube-worker-1 NotReady <none> 1h v1.23.3
kubernetes-node-bols Ready <none> 1h v1.23.3
kubernetes-node-st6x Ready <none> 1h v1.23.3
kubernetes-node-unaj Ready <none> 1h v1.23.3
```
If your container has previously crashed, you can access the previous container's crash log with:
```shell
kubectl describe node kube-worker-1
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
```
```none
Name: kube-worker-1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=kube-worker-1
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 17 Feb 2022 16:46:30 -0500
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: kube-worker-1
AcquireTime: <unset>
RenewTime: Thu, 17 Feb 2022 17:13:09 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 17 Feb 2022 17:09:13 -0500 Thu, 17 Feb 2022 17:09:13 -0500 WeaveIsUp Weave pod has set this
MemoryPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.0.113
Hostname: kube-worker-1
Capacity:
cpu: 2
ephemeral-storage: 15372232Ki
hugepages-2Mi: 0
memory: 2025188Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 14167048988
hugepages-2Mi: 0
memory: 1922788Ki
pods: 110
System Info:
Machine ID: 9384e2927f544209b5d7b67474bbf92b
System UUID: aa829ca9-73d7-064d-9019-df07404ad448
Boot ID: 5a295a03-aaca-4340-af20-1327fa5dab5c
Kernel Version: 5.13.0-28-generic
OS Image: Ubuntu 21.10
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.9
Kubelet Version: v1.23.3
Kube-Proxy Version: v1.23.3
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-deployment-67d4bdd6f5-cx2nz 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
default nginx-deployment-67d4bdd6f5-w6kd7 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
kube-system kube-proxy-dnxbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system weave-net-gjxxp 100m (5%) 0 (0%) 200Mi (10%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1100m (55%) 1 (50%)
memory 456Mi (24%) 256Mi (13%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
## Debugging with container exec {#container-exec}
If the {{< glossary_tooltip text="container image" term_id="image" >}} includes
debugging utilities, as is the case with images built from Linux and Windows OS
base images, you can run commands inside a specific container with
`kubectl exec`:
```shell
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
```
{{< note >}}
`-c ${CONTAINER_NAME}` is optional. You can omit it for Pods that only contain a single container.
{{< /note >}}
As an example, to look at the logs from a running Cassandra pod, you might run
```shell
kubectl exec cassandra -- cat /var/log/cassandra/system.log
```
You can run a shell that's connected to your terminal using the `-i` and `-t`
arguments to `kubectl exec`, for example:
```shell
kubectl exec -it cassandra -- sh
```
For more details, see [Get a Shell to a Running Container](
/docs/tasks/debug/debug-application/get-shell-running-container/).
## Debugging with an ephemeral debug container {#ephemeral-container}
{{< feature-state state="beta" for_k8s_version="v1.23" >}}
{{< glossary_tooltip text="Ephemeral containers" term_id="ephemeral-container" >}}
are useful for interactive troubleshooting when `kubectl exec` is insufficient
because a container has crashed or a container image doesn't include debugging
utilities, such as with [distroless images](
https://github.com/GoogleContainerTools/distroless).
### Example debugging using ephemeral containers {#ephemeral-container-example}
You can use the `kubectl debug` command to add ephemeral containers to a
running Pod. First, create a pod for the example:
```shell
kubectl run ephemeral-demo --image=k8s.gcr.io/pause:3.1 --restart=Never
```
The examples in this section use the `pause` container image because it does not
contain debugging utilities, but this method works with all container
images.
If you attempt to use `kubectl exec` to create a shell you will see an error
because there is no shell in this container image.
```shell
kubectl exec -it ephemeral-demo -- sh
```
```
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
```
You can instead add a debugging container using `kubectl debug`. If you
specify the `-i`/`--interactive` argument, `kubectl` will automatically attach
to the console of the Ephemeral Container.
```shell
kubectl debug -it ephemeral-demo --image=busybox:1.28 --target=ephemeral-demo
```
```
Defaulting debug container name to debugger-8xzrl.
If you don't see a command prompt, try pressing enter.
/ #
```
This command adds a new busybox container and attaches to it. The `--target`
parameter targets the process namespace of another container. It's necessary
here because `kubectl run` does not enable [process namespace sharing](
/docs/tasks/configure-pod-container/share-process-namespace/) in the pod it
creates.
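If you create the Pod from a manifest instead of `kubectl run`, you can enable process namespace sharing directly, in which case `--target` is not needed; a minimal sketch (the Pod name `shared-pid-demo` is illustrative):
```shell
# Illustrative only: a Pod whose containers (and ephemeral debug containers) share one PID namespace
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-demo
spec:
  shareProcessNamespace: true
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "86400"]
EOF
```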
{{< note >}}
The `--target` parameter must be supported by the {{< glossary_tooltip
text="Container Runtime" term_id="container-runtime" >}}. When not supported,
the Ephemeral Container may not be started, or it may be started with an
isolated process namespace so that `ps` does not reveal processes in other
containers.
{{< /note >}}
You can view the state of the newly created ephemeral container using `kubectl describe`:
```shell
kubectl describe pod ephemeral-demo
```
```
...
Ephemeral Containers:
debugger-8xzrl:
Container ID: docker://b888f9adfd15bd5739fefaa39e1df4dd3c617b9902082b1cfdc29c4028ffb2eb
Image: busybox
Image ID: docker-pullable://busybox@sha256:1828edd60c5efd34b2bf5dd3282ec0cc04d47b2ff9caa0b6d4f07a21d1c08084
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 12 Feb 2020 14:25:42 +0100
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
...
```
Use `kubectl delete` to remove the Pod when you're finished:
```shell
kubectl get node kube-worker-1 -o yaml
kubectl delete pod ephemeral-demo
```
```yaml
apiVersion: v1
kind: Node
metadata:
annotations:
kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2022-02-17T21:46:30Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: kube-worker-1
kubernetes.io/os: linux
name: kube-worker-1
resourceVersion: "4026"
uid: 98efe7cb-2978-4a0b-842a-1a7bf12c05f8
spec: {}
status:
addresses:
- address: 192.168.0.113
type: InternalIP
- address: kube-worker-1
type: Hostname
allocatable:
cpu: "2"
ephemeral-storage: "14167048988"
hugepages-2Mi: "0"
memory: 1922788Ki
pods: "110"
capacity:
cpu: "2"
ephemeral-storage: 15372232Ki
hugepages-2Mi: "0"
memory: 2025188Ki
pods: "110"
conditions:
- lastHeartbeatTime: "2022-02-17T22:20:32Z"
lastTransitionTime: "2022-02-17T22:20:32Z"
message: Weave pod has set this
reason: WeaveIsUp
status: "False"
type: NetworkUnavailable
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:15:15Z"
message: kubelet is posting ready status. AppArmor enabled
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
nodeInfo:
architecture: amd64
bootID: 22333234-7a6b-44d4-9ce1-67e31dc7e369
containerRuntimeVersion: containerd://1.5.9
kernelVersion: 5.13.0-28-generic
kubeProxyVersion: v1.23.3
kubeletVersion: v1.23.3
machineID: 9384e2927f544209b5d7b67474bbf92b
operatingSystem: linux
osImage: Ubuntu 21.10
systemUUID: aa829ca9-73d7-064d-9019-df07404ad448
## Debugging using a copy of the Pod
Sometimes Pod configuration options make it difficult to troubleshoot in certain
situations. For example, you can't run `kubectl exec` to troubleshoot your
container if your container image does not include a shell or if your application
crashes on startup. In these situations you can use `kubectl debug` to create a
copy of the Pod with configuration values changed to aid debugging.
### Copying a Pod while adding a new container
Adding a new container can be useful when your application is running but not
behaving as you expect and you'd like to add additional troubleshooting
utilities to the Pod.
For example, maybe your application's container images are built on `busybox`
but you need debugging utilities not included in `busybox`. You can simulate
this scenario using `kubectl run`:
```shell
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
```
Run this command to create a copy of `myapp` named `myapp-debug` that adds a
new Ubuntu container for debugging:
## {{% heading "whatsnext" %}}
```shell
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
```
```
Defaulting debug container name to debugger-w7xmf.
If you don't see a command prompt, try pressing enter.
root@myapp-debug:/#
```
Learn about additional debugging tools, including:
{{< note >}}
* `kubectl debug` automatically generates a container name if you don't choose
one using the `--container` flag.
* The `-i` flag causes `kubectl debug` to attach to the new container by
default. You can prevent this by specifying `--attach=false`. If your session
becomes disconnected you can reattach using `kubectl attach`.
* The `--share-processes` flag allows the containers in this Pod to see processes
from the other containers in the Pod. For more information about how this
works, see [Share Process Namespace between Containers in a Pod](
/docs/tasks/configure-pod-container/share-process-namespace/).
{{< /note >}}
* [Logging](/docs/concepts/cluster-administration/logging/)
* [Monitoring](/docs/tasks/debug-application-cluster/resource-usage-monitoring/)
* [Getting into containers via `exec`](/docs/tasks/debug-application-cluster/get-shell-running-container/)
* [Connecting to containers via proxies](/docs/tasks/extend-kubernetes/http-proxy-access-api/)
* [Connecting to containers via port forwarding](/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
* [Inspect Kubernetes node with crictl](/docs/tasks/debug-application-cluster/crictl/)
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
### Copying a Pod while changing its command
Sometimes it's useful to change the command for a container, for example to
add a debugging flag or because the application is crashing.
To simulate a crashing application, use `kubectl run` to create a container
that immediately exits:
```
kubectl run --image=busybox:1.28 myapp -- false
```
You can see using `kubectl describe pod myapp` that this container is crashing:
```
Containers:
myapp:
Image: busybox
...
Args:
false
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
```
You can use `kubectl debug` to create a copy of this Pod with the command
changed to an interactive shell:
```
kubectl debug myapp -it --copy-to=myapp-debug --container=myapp -- sh
```
```
If you don't see a command prompt, try pressing enter.
/ #
```
Now you have an interactive shell that you can use to perform tasks like
checking filesystem paths or running the container command manually.
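For instance, sticking with this example (whose container command is `false`), you might re-run the command by hand and confirm the exit code that caused the crash loop:
```shell
# Inside the myapp-debug shell
false; echo $?   # prints 1, matching the Exit Code seen in kubectl describe
ls /             # check that the filesystem looks the way the application expects
```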
{{< note >}}
* To change the command of a specific container you must
specify its name using `--container` or `kubectl debug` will instead
create a new container to run the command you specified.
* The `-i` flag causes `kubectl debug` to attach to the container by default.
You can prevent this by specifying `--attach=false`. If your session becomes
disconnected you can reattach using `kubectl attach`.
{{< /note >}}
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
### Copying a Pod while changing container images
In some situations you may want to change a misbehaving Pod from its normal
production container images to an image containing a debugging build or
additional utilities.
As an example, create a Pod using `kubectl run`:
```
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
```
Now use `kubectl debug` to make a copy and change its container image
to `ubuntu`:
```
kubectl debug myapp --copy-to=myapp-debug --set-image=*=ubuntu
```
The syntax of `--set-image` uses the same `container_name=image` syntax as
`kubectl set image`. `*=ubuntu` means change the image of all containers
to `ubuntu`.
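To change only one container, name it explicitly instead of using `*`; for example, to swap just the `myapp` container's image while leaving any others untouched:
```shell
kubectl debug myapp --copy-to=myapp-debug --set-image=myapp=ubuntu
```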
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod myapp myapp-debug
```
## Debugging via a shell on the node {#node-shell-session}
If none of these approaches work, you can find the Node on which the Pod is
running and create a privileged Pod running in the host namespaces. To create
an interactive shell on a node using `kubectl debug`, run:
```shell
kubectl debug node/mynode -it --image=ubuntu
```
```
Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node mynode.
If you don't see a command prompt, try pressing enter.
root@ek8s:/#
```
When creating a debugging session on a node, keep in mind that:
* `kubectl debug` automatically generates the name of the new Pod based on
the name of the Node.
* The container runs in the host IPC, Network, and PID namespaces.
* The root filesystem of the Node will be mounted at `/host` (see the sketch after this list).
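For example, once the shell is attached you can inspect the Node's own filesystem through `/host`; a short sketch (assumes the node image ships `/bin/bash` — fall back to `sh` otherwise):
```shell
# Inside the node debugging shell
ls /host/var/log        # read node-level logs
chroot /host /bin/bash  # optionally work as if logged in to the node directly
```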
Don't forget to clean up the debugging Pod when you're finished with it:
```shell
kubectl delete pod node-debugger-mynode-pdx84
```

View File

@ -4,6 +4,7 @@ reviewers:
- bowei
content_type: concept
title: Debug Services
weight: 20
---
<!-- overview -->
@ -441,7 +442,7 @@ they are running fine and not crashing.
The "RESTARTS" column says that these pods are not crashing frequently or being
restarted. Frequent restarts could lead to intermittent connectivity issues.
If the restart count is high, read more about how to [debug pods](/docs/tasks/debug-application-cluster/debug-pod-replication-controller/#debugging-pods).
If the restart count is high, read more about how to [debug pods](/docs/tasks/debug/debug-application/debug-pods).
Inside the Kubernetes system is a control loop which evaluates the selector of
every Service and saves the results into a corresponding Endpoints object.
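For example, you can check whether that Endpoints object has been populated for the Service you are debugging (`${SERVICE_NAME}` being whatever Service you are looking at):
```shell
kubectl get endpoints ${SERVICE_NAME}
```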
@ -727,13 +728,13 @@ Service is not working. Please let us know what is going on, so we can help
investigate!
Contact us on
[Slack](/docs/tasks/debug-application-cluster/troubleshooting/#slack) or
[Slack](/docs/tasks/debug/overview/#slack) or
[Forum](https://discuss.kubernetes.io) or
[GitHub](https://github.com/kubernetes/kubernetes).
## {{% heading "whatsnext" %}}
Visit [troubleshooting document](/docs/tasks/debug-application-cluster/troubleshooting/)
Visit the [troubleshooting overview document](/docs/tasks/debug/overview/)
for more information.

View File

@ -9,6 +9,7 @@ reviewers:
- smarterclayton
title: Debug a StatefulSet
content_type: task
weight: 30
---
<!-- overview -->
@ -34,9 +35,9 @@ If you find that any Pods listed are in `Unknown` or `Terminating` state for an
refer to the [Deleting StatefulSet Pods](/docs/tasks/run-application/delete-stateful-set/) task for
instructions on how to deal with them.
You can debug individual Pods in a StatefulSet using the
[Debugging Pods](/docs/tasks/debug-application-cluster/debug-pod-replication-controller/) guide.
[Debugging Pods](/docs/tasks/debug/debug-application/debug-pods/) guide.
## {{% heading "whatsnext" %}}
Learn more about [debugging an init-container](/docs/tasks/debug-application-cluster/debug-init-containers/).
Learn more about [debugging an init-container](/docs/tasks/debug/debug-application/debug-init-containers/).

View File

@ -0,0 +1,316 @@
---
reviewers:
- davidopp
title: "Troubleshooting Clusters"
description: Debugging common cluster issues.
weight: 20
no_list: true
---
<!-- overview -->
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the
problem you are experiencing. See
the [application troubleshooting guide](/docs/tasks/debug/debug-application/) for tips on application debugging.
You may also visit the [troubleshooting overview document](/docs/tasks/debug/) for more information.
<!-- body -->
## Listing your cluster
The first thing to debug in your cluster is whether your nodes are all registered correctly.
Run the following command:
```shell
kubectl get nodes
```
Verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
To get detailed information about the overall health of your cluster, you can run:
```shell
kubectl cluster-info dump
```
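The dump can be large; if you would rather capture it as files for later inspection, `kubectl cluster-info dump` can write to a directory (the path below is only an example):
```shell
kubectl cluster-info dump --output-directory=/tmp/cluster-state
```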
### Example: debugging a down/unreachable node
Sometimes when debugging it can be useful to look at the status of a node -- for example, because you've noticed strange behavior of a Pod that's running on the node, or to find out why a Pod won't schedule onto the node. As with Pods, you can use `kubectl describe node` and `kubectl get node -o yaml` to retrieve detailed information about nodes. For example, here's what you'll see if a node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).
```shell
kubectl get nodes
```
```none
NAME STATUS ROLES AGE VERSION
kube-worker-1 NotReady <none> 1h v1.23.3
kubernetes-node-bols Ready <none> 1h v1.23.3
kubernetes-node-st6x Ready <none> 1h v1.23.3
kubernetes-node-unaj Ready <none> 1h v1.23.3
```
```shell
kubectl describe node kube-worker-1
```
```none
Name: kube-worker-1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=kube-worker-1
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 17 Feb 2022 16:46:30 -0500
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: kube-worker-1
AcquireTime: <unset>
RenewTime: Thu, 17 Feb 2022 17:13:09 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 17 Feb 2022 17:09:13 -0500 Thu, 17 Feb 2022 17:09:13 -0500 WeaveIsUp Weave pod has set this
MemoryPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.0.113
Hostname: kube-worker-1
Capacity:
cpu: 2
ephemeral-storage: 15372232Ki
hugepages-2Mi: 0
memory: 2025188Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 14167048988
hugepages-2Mi: 0
memory: 1922788Ki
pods: 110
System Info:
Machine ID: 9384e2927f544209b5d7b67474bbf92b
System UUID: aa829ca9-73d7-064d-9019-df07404ad448
Boot ID: 5a295a03-aaca-4340-af20-1327fa5dab5c
Kernel Version: 5.13.0-28-generic
OS Image: Ubuntu 21.10
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.9
Kubelet Version: v1.23.3
Kube-Proxy Version: v1.23.3
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-deployment-67d4bdd6f5-cx2nz 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
default nginx-deployment-67d4bdd6f5-w6kd7 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
kube-system kube-proxy-dnxbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system weave-net-gjxxp 100m (5%) 0 (0%) 200Mi (10%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1100m (55%) 1 (50%)
memory 456Mi (24%) 256Mi (13%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
...
```
```shell
kubectl get node kube-worker-1 -o yaml
```
```yaml
apiVersion: v1
kind: Node
metadata:
annotations:
kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2022-02-17T21:46:30Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: kube-worker-1
kubernetes.io/os: linux
name: kube-worker-1
resourceVersion: "4026"
uid: 98efe7cb-2978-4a0b-842a-1a7bf12c05f8
spec: {}
status:
addresses:
- address: 192.168.0.113
type: InternalIP
- address: kube-worker-1
type: Hostname
allocatable:
cpu: "2"
ephemeral-storage: "14167048988"
hugepages-2Mi: "0"
memory: 1922788Ki
pods: "110"
capacity:
cpu: "2"
ephemeral-storage: 15372232Ki
hugepages-2Mi: "0"
memory: 2025188Ki
pods: "110"
conditions:
- lastHeartbeatTime: "2022-02-17T22:20:32Z"
lastTransitionTime: "2022-02-17T22:20:32Z"
message: Weave pod has set this
reason: WeaveIsUp
status: "False"
type: NetworkUnavailable
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:13:25Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2022-02-17T22:20:15Z"
lastTransitionTime: "2022-02-17T22:15:15Z"
message: kubelet is posting ready status. AppArmor enabled
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
nodeInfo:
architecture: amd64
bootID: 22333234-7a6b-44d4-9ce1-67e31dc7e369
containerRuntimeVersion: containerd://1.5.9
kernelVersion: 5.13.0-28-generic
kubeProxyVersion: v1.23.3
kubeletVersion: v1.23.3
machineID: 9384e2927f544209b5d7b67474bbf92b
operatingSystem: linux
osImage: Ubuntu 21.10
systemUUID: aa829ca9-73d7-064d-9019-df07404ad448
```
## Looking at logs
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
of the relevant log files. On systemd-based systems, you may need to use `journalctl` instead of examining log files.
### Control Plane nodes
* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
* `/var/log/kube-controller-manager.log` - Controller Manager, a component that runs most Kubernetes built-in {{< glossary_tooltip text="controllers" term_id="controller" >}}, with the notable exception of scheduling (the kube-scheduler handles scheduling).
### Worker Nodes
* `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
* `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
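On systemd-based systems where these components log to the journal instead of files under `/var/log`, you can read the same logs with `journalctl`; a minimal sketch, assuming the kubelet runs as the `kubelet` systemd unit:
```shell
journalctl -u kubelet --since "1 hour ago"
```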
## Cluster failure modes
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
### Contributing causes
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software
### Specific scenarios
- API server VM shutdown or apiserver crashing
- Results
- unable to stop, update, or start new pods, services, or replication controllers
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- API server backing storage lost
- Results
- the kube-apiserver component fails to start successfully and become healthy
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
- manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
- currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
- in future, these will be replicated as well and may not be co-located
- they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
- Results
- pods on that Node stop running
- Network partition
- Results
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the control plane VM ends up in partition A.)
- Kubelet software fault
- Results
- crashing kubelet cannot start new pods on the node
- kubelet might delete the pods or not
- node marked unhealthy
- replication controllers start new pods elsewhere
- Cluster operator error
- Results
- loss of pods, services, etc
- loss of apiserver backing store
- users unable to read API
- etc.
### Mitigations
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
- Mitigates: Apiserver VM shutdown or apiserver crashing
- Mitigates: Supporting services VM shutdown or crashes
- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
- Mitigates: Apiserver backing storage lost
- Action: Use [high-availability](/docs/setup/production-environment/tools/kubeadm/high-availability/) configuration
- Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
- Will tolerate one or more simultaneous node or component failures
- Mitigates: API server backing storage (i.e., etcd's data directory) lost
- Assumes HA (highly-available) etcd configuration
- Action: Snapshot apiserver PDs/EBS-volumes periodically (see the `etcdctl` sketch after this list)
- Mitigates: Apiserver backing storage lost
- Mitigates: Some cases of operator error
- Mitigates: Some cases of Kubernetes software fault
- Action: use replication controllers and services in front of pods
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
- Action: applications (containers) designed to tolerate unexpected restarts
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
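As a sketch of the snapshot action above, when etcd is the backing store you can take a snapshot with `etcdctl`; the endpoint and certificate paths below assume a kubeadm-style layout and will differ on other clusters:
```shell
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db
```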
## {{% heading "whatsnext" %}}
* Learn about the metrics available in the [Resource Metrics Pipeline](resource-metrics-pipeline)
* Discover additional tools for [monitoring resource usage](resource-usage-monitoring)
* Use Node Problem Detector to [monitor node health](monitor-node-health)
* Use `crictl` to [debug Kubernetes nodes](crictl)
* Get more information about [Kubernetes auditing](audit)
* Use `telepresence` to [develop and debug services locally](local-debugging)

View File

@ -5,6 +5,7 @@ reviewers:
- mrunalp
title: Debugging Kubernetes nodes with crictl
content_type: task
weight: 30
---

View File

@ -1,5 +1,5 @@
---
title: Developing and debugging services locally
title: Developing and debugging services locally using telepresence
content_type: task
---
@ -58,4 +58,4 @@ Telepresence installs a traffic-agent sidecar next to your existing application'
If you're interested in a hands-on tutorial, check out [this tutorial](https://cloud.google.com/community/tutorials/developing-services-with-k8s) that walks through locally developing the Guestbook application on Google Kubernetes Engine.
For further reading, visit the [Telepresence website](https://www.telepresence.io).
For further reading, visit the [Telepresence website](https://www.telepresence.io).

View File

@ -4,6 +4,7 @@ content_type: task
reviewers:
- Random-Liu
- dchen1107
weight: 20
---
<!-- overview -->

View File

@ -4,6 +4,7 @@ reviewers:
- piosz
title: Resource metrics pipeline
content_type: concept
weight: 15
---
<!-- overview -->

View File

@ -3,6 +3,7 @@ reviewers:
- mikedanese
content_type: concept
title: Tools for Monitoring Resources
weight: 15
---
<!-- overview -->
@ -58,4 +59,14 @@ then exposes them to Kubernetes via an adapter by implementing either the
[Prometheus](https://prometheus.io), a CNCF project, can natively monitor Kubernetes, nodes, and Prometheus itself.
Full metrics pipeline projects that are not part of the CNCF are outside the scope of Kubernetes documentation.
## {{% heading "whatsnext" %}}
Learn about additional debugging tools, including:
* [Logging](/docs/concepts/cluster-administration/logging/)
* [Monitoring](/docs/tasks/debug-application-cluster/resource-usage-monitoring/)
* [Getting into containers via `exec`](/docs/tasks/debug-application-cluster/applications/get-shell-running-container/)
* [Connecting to containers via proxies](/docs/tasks/extend-kubernetes/http-proxy-access-api/)
* [Connecting to containers via port forwarding](/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
* [Inspect Kubernetes node with crictl](/docs/tasks/debug-application-cluster/monitoring/crictl/)