Update troubleshooting topics (#140)

* Update troubleshooting topics

* Trim log dump
Jim Galasyn 2017-07-14 15:11:00 -07:00
parent a866583961
commit da71e90152
8 changed files with 99 additions and 62 deletions

View File

@ -1,8 +1,7 @@
---
title: Monitor the swarm status
description: Monitor your Docker Universal Control Plane installation, and learn how to troubleshoot it.
keywords: UCP, troubleshoot, health, swarm
---
You can monitor the status of UCP by using the web UI or the CLI.
@ -10,7 +9,7 @@ You can also use the `_ping` endpoint to build monitoring automation.
## Check status from the UI
The first place to check the status of UCP is the UCP web UI, since it
shows warnings for situations that require your immediate attention.
Administrators might see more warnings than regular users.
@ -22,7 +21,11 @@ managed by UCP are healthy or not.
![UCP dashboard](../../images/monitor-ucp-1.png){: .with-border}
Each node has a status message explaining any problems with the node.
In this example, a Windows worker node is down.
[Learn more about node status](troubleshoot-node-messages.md).
Click the node to get more info on its status. In the details pane, click
**Actions** and select **Agent logs** to see the log entries from the
node.
## Check status from the CLI
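As a starting point, a minimal health probe from the CLI might combine `docker node ls`, run with an admin client bundle loaded, with the `_ping` endpoint mentioned above for monitoring automation. This is an illustrative sketch; the manager address is a placeholder you need to replace.

```bash
# List the nodes in the swarm and their status (requires an admin client bundle).
$ docker node ls

# Probe a single manager directly. Replace the placeholder URL with your own
# UCP manager address; a healthy manager returns HTTP 200.
$ curl -k -sS -o /dev/null -w "%{http_code}\n" https://ucp.example.com/_ping
200
```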

View File

@ -1,14 +1,14 @@
---
title: Troubleshoot swarm configurations
description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
keywords: troubleshoot, etcd, rethinkdb, key, value, store, database, ucp, health, swarm
---
UCP automatically tries to heal itself by monitoring its internal
components and trying to bring them to a healthy state.
In most cases, if a single UCP component is persistently in a failed state,
you should be able to restore the cluster to a healthy state by
removing the unhealthy node from the cluster and joining it again.
[Learn how to remove and join nodes](../configure/scale-your-cluster.md).
@ -16,10 +16,11 @@ removing the unhealthy node from the cluster and joining it again.
UCP persists configuration data on an [etcd](https://coreos.com/etcd/)
key-value store and [RethinkDB](https://rethinkdb.com/) database that are
replicated on all manager nodes of the UCP swarm. These data stores are for
internal use only and should not be used by other applications.
### With the HTTP API
In this example, we'll use `curl` to make requests to the key-value
store REST API, and `jq` to process the responses.
@ -32,18 +33,19 @@ $ sudo apt-get update && apt-get install curl jq
1. Use a client bundle to authenticate your requests.
[Learn more](../../user/access-ucp/cli-based-access.md).
2. Use the REST API to access the cluster configurations. The `$DOCKER_HOST`
and `$DOCKER_CERT_PATH` environment variables are set when using the client
bundle.
```bash
$ export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"

$ curl -s \
    --cert ${DOCKER_CERT_PATH}/cert.pem \
    --key ${DOCKER_CERT_PATH}/key.pem \
    --cacert ${DOCKER_CERT_PATH}/ca.pem \
    ${KV_URL}/v2/keys | jq "."
```
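The same request pattern works for a specific key path, which can be handy when you only want to inspect one part of the configuration. A sketch; the `docker/nodes` path is illustrative and may not exist in your deployment:

```bash
# Query a single key recursively instead of the full key listing.
$ curl -s \
    --cert ${DOCKER_CERT_PATH}/cert.pem \
    --key ${DOCKER_CERT_PATH}/key.pem \
    --cacert ${DOCKER_CERT_PATH}/ca.pem \
    "${KV_URL}/v2/keys/docker/nodes?recursive=true" | jq "."
```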
To learn more about the key-value store REST API, check the
[etcd official documentation](https://coreos.com/etcd/docs/latest/).
@ -69,15 +71,16 @@ member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.
cluster is healthy
```
On failure, the command exits with an error code and no output.
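Since the exit code is the only failure signal, a monitoring script can branch on it directly. The following is a sketch; the `ucp-kv` container name, the endpoint, and the certificate paths are assumptions based on a default UCP install, so adjust them for your environment.

```bash
# Run the health check on a manager node and act on the exit code.
if docker exec ucp-kv etcdctl \
    --endpoint https://127.0.0.1:2379 \
    --ca-file /etc/docker/ssl/ca.pem \
    --cert-file /etc/docker/ssl/cert.pem \
    --key-file /etc/docker/ssl/key.pem \
    cluster-health > /dev/null 2>&1; then
  echo "etcd reports a healthy cluster"
else
  echo "etcd health check failed (non-zero exit code)" >&2
fi
```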
To learn more about the `etcdctl` utility, check the
[etcd official documentation](https://coreos.com/etcd/docs/latest/).
## RethinkDB Database
User and organization data for Docker Enterprise Edition is stored in a
RethinkDB database which is replicated across all manager nodes in the UCP
swarm.
Replication and failover of this database are typically handled automatically by
UCP's own configuration management processes, but detailed database status and
@ -98,6 +101,23 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# in the RethinkDB cluster.
docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status
{% endraw %}
Server Status: [
  {
    "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
    "Name": "ucp_auth_store_192_168_1_25",
    "Network": {
      "CanonicalAddresses": [
        {
          "Host": "192.168.1.25",
          "Port": 12384
        }
      ],
      "TimeConnected": "2017-07-14T17:21:44.198Z"
    }
  }
]
...
```
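The `${VERSION}` and `${NODE_ADDRESS}` variables used by the command above are defined earlier in the procedure, outside this excerpt. As a sketch of how they might be set on a manager node (these assignments are assumptions; adjust them for your environment):

```bash
# Address of the current manager node, taken from the local engine's swarm info.
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')

# Tag of the docker/ucp-auth image that is available locally.
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)

# Number of manager nodes in the swarm, used by the reconfigure-db command below.
NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
```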
### Manually reconfigure database replication
@ -114,6 +134,13 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# number of replicas equal to the number of manager nodes in the cluster.
docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS} --emergency-repair
{% endraw %}
time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Emergency Repairing Tables..."
time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Emergency Repaired Table \"grant_objects\""
...
```
## Where to go next

View File

@ -1,13 +1,13 @@
---
title: Troubleshoot UCP node states
description: Learn how to troubleshoot individual UCP nodes.
keywords: UCP, troubleshoot, health, swarm
---
There are several cases in the lifecycle of UCP when a node is actively
transitioning from one state to another, such as when a new node is joining the
swarm or during node promotion and demotion. In these cases, the current step
of the transition will be reported by UCP as a node message. You can view the
state of each individual node by following the same steps required to [monitor
cluster status](index.md).
@ -19,11 +19,11 @@ UCP node, their explanation, and the expected duration of a given step.
| Message | Description | Typical step duration |
|:-----------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------|
| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
| The ucp-agent task is <state> | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP swarm. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
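For the duplicate Engine ID case in the last row, a quick way to compare IDs and force a new one is sketched below; the `systemctl` command assumes a systemd-based Linux host.

```bash
# Print the Engine ID on each suspect node and compare the values.
$ docker info --format '{{.ID}}'

# On the node with the duplicate ID, remove the key file so a new ID is
# generated, then restart the daemon.
$ sudo rm /etc/docker/key.json
$ sudo systemctl restart docker
```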

View File

@ -1,7 +1,7 @@
---
title: Troubleshoot your cluster
description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
keywords: ucp, troubleshoot, health, swarm
---
If you detect problems in your UCP cluster, you can start your troubleshooting
@ -12,14 +12,13 @@ see information about UCP system containers.
## Check the logs from the UI
To see the logs of the UCP system containers, navigate to the **Containers**
page of UCP. By default, the UCP system containers are hidden. Click
**Settings** and check **Show system containers** for the UCP system containers
to be listed as well.
![](../../images/troubleshoot-with-logs-1.png){: .with-border}
Click on a container to see more details, like its configurations and logs.
## Check the logs from the CLI
@ -29,28 +28,35 @@ specially useful if the UCP web application is not working.
1. Get a client certificate bundle.
When using the Docker CLI client, you need to authenticate using client
certificates.
[Learn how to use client certificates](../../user/access-ucp/cli-based-access.md).
If your client certificate bundle is for a non-admin user, you won't have
permissions to see the UCP system containers.
2. Check the logs of UCP system containers. By default, system containers
aren't displayed. Use the `-a` flag to display them.
```bash
$ docker ps -a
CONTAINER ID   IMAGE                                       COMMAND                  CREATED       STATUS                   PORTS                   NAMES
8b77cfa87889   dockerorcadev/ucp-agent:2.2.0-latest        "/bin/ucp-agent re..."   3 hours ago   Exited (0) 3 hours ago                           ucp-reconcile
b844cf76a7a5   dockerorcadev/ucp-agent:2.2.0-latest        "/bin/ucp-agent agent"   3 hours ago   Up 3 hours               2376/tcp                ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
de5b45871acb   dockerorcadev/ucp-controller:2.2.0-latest   "/bin/controller s..."   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:443->8080/tcp   ucp-controller
...
```
3. Get the log from a UCP container by using the `docker logs <ucp container ID>`
command. For example, the following command emits the log for the
`ucp-controller` container listed above.
```bash
$ docker logs de5b45871acb
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
```
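When you're watching a problem as it happens, it can help to tail a component's log instead of dumping it all at once. A sketch using standard `docker logs` options; the container name is the one listed by `docker ps -a` above:

```bash
# Follow recent controller output with timestamps.
$ docker logs --timestamps --since 15m --follow ucp-controller
```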
## Get a support dump
@ -64,30 +70,31 @@ the status of the UCP cluster. Changing the UCP log level restarts all UCP
system components and introduces a small downtime window to UCP. Your
applications won't be affected by this.
To increase the UCP log level, navigate to the UCP web UI, go to the
**Admin Settings** tab, and choose **Logs**.
![](../../images/troubleshoot-with-logs-2.png){: .with-border}
Once you change the log level to **Debug**, the UCP containers restart.
Now that the UCP components are creating more descriptive logs, you can
download a support dump and use it to troubleshoot the component causing the
problem.
Depending on the problem you're experiencing, it's more likely that you'll
find related messages in the logs of specific components on manager nodes:
* If the problem occurs after a node was added or removed, check the logs
  of the `ucp-reconcile` container.
* If the problem occurs in the normal state of the system, check the logs
  of the `ucp-controller` container.
* If you are able to visit the UCP web UI but unable to log in, check the
  logs of the `ucp-auth-api` and `ucp-auth-store` containers.
It's normal for the `ucp-reconcile` container to be in a stopped state. This
container starts only when the `ucp-agent` detects that a node needs to
transition to a different state. The `ucp-reconcile` container is responsible
for creating and removing containers, issuing certificates, and pulling
missing images.
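Because the logs of an exited container remain available, you can still inspect the last reconciliation run. A minimal sketch:

```bash
# The ucp-reconcile container is normally stopped, but its last run's output
# can still be retrieved.
$ docker logs ucp-reconcile
```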
## Where to go next

Binary file not shown. Before: 242 KiB, After: 72 KiB

Binary file not shown. Before: 257 KiB, After: 80 KiB

Binary file not shown. Before: 113 KiB, After: 120 KiB

Binary file not shown. Before: 223 KiB, After: 61 KiB