Update troubleshooting topics (#140)

* Update troubleshooting topics

* Trim log dump
Jim Galasyn 2017-07-14 15:11:00 -07:00
parent a866583961
commit da71e90152
8 changed files with 99 additions and 62 deletions

View File

@ -1,8 +1,7 @@
---
title: Monitor the swarm status
description: Monitor your Docker Universal Control Plane installation, and learn how to troubleshoot it.
keywords: UCP, troubleshoot, health, swarm
---
You can monitor the status of UCP by using the web UI or the CLI.
@ -10,7 +9,7 @@ You can also use the `_ping` endpoint to build monitoring automation.
## Check status from the UI
The first place to check the status of UCP is the UCP web UI, since it
shows warnings for situations that require your immediate attention.
Administrators might see more warnings than regular users.
@ -22,7 +21,11 @@ managed by UCP are healthy or not.
![UCP dashboard](../../images/monitor-ucp-1.png){: .with-border}
Each node has a status message explaining any problems with the node.
In this example, a Windows worker node is down.
[Learn more about node status](troubleshoot-node-messages.md).
Click the node to get more info on its status. In the details pane, click
**Actions** and select **Agent logs** to see the log entries from the
node.
## Check status from the CLI
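As a starting point, a minimal health probe from the CLI might combine `docker node ls`, run with an admin client bundle loaded, with the `_ping` endpoint mentioned above for monitoring automation. This is an illustrative sketch; the manager address is a placeholder you need to replace.

```bash
# List the nodes in the swarm and their status (requires an admin client bundle).
$ docker node ls

# Probe a single manager directly. Replace the placeholder URL with your own
# UCP manager address; a healthy manager returns HTTP 200.
$ curl -k -sS -o /dev/null -w "%{http_code}\n" https://ucp.example.com/_ping
200
```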

View File

@ -1,14 +1,14 @@
---
title: Troubleshoot swarm configurations
description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
keywords: troubleshoot, etcd, rethinkdb, key, value, store, database, ucp, health, swarm
---
UCP automatically tries to heal itself by monitoring its internal
components and trying to bring them to a healthy state.
In most cases, if a single UCP component is persistently in a failed state,
you should be able to restore the cluster to a healthy state by
removing the unhealthy node from the cluster and joining it again.
[Learn how to remove and join nodes](../configure/scale-your-cluster.md).
@ -16,10 +16,11 @@ removing the unhealthy node from the cluster and joining it again.
UCP persists configuration data on an [etcd](https://coreos.com/etcd/)
key-value store and [RethinkDB](https://rethinkdb.com/) database that are
replicated on all manager nodes of the UCP swarm. These data stores are for
internal use only and should not be used by other applications.
### With the HTTP API
In this example, we'll use `curl` to make requests to the key-value
store REST API, and `jq` to process the responses.
@ -32,18 +33,19 @@ $ sudo apt-get update && apt-get install curl jq
1. Use a client bundle to authenticate your requests.
[Learn more](../../user/access-ucp/cli-based-access.md).
2. Use the REST API to access the cluster configurations. The `$DOCKER_HOST`
and `$DOCKER_CERT_PATH` environment variables are set when using the client
bundle.
```bash
$ export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"

$ curl -s \
    --cert ${DOCKER_CERT_PATH}/cert.pem \
    --key ${DOCKER_CERT_PATH}/key.pem \
    --cacert ${DOCKER_CERT_PATH}/ca.pem \
    ${KV_URL}/v2/keys | jq "."
```
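The same request pattern works for a specific key path, which can be handy when you only want to inspect one part of the configuration. A sketch; the `docker/nodes` path is illustrative and may not exist in your deployment:

```bash
# Query a single key recursively instead of the full key listing.
$ curl -s \
    --cert ${DOCKER_CERT_PATH}/cert.pem \
    --key ${DOCKER_CERT_PATH}/key.pem \
    --cacert ${DOCKER_CERT_PATH}/ca.pem \
    "${KV_URL}/v2/keys/docker/nodes?recursive=true" | jq "."
```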
To learn more about the key-value store REST API, check the
[etcd official documentation](https://coreos.com/etcd/docs/latest/).
@ -69,15 +71,16 @@ member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.
cluster is healthy
```
On failure, the command exits with an error code and no output.
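Since the exit code is the only failure signal, a monitoring script can branch on it directly. The following is a sketch; the `ucp-kv` container name, the endpoint, and the certificate paths are assumptions based on a default UCP install, so adjust them for your environment.

```bash
# Run the health check on a manager node and act on the exit code.
if docker exec ucp-kv etcdctl \
    --endpoint https://127.0.0.1:2379 \
    --ca-file /etc/docker/ssl/ca.pem \
    --cert-file /etc/docker/ssl/cert.pem \
    --key-file /etc/docker/ssl/key.pem \
    cluster-health > /dev/null 2>&1; then
  echo "etcd reports a healthy cluster"
else
  echo "etcd health check failed (non-zero exit code)" >&2
fi
```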
To learn more about the `etcdctl` utility, check the
[etcd official documentation](https://coreos.com/etcd/docs/latest/).
## RethinkDB Database
User and organization data for Docker Enterprise Edition is stored in a
RethinkDB database which is replicated across all manager nodes in the UCP
swarm.
Replication and failover of this database are typically handled automatically by
UCP's own configuration management processes, but detailed database status and
@ -98,6 +101,23 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# in the RethinkDB cluster.
docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status
{% endraw %}
Server Status: [
  {
    "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
    "Name": "ucp_auth_store_192_168_1_25",
    "Network": {
      "CanonicalAddresses": [
        {
          "Host": "192.168.1.25",
          "Port": 12384
        }
      ],
      "TimeConnected": "2017-07-14T17:21:44.198Z"
    }
  }
]
...
```
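The `${VERSION}` and `${NODE_ADDRESS}` variables used by the command above are defined earlier in the procedure, outside this excerpt. As a sketch of how they might be set on a manager node (these assignments are assumptions; adjust them for your environment):

```bash
# Address of the current manager node, taken from the local engine's swarm info.
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')

# Tag of the docker/ucp-auth image that is available locally.
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)

# Number of manager nodes in the swarm, used by the reconfigure-db command below.
NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
```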
### Manually reconfigure database replication
@ -114,6 +134,13 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# number of replicas equal to the number of manager nodes in the cluster.
docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS} --emergency-repair
{% endraw %}
time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Emergency Repairing Tables..."
time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Emergency Repaired Table \"grant_objects\""
...
```
## Where to go next

View File

@ -1,13 +1,13 @@
---
title: Troubleshoot UCP node states
description: Learn how to troubleshoot individual UCP nodes.
keywords: UCP, troubleshoot, health, swarm
---
There are several cases in the lifecycle of UCP when a node is actively
transitioning from one state to another, such as when a new node is joining the
swarm or during node promotion and demotion. In these cases, the current step
of the transition will be reported by UCP as a node message. You can view the
state of each individual node by following the same steps required to [monitor
cluster status](index.md).
@ -19,11 +19,11 @@ UCP node, their explanation, and the expected duration of a given step.
| Message | Description | Typical step duration |
|:-----------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------|
| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
| The ucp-agent task is <state> | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP swarm. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
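For the duplicate Engine ID case in the last row, a quick way to compare IDs and force a new one is sketched below; the `systemctl` command assumes a systemd-based Linux host.

```bash
# Print the Engine ID on each suspect node and compare the values.
$ docker info --format '{{.ID}}'

# On the node with the duplicate ID, remove the key file so a new ID is
# generated, then restart the daemon.
$ sudo rm /etc/docker/key.json
$ sudo systemctl restart docker
```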

View File

@ -1,7 +1,7 @@
---
title: Troubleshoot your cluster
description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
keywords: ucp, troubleshoot, health, swarm
---
If you detect problems in your UCP cluster, you can start your troubleshooting
@ -12,14 +12,13 @@ see information about UCP system containers.
## Check the logs from the UI
To see the logs of the UCP system containers, navigate to the **Containers**
page of UCP. By default, the UCP system containers are hidden. Click
**Settings** and check **Show system containers** for the UCP system containers
to be listed as well.
![](../../images/troubleshoot-with-logs-1.png){: .with-border}
Click on a container to see more details, like its configurations and logs.
## Check the logs from the CLI
@ -29,28 +28,35 @@ specially useful if the UCP web application is not working.
1. Get a client certificate bundle.
When using the Docker CLI client, you need to authenticate using client
certificates.
[Learn how to use client certificates](../../user/access-ucp/cli-based-access.md).
If your client certificate bundle is for a non-admin user, you won't have
permissions to see the UCP system containers.
2. Check the logs of UCP system containers. By default, system containers
aren't displayed. Use the `-a` flag to display them.
```bash
$ docker ps -a
CONTAINER ID   IMAGE                                       COMMAND                  CREATED       STATUS                   PORTS                   NAMES
8b77cfa87889   dockerorcadev/ucp-agent:2.2.0-latest        "/bin/ucp-agent re..."   3 hours ago   Exited (0) 3 hours ago                           ucp-reconcile
b844cf76a7a5   dockerorcadev/ucp-agent:2.2.0-latest        "/bin/ucp-agent agent"   3 hours ago   Up 3 hours               2376/tcp                ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
de5b45871acb   dockerorcadev/ucp-controller:2.2.0-latest   "/bin/controller s..."   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:443->8080/tcp   ucp-controller
...
```
3. Get the log from a UCP container by using the `docker logs <ucp container ID>`
command. For example, the following command emits the log for the
`ucp-controller` container listed above.
```bash
$ docker logs de5b45871acb
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
```
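When you're watching a problem as it happens, it can help to tail a component's log instead of dumping it all at once. A sketch using standard `docker logs` options; the container name is the one listed by `docker ps -a` above:

```bash
# Follow recent controller output with timestamps.
$ docker logs --timestamps --since 15m --follow ucp-controller
```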
## Get a support dump
@ -64,30 +70,31 @@ the status of the UCP cluster. Changing the UCP log level restarts all UCP
system components and introduces a small downtime window to UCP. Your
applications won't be affected by this.
To increase the UCP log level, navigate to the UCP web UI, go to the
**Admin Settings** tab, and choose **Logs**.
![](../../images/troubleshoot-with-logs-2.png){: .with-border}
Once you change the log level to **Debug**, the UCP containers restart.
Now that the UCP components are creating more descriptive logs, you can
download a support dump and use it to troubleshoot the component causing the
problem.
Depending on the problem you're experiencing, it's more likely that you'll
find related messages in the logs of specific components on manager nodes:
* If the problem occurs after a node was added or removed, check the logs
  of the `ucp-reconcile` container.
* If the problem occurs in the normal state of the system, check the logs
  of the `ucp-controller` container.
* If you are able to visit the UCP web UI but unable to log in, check the
  logs of the `ucp-auth-api` and `ucp-auth-store` containers.
It's normal for the `ucp-reconcile` container to be in a stopped state. This
container starts only when the `ucp-agent` detects that a node needs to
transition to a different state. The `ucp-reconcile` container is responsible
for creating and removing containers, issuing certificates, and pulling
missing images.
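Because the logs of an exited container remain available, you can still inspect the last reconciliation run. A minimal sketch:

```bash
# The ucp-reconcile container is normally stopped, but its last run's output
# can still be retrieved.
$ docker logs ucp-reconcile
```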
## Where to go next

Binary file not shown. Before: 242 KiB, After: 72 KiB

Binary file not shown. Before: 257 KiB, After: 80 KiB

Binary file not shown. Before: 113 KiB, After: 120 KiB

Binary file not shown. Before: 223 KiB, After: 61 KiB