Update troubleshooting topics (#140)
* Update troubleshooting topics
* Trim log dump
@@ -1,8 +1,7 @@
 ---
-title: Monitor the cluster status
-description: Monitor your Docker Universal Control Plane installation, and learn how
-to troubleshoot it.
-keywords: Docker, UCP, troubleshoot
+title: Monitor the swarm status
+description: Monitor your Docker Universal Control Plane installation, and learn how to troubleshoot it.
+keywords: UCP, troubleshoot, health, swarm
 ---
 
 You can monitor the status of UCP by using the web UI or the CLI.
@@ -10,7 +9,7 @@ You can also use the `_ping` endpoint to build monitoring automation.
 
 ## Check status from the UI
 
-The first place to check the status of UCP is the **UCP web UI**, since it
+The first place to check the status of UCP is the UCP web UI, since it
 shows warnings for situations that require your immediate attention.
 Administrators might see more warnings than regular users.
 
@@ -22,7 +21,11 @@ managed by UCP are healthy or not.
 {: .with-border}
 
 Each node has a status message explaining any problems with the node.
+In this example, a Windows worker node is down.
 [Learn more about node status](troubleshoot-node-messages.md).
+Click the node to get more info on its status. In the details pane, click
+**Actions** and select **Agent logs** to see the log entries from the
+node.
 
 
 ## Check status from the CLI

@@ -1,14 +1,14 @@
 ---
-title: Troubleshoot cluster configurations
+title: Troubleshoot swarm configurations
 description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
-keywords: ectd, rethinkdb, key, value, store, database, ucp
+keywords: troubleshoot, etcd, rethinkdb, key, value, store, database, ucp, health, swarm
 ---
 
-UCP automatically tries to heal itself by monitoring it's internal
+UCP automatically tries to heal itself by monitoring its internal
 components and trying to bring them to a healthy state.
 
-In most cases, if a single UCP component is persistently in a
-failed state, you should be able to restore the cluster to a healthy state by
+In most cases, if a single UCP component is in a failed state persistently,
+you should be able to restore the cluster to a healthy state by
 removing the unhealthy node from the cluster and joining it again.
 [Learn how to remove and join nodes](../configure/scale-your-cluster.md).
 
@@ -16,10 +16,11 @@ removing the unhealthy node from the cluster and joining it again.
 
 UCP persists configuration data on an [etcd](https://coreos.com/etcd/)
 key-value store and [RethinkDB](https://rethinkdb.com/) database that are
-replicated on all manager nodes of the UCP cluster. These data stores are for
-internal use only, and should not be used by other applications.
+replicated on all manager nodes of the UCP swarm. These data stores are for
+internal use only and should not be used by other applications.
 
 ### With the HTTP API
+
 In this example we'll use `curl` for making requests to the key-value
 store REST API, and `jq` to process the responses.
 
@@ -32,18 +33,19 @@ $ sudo apt-get update && apt-get install curl jq
 1. Use a client bundle to authenticate your requests.
 [Learn more](../../user/access-ucp/cli-based-access.md).
 
-2. Use the REST API to access the cluster configurations.
+2. Use the REST API to access the cluster configurations. The $DOCKER_HOST
+and $DOCKER_CERT_PATH environment variables are set when using the client
+bundle.
 
-```bash
-# $DOCKER_HOST and $DOCKER_CERT_PATH are set when using the client bundle
-$ export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"
+```bash
+$ export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"
 
-$ curl -s \
---cert ${DOCKER_CERT_PATH}/cert.pem \
---key ${DOCKER_CERT_PATH}/key.pem \
---cacert ${DOCKER_CERT_PATH}/ca.pem \
-${KV_URL}/v2/keys | jq "."
-```
+$ curl -s \
+--cert ${DOCKER_CERT_PATH}/cert.pem \
+--key ${DOCKER_CERT_PATH}/key.pem \
+--cacert ${DOCKER_CERT_PATH}/ca.pem \
+${KV_URL}/v2/keys | jq "."
+```
 
 To learn more about the key-value store REST API check the
 [etcd official documentation](https://coreos.com/etcd/docs/latest/).
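The `KV_URL` line in the hunk above derives the key-value store address from `$DOCKER_HOST` with two `cut` calls. As a minimal sketch of the same derivation using only shell parameter expansion, assuming a `DOCKER_HOST` value of the form `tcp://<host>:<port>` (the function name `kv_url_from_docker_host` is hypothetical; port 12379 is taken from the command above):

```bash
# Hypothetical helper: build the key-value store URL from a DOCKER_HOST-style
# value such as tcp://192.168.1.25:443. Port 12379 matches the command above.
kv_url_from_docker_host() {
  local host="${1#*://}"   # drop the scheme, for example "tcp://"
  host="${host%%[:/]*}"    # drop the port and any trailing path
  printf 'https://%s:12379\n' "$host"
}

# Example with a literal address instead of the real environment variable.
kv_url_from_docker_host "tcp://192.168.1.25:443"
```

With a client bundle loaded, the result could be exported in the same way as in the original command: `export KV_URL="$(kv_url_from_docker_host "$DOCKER_HOST")"`.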
@@ -69,15 +71,16 @@ member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.
 cluster is healthy
 ```
 
-On failure the command exits with an error code, and no output.
+On failure, the command exits with an error code and no output.
 
 To learn more about the `etcdctl` utility, check the
 [etcd official documentation](https://coreos.com/etcd/docs/latest/).
 
 ## RethinkDB Database
 
-User and organization data for Docker Datacenter is stored in a RethinkDB
-database which is replicated across all manager nodes in the UCP cluster.
+User and organization data for Docker Enterprise Edition is stored in a
+RethinkDB database which is replicated across all manager nodes in the UCP
+swarm.
 
 Replication and failover of this database is typically handled automatically by
 UCP's own configuration management processes, but detailed database status and
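The hunk above notes that the etcd health check prints `cluster is healthy` on success and exits with an error code and no output on failure, so the exit status is the signal to act on in automation. A minimal sketch, assuming a hypothetical wrapper named `report_health` that runs whatever check command it is given (the health-check command itself is not shown in this diff, so `true` and `false` stand in for it):

```bash
# Hypothetical wrapper: run a health-check command and report on its exit status.
report_health() {
  local status=0
  # Discard the command's own output; only the exit status matters here.
  "$@" > /dev/null 2>&1 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "healthy"
  else
    echo "unhealthy (exit code $status)"
  fi
}

# Stand-in commands; replace them with the real health check.
report_health true
report_health false
```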
@@ -98,6 +101,23 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
 # in the RethinkDB cluster.
 docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status
 {% endraw %}
+
+Server Status: [
+  {
+    "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
+    "Name": "ucp_auth_store_192_168_1_25",
+    "Network": {
+      "CanonicalAddresses": [
+        {
+          "Host": "192.168.1.25",
+          "Port": 12384
+        }
+      ],
+      "TimeConnected": "2017-07-14T17:21:44.198Z"
+    }
+  }
+]
+...
 ```
 
 ### Manually reconfigure database replication
@@ -114,6 +134,13 @@ VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
 # number of replicas equal to the number of manager nodes in the cluster.
 docker run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS} --emergency-repair
 {% endraw %}
+
+time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
+time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
+time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
+time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Emergency Repairing Tables..."
+time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Emergency Repaired Table \"grant_objects\""
+...
 ```
 
 ## Where to go next

@@ -1,13 +1,13 @@
 ---
-title: Troubleshoot UCP Node States
+title: Troubleshoot UCP node states
 description: Learn how to troubleshoot individual UCP nodes.
-keywords: Docker, UCP, troubleshoot, health, swarm
+keywords: UCP, troubleshoot, health, swarm
 ---
 
 There are several cases in the lifecycle of UCP when a node is actively
 transitioning from one state to another, such as when a new node is joining the
-cluster or during node promotion and demotion. In these cases, the current step
-of the transition will be reported by UCP as a node message. You can view the
+swarm or during node promotion and demotion. In these cases, the current step
+of the transition will be reported by UCP as a node message. You can view the
 state of each individual node by following the same steps required to [monitor
 cluster status](index.md).
 
@@ -19,11 +19,11 @@ UCP node, their explanation, and the expected duration of a given step.
 
 | Message | Description | Typical step duration |
 |:--------|:------------|:----------------------|
-| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP cluster. | 5 - 30 seconds |
-| The ucp-agent task is <state> | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP cluster. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
+| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
+| The ucp-agent task is <state> | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP swarm. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
 | Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
 | Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
 | Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
-| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Please ensure the underlying networking infrastructure is operational and contact support if the symptom persists. | Until resolved |
-| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational and contact support if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
 | Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |

@@ -1,7 +1,7 @@
 ---
-title: Troubleshoot your swarm
 description: Learn how to troubleshoot your Docker Universal Control Plane cluster.
-keywords: docker, ucp, troubleshoot
+title: Troubleshoot your cluster
+keywords: ucp, troubleshoot, health, swarm
 ---
 
 If you detect problems in your UCP cluster, you can start your troubleshooting
@@ -12,14 +12,13 @@ see information about UCP system containers.
 ## Check the logs from the UI
 
 To see the logs of the UCP system containers, navigate to the **Containers**
-page of UCP. By default the UCP system containers are hidden. Click the
-**Show all containers** option for the UCP system containers to be listed as
-well.
+page of UCP. By default, the UCP system containers are hidden. Click
+**Settings** and check **Show system containers** for the UCP system containers
+to be listed as well.
 
 {: .with-border}
 
-You can click on a container to see more details like its configurations and
-logs.
+Click on a container to see more details, like its configurations and logs.
 
 
 ## Check the logs from the CLI
@@ -29,28 +28,35 @@ specially useful if the UCP web application is not working.
 
 1. Get a client certificate bundle.
 
-When using the Docker CLI client you need to authenticate using client
+When using the Docker CLI client, you need to authenticate using client
 certificates.
 [Learn how to use client certificates](../../user/access-ucp/cli-based-access.md).
 
 If your client certificate bundle is for a non-admin user, you won't have
 permissions to see the UCP system containers.
 
-2. Check the logs of UCP system containers.
+2. Check the logs of UCP system containers. By default, system containers
+aren't displayed. Use the `-a` flag to display them.
 
 ```bash
-# By default system containers are not displayed. Use the -a flag to display them
 $ docker ps -a
+CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+8b77cfa87889 dockerorcadev/ucp-agent:2.2.0-latest "/bin/ucp-agent re..." 3 hours ago Exited (0) 3 hours ago ucp-reconcile
+b844cf76a7a5 dockerorcadev/ucp-agent:2.2.0-latest "/bin/ucp-agent agent" 3 hours ago Up 3 hours 2376/tcp ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
+de5b45871acb dockerorcadev/ucp-controller:2.2.0-latest "/bin/controller s..." 3 hours ago Up 3 hours (unhealthy) 0.0.0.0:443->8080/tcp ucp-controller
+...
+```
 
-CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
-922503c2102a docker/ucp-controller:1.1.0-rc2 "/bin/controller serv" 4 hours ago Up 30 minutes 192.168.10.100:444->8080/tcp ucp/ucp-controller
-1b6d429f1bd5 docker/ucp-swarm:1.1.0-rc2 "/swarm join --discov" 4 hours ago Up 4 hours 2375/tcp ucp/ucp-swarm-join
+3. Get the log from a UCP container by using the `docker logs <ucp container ID>`
+command. For example, the following command emits the log for the
+`ucp-controller` container listed above.
 
-# See the logs of the ucp/ucp-controller container
-$ docker logs ucp/ucp-controller
+```bash
+$ docker logs de5b45871acb
 
-{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json","remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
-{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs","remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
+{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
+"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
+{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
+"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
 ```
 
 ## Get a support dump
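The `docker logs` sample in the hunk above shows JSON entries that carry a `level` field. A minimal sketch for narrowing such output to one level, assuming each entry keeps the literal `"level":"<name>"` text on a single line as in the sample (the function name `filter_log_level` is hypothetical, and the second input line below is an invented example, not real UCP output):

```bash
# Hypothetical helper: keep only log lines whose "level" field equals $1.
# It matches the literal text "level":"<name>", as in the sample output above.
filter_log_level() {
  grep -F "\"level\":\"$1\"" || true   # finding no match is not treated as an error
}

# Sample input standing in for the output of `docker logs`.
# The "error" entry is made up for illustration.
printf '%s\n' \
  '{"level":"info","msg":"/v1.22/containers/ucp/ucp-controller/json"}' \
  '{"level":"error","msg":"example failure"}' |
  filter_log_level error
```

Against a real container, the same filter could be applied as `docker logs de5b45871acb 2>&1 | filter_log_level error`; the `2>&1` merges the container's stderr stream into the pipe.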
@@ -64,30 +70,31 @@ the status of the UCP cluster. Changing the UCP log level restarts all UCP
 system components and introduces a small downtime window to UCP. Your
 applications won't be affected by this.
 
-To increase the UCP log level, navigate to the **UCP web UI**, go to the
+To increase the UCP log level, navigate to the UCP web UI, go to the
 **Admin Settings** tab, and choose **Logs**.
 
 {: .with-border}
 
-Once you change the log level to **Debug** the UCP containers are restarted.
-Now that the UCP components are creating more descriptive logs, you can download
-again a support dump and use it to troubleshoot the component causing the
+Once you change the log level to **Debug** the UCP containers restart.
+Now that the UCP components are creating more descriptive logs, you can
+download a support dump and use it to troubleshoot the component causing the
 problem.
 
-Depending on the problem you are experiencing, it's more likely that you'll
+Depending on the problem you're experiencing, it's more likely that you'll
 find related messages in the logs of specific components on manager nodes:
 
 * If the problem occurs after a node was added or removed, check the logs
-of the `ucp-reconcile` container.
+of the `ucp-reconcile` container.
 * If the problem occurs in the normal state of the system, check the logs
-of the `ucp-controller` container.
+of the `ucp-controller` container.
 * If you are able to visit the UCP web UI but unable to log in, check the
-logs of the `ucp-auth-api` and `ucp-auth-store` containers.
+logs of the `ucp-auth-api` and `ucp-auth-store` containers.
 
 It's normal for the `ucp-reconcile` container to be in a stopped state. This
-container is only started when the `ucp-agent` detects that a node needs to
-transition to a different state, and it is responsible for creating and removing
-containers, issuing certificates, and pulling missing images.
+container starts only when the `ucp-agent` detects that a node needs to
+transition to a different state. The `ucp-reconcile` container is responsible
+for creating and removing containers, issuing certificates, and pulling
+missing images.
 
 
 ## Where to go next

Binary image files changed (file names and dimensions not shown in this view):
* 242 KiB -> 72 KiB
* 257 KiB -> 80 KiB
* 113 KiB -> 120 KiB
* 223 KiB -> 61 KiB