mirror of https://github.com/docker/docs.git
ucp: add "heartbeat failure" node state
Also:
- Align table columns
- Alpha sort table by symptom
This commit is contained in:
parent 7a6c818ff4
commit 2379901ef2
@@ -17,13 +17,14 @@ cluster status](index.md).
 The following table lists all possible node states that may be reported for a
 UCP node, their explanation, and the expected duration of a given step.
 
-| Message | Description | Typical step duration |
-|:--------|:------------|:----------------------|
-| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
-| The ucp-agent task is `state` | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP swarm. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
-| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and can't determine its state. | 1-10 seconds |
-| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
-| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
-| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
-| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
-| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
+| Message | Description | Typical step duration |
+| :------ | :---------- | :-------------------- |
+| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
+| heartbeat failure | The node has not contacted any swarm managers in the last 10 seconds. Check Swarm state in `docker info` on the node. `inactive` means the node has been removed from the swarm with `docker swarm leave`. `pending` means dockerd on the node has been attempting to contact a manager since dockerd on the node started. Confirm network security policy allows tcp port 2377 from the node to managers. `error` means an error prevented swarm from starting on the node. Check docker daemon logs on the node. | Until resolved |
+| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
+| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
+| The ucp-agent task is `state` | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP cluster. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
+| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
+| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
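The new "heartbeat failure" row suggests two checks: the node's Swarm state via `docker info`, and TCP reachability of port 2377 toward a manager. A minimal sketch of both, not part of the commit; it assumes the Docker CLI, `bash`, and `timeout` are available on the node, and `MANAGER_IP` is a placeholder you must substitute:

```shell
# Check the local Swarm membership state of this node.
# Expected values per the table: active, inactive, pending, error (or locked).
if command -v docker >/dev/null 2>&1; then
  state=$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null) || state="unknown"
  [ -n "$state" ] || state="unknown"
else
  state="unknown"   # Docker CLI not installed on this host
fi
echo "swarm state: ${state}"

# Confirm the network path allows tcp port 2377 from this node to a manager.
# MANAGER_IP is a placeholder -- substitute a real manager address.
MANAGER_IP=${MANAGER_IP:-127.0.0.1}
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${MANAGER_IP}/2377" 2>/dev/null; then
  reach="yes"
else
  reach="no"
fi
echo "port 2377 reachable: ${reach}"
```

If the state is `pending` and the port is unreachable, the table's guidance points at a firewall or security-group rule blocking 2377 rather than a problem on the node itself.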
@@ -17,14 +17,15 @@ cluster status](index.md).
 
 The following table lists all possible node states that may be reported for a
 UCP node, their explanation, and the expected duration of a given step.
 
-| Message | Description | Typical step duration |
-|:--------|:------------|:----------------------|
-| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
-| The ucp-agent task is `state` | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP cluster. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
-| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
-| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
-| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
-| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
-| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
-| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
+| Message | Description | Typical step duration |
+| :------ | :---------- | :-------------------- |
+| Completing node registration | Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the UCP swarm. | 5 - 30 seconds |
+| heartbeat failure | The node has not contacted any swarm managers in the last 10 seconds. Check Swarm state in `docker info` on the node. `inactive` means the node has been removed from the swarm with `docker swarm leave`. `pending` means dockerd on the node has been attempting to contact a manager since dockerd on the node started. Confirm network security policy allows tcp port 2377 from the node to managers. `error` means an error prevented swarm from starting on the node. Check docker daemon logs on the node. | Until resolved |
+| Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
+| Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
+| The ucp-agent task is `state` | The `ucp-agent` task on the target node is not in a running state yet. This is an expected message when configuration has been updated, or when a new node was first joined to the UCP cluster. This step may take a longer time duration than expected if the UCP images need to be pulled from Docker Hub on the affected node. | 1-10 seconds |
+| Unable to determine node state | The `ucp-reconcile` container on the target node just started running and we are not able to determine its state. | 1-10 seconds |
+| Unhealthy UCP Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there's either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of UCP itself. Please ensure the underlying networking infrastructure is operational, and [contact support](../../get-support.md) if the symptom persists. | Until resolved |
+| Unhealthy UCP Controller: Docker Swarm Cluster: Local node `<ip>` has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it's added to the node inventory and discovered as `Pending` by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines don't have duplicate IDs. Use `docker info` to see the Engine ID. Refresh the ID by removing the `/etc/docker/key.json` file and restarting the daemon. | Until resolved |
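The "status Pending" row tells you to compare Engine IDs across nodes with `docker info` and refresh duplicates by removing `/etc/docker/key.json`. A hedged sketch of the read-only part (assumes the Docker CLI; the destructive remediation is left as comments because the service-restart command is host-specific):

```shell
# Print the local Engine ID so values can be compared across nodes by hand.
# Duplicate IDs are what produce the "status Pending" symptom in the table.
if command -v docker >/dev/null 2>&1; then
  engine_id=$(docker info --format '{{.ID}}' 2>/dev/null) || engine_id="unavailable"
  [ -n "$engine_id" ] || engine_id="unavailable"
else
  engine_id="unavailable"   # Docker CLI not installed on this host
fi
echo "engine id: ${engine_id}"

# If two nodes report the same ID, regenerate it on one of them, as the
# table describes:
#   sudo rm /etc/docker/key.json
#   sudo systemctl restart docker   # or this host's equivalent service command
```

Run the first part on each node and diff the printed IDs; only the node with a duplicate needs the key file removed.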