website/content/docs/3.0/tasks/monitor/key-metrics.md

4.6 KiB

title description menu
Key Metrics Learn some key metrics displayed on the Grafana Overview dashboard.
3.0
parent weight
Monitor 2

If your TiKV cluster is deployed using Ansible or Docker Compose, the monitoring system is deployed at the same time. For more details, see Overview of the TiKV Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards which include Overview, PD, TiKV, and so on. You can use various metrics to help you diagnose the cluster.

For routine operations, you can get an overview of the component (PD, TiKV) status and the entire cluster from the Overview dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

Key metrics description

To understand the key metrics displayed on the Overview dashboard, check the following table:

Service Panel Name Description Normal Range
Services Port Status Services Online the online nodes number of each service
Services Port Status Services Offline the offline nodes number of each service
PD Storage Capacity the total storage capacity of the TiKV cluster
PD Current Storage Size the occupied storage capacity of the TiKV cluster
PD Number of Regions the total number of Regions of the current cluster
PD Leader Balance Ratio the leader ratio difference of the nodes with the biggest leader ratio and the smallest leader ratio It is less than 5% for a balanced situation and becomes bigger when you restart a node.
PD Region Balance Ratio the region ratio difference of the nodes with the biggest Region ratio and the smallest Region ratio It is less than 5% for a balanced situation and becomes bigger when you add or remove a node.
PD Store Status -- Up Stores the number of TiKV nodes that are up
PD Store Status -- Disconnect Stores the number of TiKV nodes that encounter abnormal communication within a short time
PD Store Status -- LowSpace Stores the number of TiKV nodes with an available space of less than 80%
PD Store Status -- Down Stores the number of TiKV nodes that are down The normal value is 0. If the number is bigger than 0, it means some node(s) are abnormal.
PD Store Status -- Offline Stores the number of TiKV nodes (still providing service) that are being made offline
PD Store Status -- Tombstone Stores the number of TiKV nodes that are successfully offline
PD 99% completed_cmds_duration_seconds the 99th percentile duration to complete a pd-server request less than 5ms
PD handle_requests_duration_seconds the request duration of a PD request
TiKV leader the number of leaders on each TiKV node
TiKV region the number of Regions on each TiKV node
TiKV CPU the CPU usage ratio on each TiKV node
TiKV Memory the memory usage on each TiKV node
TiKV store size the data amount on each TiKV node
TiKV cf size the data amount on different CFs in the cluster
TiKV channel full No data points is displayed in normal conditions. If a monitoring value displays, it means the corresponding TiKV node fails to handle the messages
TiKV server report failures No data points is displayed in normal conditions. If Unreachable is displayed, it means TiKV encounters a communication issue.
TiKV scheduler pending commands the number of commits on queue Occasional value peaks are normal.
TiKV coprocessor pending requests the number of requests on queue 0 or very small
TiKV coprocessor executor count the number of various query operations
TiKV coprocessor request duration the time consumed by TiKV queries
TiKV raft store CPU the CPU usage ratio of the raftstore thread Currently, it is a single thread. A value of over 80% indicates that the CPU usage ratio is very high.
TiKV Coprocessor CPU the CPU usage ratio of the TiKV query thread, related to the application; complex queries consume a great deal of CPU
System Info Vcores the number of CPU cores
System Info Memory the total memory
System Info CPU Usage the CPU usage ratio, 100% at a maximum
System Info Load [1m] the overload within 1 minute
System Info Memory Available the size of the available memory
System Info Network Traffic the statistics of the network traffic
System Info TCP Retrans the statistics about network monitoring and TCP
System Info IO Util the disk usage ratio, 100% at a maximum; generally you need to consider adding a new node when the usage ratio is up to 80% ~ 90%