website/content/docs/dev/deploy/monitor/key-metrics.md

---
title: Key Metrics
description: Learn some key metrics displayed on the Grafana Overview dashboard.
menu:
    "dev":
        parent: Monitor and Alert-dev
        weight: 5
        identifier: Key Metrics-dev
---

If your TiKV cluster is deployed using TiUP, the monitoring system is deployed at the same time. For more details, see Overview of the TiKV Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards which include Overview, PD, TiKV, and so on. You can use various metrics to diagnose the cluster.

You can also deploy your own Grafana server to monitor the TiKV cluster, especially when you use TiKV without TiDB. This document describes the key metrics so that you can monitor the Prometheus metrics you are interested in.
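If you want to inspect a metric without going through Grafana, you can query Prometheus directly. The following is a minimal sketch, not an official TiKV tool. It uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The address `http://127.0.0.1:9090` and the helper names `build_query_url` and `run_instant_query` are assumptions made for illustration; replace the address with that of your own Prometheus server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus listens on this address. Adjust it for your deployment.
PROMETHEUS_URL = "http://127.0.0.1:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the URL of an instant query against the Prometheus HTTP API."""
    # urlencode percent-encodes braces, quotes, and other PromQL characters.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> list:
    """Run the query and return the list of samples in the response."""
    with urlopen(build_query_url(promql, base_url), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    # Each sample is a dict with a "metric" label set and a "value" pair.
    return body["data"]["result"]
```

For example, `run_instant_query('tikv_store_size_bytes{type="available"}')` would return one sample per series that matches the selector, provided that a Prometheus server is reachable at the configured address.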

## Key metrics description

To understand the key metrics, check the following table:

| Service | Metric Name | Description | Normal Range |
|:--------|:------------|:------------|:-------------|
| Cluster | tikv_store_size_bytes | The size of storage. The metric has a `type` label (such as "capacity", "available"). | |
| gRPC | tikv_grpc_msg_duration_seconds | Bucketed histogram of gRPC server messages. The metric has a `type` label which represents the type of the server message. You can count the metric and calculate the QPS. | |
| gRPC | tikv_grpc_msg_fail_total | The total number of gRPC message handling failures. The metric has a `type` label which represents the gRPC message type. | |
| gRPC | tikv_server_grpc_req_batch_size | Batch size of gRPC requests. | |
| Scheduler | tikv_scheduler_too_busy_total | The total count of too busy schedulers. The metric has a `type` label which represents the scheduler type. | |
| Scheduler | tikv_scheduler_contex_total | The total number of pending commands. The scheduler receives commands from clients and executes them against the MVCC layer storage engine. | |
| Scheduler | tikv_scheduler_stage_total | Total number of commands on each stage. The metric has two labels: `type` and `stage`. `stage` represents the stage of executed commands like "read_finish", "async_snapshot_err", "snapshot", and so on. | |
| Scheduler | tikv_scheduler_commands_pri_total | Total count of different priority commands. The metric has a `priority` label. | |
| Server | tikv_server_grpc_resp_batch_size | Batch size of gRPC responses. | |
| Server | tikv_server_report_failure_msg_total | Total number of reporting failure messages. The metric has two labels: `type` and `store_id`. `type` represents the failure type, and `store_id` represents the destination peer store ID. | |
| Server | tikv_server_raft_message_flush_total | Total number of Raft messages flushed immediately. | |
| Server | tikv_server_raft_message_recv_total | Total number of Raft messages received. | |
| Server | tikv_region_written_keys | Histogram of written keys for regions. | |
| Server | tikv_server_send_snapshot_duration_seconds | Bucketed histogram of duration in which the server sends snapshots. | |
| Server | tikv_region_written_bytes | Histogram of bytes written for regions. | |
| Raft | tikv_raftstore_leader_missing | Total number of regions that are missing a leader. | |
| Raft | tikv_raftstore_region_count | The number of regions collected in each TiKV node. The `type` label has two values: `region` and `leader`. `region` represents regions collected, and `leader` represents the number of leaders in each TiKV node. | |
| Raft | tikv_raftstore_region_size | Bucketed histogram of approximate region size. | |
| Raft | tikv_raftstore_apply_log_duration_seconds | Bucketed histogram of the duration in which each peer applies logs. | |
| Raft | tikv_raftstore_commit_log_duration_seconds | Bucketed histogram of the duration in which each peer commits logs. | |
| Raft | tikv_raftstore_raft_ready_handled_total | Total number of Raft ready handled. The metric has a `type` label. | |
| Raft | tikv_raftstore_raft_process_duration_secs | Bucketed histogram of duration in which each peer processes Raft. The metric has a `type` label. | |
| Raft | tikv_raftstore_event_duration | Duration of raftstore events. The metric has a `type` label. | |
| Raft | tikv_raftstore_raft_sent_message_total | Total number of messages sent by Raft ready. The metric has a `type` label. | |
| Raft | tikv_raftstore_raft_dropped_message_total | Total number of messages dropped by Raft. The metric has a `type` label. | |
| Raft | tikv_raftstore_apply_proposal | The count of proposals sent by a region at once. | |
| Raft | tikv_raftstore_proposal_total | Total number of proposals made. The metric has a `type` label. | |
| Raft | tikv_raftstore_request_wait_time_duration_secs | Bucketed histogram of request wait time duration. | |
| Raft | tikv_raftstore_propose_log_size | Bucketed histogram of the size of each peer proposing log. | |
| Raft | tikv_raftstore_apply_wait_time_duration_secs | Bucketed histogram of apply task wait time duration. | |
| Raft | tikv_raftstore_admin_cmd_total | Total number of admin commands processed. The metric has two labels: `type` and `status`. | |
| Raft | tikv_raftstore_check_split_total | Total number of raftstore split checks. The metric has a `type` label. | |
| Raft | tikv_raftstore_check_split_duration_seconds | Bucketed histogram of duration for the raftstore split check. | |
| Raft | tikv_raftstore_local_read_reject_total | Total number of rejections from the local reader. The metric has a `reason` label which represents the rejection reason. | |
| Raft | tikv_raftstore_snapshot_duration_seconds | Bucketed histogram of raftstore snapshot process duration. The metric has a `type` label. | |
| Raft | tikv_raftstore_snapshot_traffic_total | The total amount of raftstore snapshot traffic. The metric has a `type` label. | |
| Raft | tikv_raftstore_local_read_executed_requests | Total number of requests directly executed by the local reader. | |
| Coprocessor | tikv_coprocessor_request_duration_seconds | Bucketed histogram of coprocessor request duration. The metric has a `req` label. | |
| Coprocessor | tikv_coprocessor_request_error | Total number of push-down request errors. The metric has a `reason` label. | |
| Coprocessor | tikv_coprocessor_scan_keys | Bucketed histogram of scan keys observed per request. The metric has a `req` label which represents the tag of requests. | |
| Coprocessor | tikv_coprocessor_rocksdb_perf | Total number of RocksDB internal operations from PerfContext. The metric has two labels: `req` and `metric`. `req` represents the tag of requests, and `metric` is the performance metric, such as "block_cache_hit_count", "block_read_count", "encrypt_data_nanos", and so on. | |
| Coprocessor | tikv_coprocessor_executor_count | The number of various query operations. The metric has a single `type` label which represents the related query operation (for example, "limit", "top_n", and "batch_table_scan"). | |
| Coprocessor | tikv_coprocessor_response_bytes | Total bytes of response body. | |
| Storage | tikv_storage_mvcc_versions | Histogram of versions for each key. | |
| Storage | tikv_storage_mvcc_gc_delete_versions | Histogram of versions deleted by GC for each key. | |
| Storage | tikv_storage_mvcc_conflict_counter | Total number of conflict errors. The metric has a `type` label. | |
| Storage | tikv_storage_mvcc_duplicate_cmd_counter | Total number of duplicated commands. The metric has a `type` label. | |
| Storage | tikv_storage_mvcc_check_txn_status | Counter of different results of `check_txn_status`. The metric has a `type` label. | |
| Storage | tikv_storage_command_total | Total number of commands received. The metric has a `type` label. | |
| Storage | tikv_storage_engine_async_request_duration_seconds | Bucketed histogram of the duration of processing successful asynchronous requests. The metric has a `type` label. | |
| Storage | tikv_storage_engine_async_request_total | Total number of engine asynchronous requests. The metric has two labels: `type` and `status`. | |
| GC | tikv_gcworker_gc_task_fail_vec | Counter of failed GC tasks. The metric has a `task` label. | |
| GC | tikv_gcworker_gc_task_duration_vec | Duration of GC task execution. The metric has a `task` label. | |
| GC | tikv_gcworker_gc_keys | Counter of keys affected during GC. The metric has two labels: `cf` and `tag`. | |
| GC | tikv_gcworker_autogc_processed_regions | Regions processed by auto GC. The metric has a `type` label. | |
| GC | tikv_gcworker_autogc_safe_point | Safe point used for auto GC. The metric has a `type` label. | |
| Snapshot | tikv_snapshot_size | Size of snapshot. | |
| Snapshot | tikv_snapshot_kv_count | Total number of KVs in the snapshot. | |
| Snapshot | tikv_worker_handled_task_total | Total number of tasks handled by the worker. The metric has a `name` label. | |
| Snapshot | tikv_worker_pending_task_total | The number of tasks that the worker is currently running or that are pending. The metric has a `name` label. | |
| Snapshot | tikv_futurepool_handled_task_total | The total number of tasks handled by future_pool. The metric has a `name` label. | |
| Snapshot | tikv_snapshot_ingest_sst_duration_seconds | Bucketed histogram of RocksDB ingestion durations. | |
| Snapshot | tikv_futurepool_pending_task_total | The number of tasks currently pending or running in future_pool. The metric has a `name` label. | |
| RocksDB | tikv_engine_get_served | Get queries served by the engine. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_write_stall | Histogram of write stall. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_size_bytes | Size of each column family. The metric has two labels: `db` and `type`. `db` represents which database is being counted (for example, "kv", "raft"), and `type` represents the type of column family (for example, "default", "lock", "raft", "write"). | |
| RocksDB | tikv_engine_flow_bytes | Bytes and keys of read/write. The metric has a `type` label (for example, "capacity", "available"). | |
| RocksDB | tikv_engine_wal_file_synced | The number of times WAL sync is done. The metric has a `db` label. | |
| RocksDB | tikv_engine_get_micro_seconds | Histogram of get duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_locate | The number of calls to seek/next/prev. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_seek_micro_seconds | Histogram of seek duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_write_served | Write queries served by the engine. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_write_micro_seconds | Histogram of write duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_write_wal_time_micro_seconds | Histogram of WAL write duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_event_total | Number of engine events. The metric has three labels: `db`, `cf`, and `type`. | |
| RocksDB | tikv_engine_wal_file_sync_micro_seconds | Histogram of WAL file sync duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_sst_read_micros | Histogram of SST read duration in microseconds. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_compaction_time | Histogram of compaction time. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_block_cache_size_bytes | Block cache usage of each column family. The metric has two labels: `db` and `cf`. | |
| RocksDB | tikv_engine_compaction_reason | The number of compaction reasons. The metric has three labels: `db`, `cf`, and `reason`. | |
| RocksDB | tikv_engine_cache_efficiency | Efficiency of RocksDB's block cache. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_memtable_efficiency | Hit and miss of memtable. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_bloom_efficiency | Efficiency of RocksDB's bloom filter. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_estimate_num_keys | Estimated number of keys in each column family. The metric has two labels: `db` and `cf`. | |
| RocksDB | tikv_engine_compaction_flow_bytes | Bytes of read/write during compaction. | |
| RocksDB | tikv_engine_bytes_per_read | Histogram of bytes per read. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_read_amp_flow_bytes | Bytes of read amplification. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_bytes_per_write | Histogram of bytes per write. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_num_snapshots | Number of unreleased snapshots. The metric has a `db` label. | |
| RocksDB | tikv_engine_pending_compaction_bytes | Pending compaction bytes. The metric has two labels: `db` and `cf`. | |
| RocksDB | tikv_engine_num_files_at_level | Number of files at each level. The metric has three labels: `db`, `cf`, and `level`. | |
| RocksDB | tikv_engine_compression_ratio | Compression ratio at different levels. The metric has three labels: `db`, `cf`, and `level`. | |
| RocksDB | tikv_engine_oldest_snapshot_duration | Oldest unreleased snapshot duration in seconds. The metric has a `db` label. | |
| RocksDB | tikv_engine_write_stall_reason | QPS of each reason which causes TiKV write stall. The metric has two labels: `db` and `type`. | |
| RocksDB | tikv_engine_memory_bytes | Size of each column family. The metric has three labels: `db`, `cf`, and `type`. | |
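
The table notes that you can count a histogram metric such as `tikv_grpc_msg_duration_seconds` to calculate the QPS. The following minimal sketch shows one way to build such PromQL expressions as strings. It assumes that the metrics are exposed as standard Prometheus histograms and counters, so that a histogram has `_count` and `_bucket` series, and that series carry the `instance` label that Prometheus attaches to each scrape target. The helper names, the `1m` window, and the grouping labels are illustrative choices, not part of TiKV.

```python
def qps_expr(histogram: str, window: str = "1m", by: str = "type") -> str:
    """Observations per second, derived from the histogram's _count series."""
    return f"sum(rate({histogram}_count[{window}])) by ({by})"


def quantile_expr(histogram: str, q: float = 0.99,
                  window: str = "1m", by: str = "type") -> str:
    """Estimated q-quantile, derived from the histogram's _bucket series."""
    # histogram_quantile needs the `le` bucket label to survive aggregation.
    return (f"histogram_quantile({q}, "
            f"sum(rate({histogram}_bucket[{window}])) by (le, {by}))")


def counter_rate_expr(counter: str, window: str = "1m", by: str = "type") -> str:
    """Per-second increase of a counter such as tikv_grpc_msg_fail_total."""
    return f"sum(rate({counter}[{window}])) by ({by})"


def store_available_ratio_expr() -> str:
    """Fraction of storage still available on each instance."""
    # Aggregating by instance drops the `type` label so both sides match.
    return ('sum(tikv_store_size_bytes{type="available"}) by (instance) / '
            'sum(tikv_store_size_bytes{type="capacity"}) by (instance)')
```

For example, `qps_expr("tikv_grpc_msg_duration_seconds")` yields an expression for the gRPC QPS per message type, and `quantile_expr("tikv_grpc_msg_duration_seconds")` yields one for the estimated 99th percentile duration. You can paste the resulting strings into a Grafana panel or the Prometheus expression browser.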