Add multi-version support

Signed-off-by: Ana Hobden <operator@hoverbear.org>
This commit is contained in:
Ana Hobden 2020-01-07 16:45:18 -08:00
parent 28abc03942
commit d10180259c
115 changed files with 5332 additions and 1902 deletions

View File

@ -25,6 +25,9 @@ params:
googleAnalyticsId: "UA-130734531-1"
versions:
latest: "3.0"
all:
- "3.1-beta"
- "3.0"
description:
brief: "A distributed transactional key-value database"
long: "Based on the design of [Google Spanner](https://ai.google/research/pubs/pub39966) and [HBase](https://hbase.apache.org), but simpler to manage and without dependencies on any distributed filesystem"

View File

@ -2,7 +2,7 @@
title: Architecture
description: How TiKV works and how it was built
menu:
docs:
"3.0":
parent: Concepts
---

View File

@ -2,7 +2,7 @@
title: Features
description: The features of TiKV
menu:
docs:
"3.0":
parent: Concepts
---

View File

@ -2,11 +2,7 @@
title: Concepts
description: Some basic facts about TiKV
menu:
nav:
name: Concepts
parent: Docs
weight: 1
docs:
"3.0":
weight: 1
---

View File

@ -2,7 +2,7 @@
title: C Client
description: Interact with TiKV using C.
menu:
docs:
"3.0":
parent: Clients
weight: 4
---

View File

@ -2,7 +2,7 @@
title: Go Client
description: Interact with TiKV using Go.
menu:
docs:
"3.0":
parent: Clients
weight: 1
---

View File

@ -2,7 +2,7 @@
title: Clients
description: Interact with TiKV using the raw key-value API or the transactional key-value API
menu:
docs:
"3.0":
parent: Reference
weight: 2
---

View File

@ -2,7 +2,7 @@
title: Java Client
description: Interact with TiKV using Java.
menu:
docs:
"3.0":
parent: Clients
weight: 3
---

View File

@ -2,7 +2,7 @@
title: Rust Client
description: Interact with TiKV using Rust.
menu:
docs:
"3.0":
parent: Clients
weight: 2
---

View File

@ -2,10 +2,7 @@
title: Contribute
description: How to be a part of TiKV
menu:
nav:
parent: Docs
weight: 6
docs:
"3.0":
parent: Reference
weight: 6
aliases:

View File

@ -2,10 +2,7 @@
title: FAQ
description: FAQs about TiKV
menu:
nav:
parent: Docs
weight: 5
docs:
"3.0":
parent: Reference
weight: 4
aliases:

View File

@ -2,12 +2,9 @@
title: Reference
description: Details about TiKV
menu:
docs:
"3.0":
name: Reference
weight: 3
nav:
parent: Docs
weight: 3
---
This section includes instructions on using TiKV clients and tools.

View File

@ -2,7 +2,7 @@
title: Query Layers
description: Extend TiKV using stateless query layers
menu:
docs:
"3.0":
parent: Reference
weight: 3
---

View File

@ -2,7 +2,7 @@
title: Tools
description: Tools which can be used to administrate TiKV
menu:
docs:
"3.0":
parent: Reference
weight: 3
---

View File

@ -2,7 +2,7 @@
title: pd-ctl
description: Learn about interacting with pd-ctl
menu:
docs:
"3.0":
parent: Tools
weight: 4
---

View File

@ -2,7 +2,7 @@
title: pd-recover
description: Learn about interacting with pd-recover
menu:
docs:
"3.0":
parent: Tools
weight: 5
---

View File

@ -2,7 +2,7 @@
title: pd-server
description: Learn about interacting with pd-server
menu:
docs:
"3.0":
parent: Tools
weight: 3
---

View File

@ -2,7 +2,7 @@
title: tikv-ctl
description: Learn about interacting with tikv-ctl
menu:
docs:
"3.0":
parent: Tools
weight: 2
---

View File

@ -2,7 +2,7 @@
title: tikv-server
description: Learn about interacting with tikv-server
menu:
docs:
"3.0":
parent: Tools
weight: 1
---

View File

@ -3,6 +3,6 @@ title: Backup
description: Backup TiKV
draft: true
menu:
docs:
"3.0":
parent: Tasks
---

View File

@ -2,7 +2,7 @@
title: Configure
description: Configure a wide range of TiKV facets, including RocksDB, gRPC, the Placement Driver, and more
menu:
docs:
"3.0":
parent: Tasks
weight: 3
---

View File

@ -2,7 +2,7 @@
title: Limit Config
description: Learn how to configure scheduling rate limit on stores
menu:
docs:
"3.0":
parent: Configure
weight: 4
---

View File

@ -2,7 +2,7 @@
title: Namespace Config
description: Learn how to configure namespace in TiKV.
menu:
docs:
"3.0":
parent: Configure
weight: 3
---

View File

@ -2,7 +2,7 @@
title: Region Merge Config
description: Learn how to configure Region Merge in TiKV.
menu:
docs:
"3.0":
parent: Configure
weight: 5
---

View File

@ -2,7 +2,7 @@
title: RocksDB Config
description: Learn how to configure RocksDB in TiKV.
menu:
docs:
"3.0":
parent: Configure
weight: 6
---

View File

@ -2,7 +2,7 @@
title: Security Config
description: Keeping your TiKV deployment secure
menu:
docs:
"3.0":
parent: Configure
weight: 1
---

View File

@ -2,7 +2,7 @@
title: Titan Config
description: Learn how to enable Titan in TiKV.
menu:
docs:
"3.0":
parent: Configure
weight: 7
---

View File

@ -2,7 +2,7 @@
title: Topology Config
description: Learn how to configure labels.
menu:
docs:
"3.0":
parent: Configure
weight: 2
---

View File

@ -2,7 +2,7 @@
title: Ansible Deployment
description: Use TiDB-Ansible to deploy a TiKV cluster on multiple nodes.
menu:
docs:
"3.0":
parent: Deploy
weight: 2
---

View File

@ -2,7 +2,7 @@
title: Binary Deployment
description: Use binary files to deploy a TiKV cluster on a single machine or on multiple nodes for testing.
menu:
docs:
"3.0":
parent: Deploy
weight: 4
---

View File

@ -2,7 +2,7 @@
title: Docker Compose/Swarm
description: Use Docker Compose or Swarm to quickly deploy a TiKV testing cluster.
menu:
docs:
"3.0":
parent: Deploy
weight: 5
---

View File

@ -2,7 +2,7 @@
title: Docker Deployment
description: Use Docker to deploy a TiKV cluster on multiple nodes.
menu:
docs:
"3.0":
parent: Deploy
weight: 3
---

View File

@ -2,7 +2,7 @@
title: Deploy
description: Prerequisites for deploying TiKV
menu:
docs:
"3.0":
parent: Tasks
weight: 2
name: Deploy

View File

@ -2,10 +2,7 @@
title: Tasks
description: How to accomplish common tasks with TiKV
menu:
nav:
parent: Docs
weight: 2
docs:
"3.0":
weight: 2
---

View File

@ -2,7 +2,7 @@
title: Monitor
description: Monitor TiKV
menu:
docs:
"3.0":
parent: Tasks
weight: 4
---

View File

@ -2,7 +2,7 @@
title: Key Metrics
description: Learn some key metrics displayed on the Grafana Overview dashboard.
menu:
docs:
"3.0":
parent: Monitor
weight: 2
---

View File

@ -2,7 +2,7 @@
title: Monitoring a Cluster
description: Learn how to monitor the state of a TiKV cluster.
menu:
docs:
"3.0":
parent: Monitor
weight: 1
---

View File

@ -2,7 +2,7 @@
title: Ansible Scaling
description: Use TiDB-Ansible to scale out or scale in a TiKV cluster.
menu:
docs:
"3.0":
parent: Scale
---

View File

@ -2,7 +2,7 @@
title: Scale
description: Scale TiKV
menu:
docs:
"3.0":
parent: Tasks
weight: 5
---

View File

@ -2,7 +2,7 @@
title: Try
description: Try locally with Docker
menu:
docs:
"3.0":
parent: Tasks
weight: 1
---

View File

@ -0,0 +1,84 @@
---
title: Architecture
description: How TiKV works and how it was built
menu:
"3.1-beta":
parent: Concepts
---
This page discusses the core concepts and architecture behind TiKV, including:
* The [APIs](#apis) that applications can use to interact with TiKV
* The basic [system architecture](#system) underlying TiKV
* The anatomy of each [instance](#instance) in a TiKV installation
* The role of core system components, including the [Placement Driver](#placement-driver), [Store](#store), [Region](#region), and [Node](#node)
* TiKV's [transaction model](#transactions)
* The role of the [Raft consensus algorithm](#raft) in TiKV
* The [origins](#origins) of TiKV
## APIs
TiKV provides two APIs that you can use to interact with it:
API | Description | Atomicity | Use when...
:---|:------------|:----------|:-----------
Raw | A lower-level key-value API for interacting directly with individual key-value pairs. | Single key | Your application doesn't require distributed transactions or multi-version concurrency control (**MVCC**)
Transactional | A higher-level key-value API that provides ACID semantics | Multiple keys | Your application requires distributed transactions and/or MVCC
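To make the difference concrete, here is a minimal sketch using the [Rust client](../../reference/clients/rust); the method names follow the client examples elsewhere in these docs, while the error type and exact signatures are assumptions:
```rust
use futures::Future;
use tikv_client::{Config, raw, transaction};

fn main() -> Result<(), tikv_client::Error> {
    // Raw API: single-key atomicity, no transactions or MVCC.
    let client = raw::Client::new(Config::new(vec!["127.0.0.1:2379"])).wait()?;
    client.put("key", "value").wait()?;

    // Transactional API: multi-key atomicity with ACID semantics.
    let client = transaction::Client::new(Config::new(vec!["127.0.0.1:2379"])).wait()?;
    let txn = client.begin();
    txn.set("key", "value").wait()?;
    txn.commit().wait()?;
    Ok(())
}
```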
## System architecture {#system}
The overall architecture of TiKV is illustrated in **Figure 1** below:
{{< figure
src="/img/basic-architecture.png"
caption="The architecture of TiKV"
alt="TiKV architecture diagram"
width="70"
number="1" >}}
## TiKV instance {#instance}
The architecture of each TiKV instance is illustrated in **Figure 2** below:
{{< figure
src="/img/tikv-instance.png"
caption="TiKV instance architecture"
alt="TiKV instance architecture diagram"
width="60"
number="2" >}}
## Placement driver (PD) {#placement-driver}
The TiKV **Placement Driver** is the cluster manager of TiKV. It periodically checks replication constraints and balances load and data automatically across Nodes and Regions, in a process called **auto-sharding**.
## Store
Each Store contains a [RocksDB](https://rocksdb.org) database, which persists data to the local disk.
## Region
A Region is the basic unit of key-value data movement. Each Region is replicated to multiple Nodes, and these replicas together form a Raft group.
## Node
A TiKV **Node** is just a node in the cluster, which could be a physical machine, a virtual machine, a container, and so on. Within each Node, there are one or more **Store**s. The data in each Store is split across multiple Regions, and each Region is kept consistent across Nodes using the Raft algorithm.
When a Node starts, the metadata for the Node, Store, and Region is recorded into the Placement Driver. The status of each Region and Store is regularly reported to PD.
## Transaction model {#transactions}
TiKV's transaction model is similar to that of Google's [Percolator](https://ai.google/research/pubs/pub36726), a system built for processing updates to large data sets. Percolator uses an incremental update model in place of a batch-based model.
TiKV's transaction model provides:
* **Snapshot isolation** with lock, with semantics analogous to `SELECT ... FOR UPDATE` in SQL
* Externally consistent reads and writes in distributed transactions
## Raft
Data is distributed across TiKV instances via the [Raft consensus algorithm](https://raft.github.io/), which is based on the so-called [Raft paper](https://raft.github.io/raft.pdf) ("In Search of an Understandable Consensus Algorithm") from [Diego Ongaro](https://ongardie.net/diego/) and [John Ousterhout](https://web.stanford.edu/~ouster/cgi-bin/home.php).
## The origins of TiKV {#origins}
TiKV was originally created by [PingCAP](https://pingcap.com) to complement [TiDB](https://github.com/pingcap/tidb), a distributed [HTAP](https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing_(HTAP)) database compatible with the [MySQL protocol](https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_PROTOCOL.html).

View File

@ -0,0 +1,9 @@
---
title: Features
description: The features of TiKV
menu:
"3.1-beta":
parent: Concepts
---
{{< features >}}

View File

@ -0,0 +1,44 @@
---
title: Concepts
description: Some basic facts about TiKV
menu:
nav:
name: Concepts
parent: Docs
weight: 1
"3.1-beta":
weight: 1
---
**TiKV** is a distributed transactional key-value database originally created by [PingCAP](https://pingcap.com/en) to complement [TiDB](https://github.com/pingcap/tidb).
As an incubating project of the [Cloud Native Computing Foundation](https://www.cncf.io/), TiKV is intended to fill the role of a unifying distributed storage layer. TiKV excels at working with **data in the large** by supporting petabyte-scale deployments spanning trillions of rows.
It complements other CNCF projects and technologies like [etcd](https://etcd.io/), which is useful for low-volume metadata storage, and can be extended using [stateless query layers](../../reference/query-layers) that speak other protocols, such as [TiDB](https://github.com/pingcap/tidb) speaking MySQL.
{{< info >}}
The **Ti** in TiKV stands for **titanium**. Titanium has the highest strength-to-density ratio of any metallic element and is named after the Titans of Greek mythology.
{{< /info >}}
## Notable Features
{{< features featured >}}
You can browse a complete list on the [features](../features) page.
## Architecture
{{< figure
src="/img/basic-architecture.png"
caption="The architecture of TiKV"
alt="TiKV architecture diagram"
width="70"
number="1" >}}
You can read more in the [Concepts and architecture](../architecture/) documentation.
## Codebase, Inspiration, and Culture
TiKV is implemented in the [Rust](https://rust-lang.org) programming language. It uses technologies like [Facebook's RocksDB](https://rocksdb.org/) and [Raft](https://raft.github.io/).
The project was originally inspired by [Google Spanner](https://ai.google/research/pubs/pub39966) and [HBase](https://hbase.apache.org).

View File

@ -0,0 +1,14 @@
---
title: C Client
description: Interact with TiKV using C.
menu:
"3.1-beta":
parent: Clients
weight: 4
---
This document, like our C API, is still a work in progress. In the meantime, you can track development in the [tikv/client-c](https://github.com/tikv/client-c/) repository. Most development happens on the `dev` branch.
{{< warning >}}
You should not use the C client in production until it is released.
{{< /warning >}}

View File

@ -0,0 +1,361 @@
---
title: Go Client
description: Interact with TiKV using Go.
menu:
"3.1-beta":
parent: Clients
weight: 1
---
To suit different scenarios, TiKV provides two types of APIs for developers: the Raw Key-Value API and the Transactional Key-Value API. This document uses two examples to guide you through how to use each API in TiKV. The usage examples assume a multi-node cluster for testing, but you can also quickly try both APIs on a single machine.
{{< warning >}}
It is **not recommended or supported** to use both the raw and transactional APIs on the same keyspace.
{{< /warning >}}
## Try the Raw Key-Value API
To use the Raw Key-Value API in applications developed in the Go language, take the following steps:
1. Install the necessary packages.
```bash
export GO111MODULE=on
go mod init rawkv-demo
go get github.com/pingcap/tidb@master
```
2. Import the dependency packages.
```go
import (
"fmt"
"github.com/pingcap/tidb/config"
"github.com/pingcap/tidb/store/tikv"
)
```
3. Create a Raw Key-Value client.
```go
cli, err := tikv.NewRawKVClient([]string{"192.168.199.113:2379"}, config.Security{})
```
The two parameters in the call above are:
- `[]string`: a list of PD server addresses
- `config.Security`: used to establish TLS connections; usually left empty when you do not need TLS
4. Call the Raw Key-Value client methods to access the data on TiKV. The Raw Key-Value API contains the following methods, and you can also find them at [GoDoc](https://godoc.org/github.com/pingcap/tidb/store/tikv#RawKVClient).
```go
type RawKVClient struct
func (c *RawKVClient) Close() error
func (c *RawKVClient) ClusterID() uint64
func (c *RawKVClient) Delete(key []byte) error
func (c *RawKVClient) Get(key []byte) ([]byte, error)
func (c *RawKVClient) Put(key, value []byte) error
func (c *RawKVClient) Scan(startKey, endKey []byte, limit int) (keys [][]byte, values [][]byte, err error)
```
### Usage example of the Raw Key-Value API
```go
package main
import (
"fmt"
"github.com/pingcap/tidb/config"
"github.com/pingcap/tidb/store/tikv"
)
func main() {
cli, err := tikv.NewRawKVClient([]string{"192.168.199.113:2379"}, config.Security{})
if err != nil {
panic(err)
}
defer cli.Close()
fmt.Printf("cluster ID: %d\n", cli.ClusterID())
key := []byte("Company")
val := []byte("PingCAP")
// put key into tikv
err = cli.Put(key, val)
if err != nil {
panic(err)
}
fmt.Printf("Successfully put %s:%s to tikv\n", key, val)
// get key from tikv
val, err = cli.Get(key)
if err != nil {
panic(err)
}
fmt.Printf("found val: %s for key: %s\n", val, key)
// delete key from tikv
err = cli.Delete(key)
if err != nil {
panic(err)
}
fmt.Printf("key: %s deleted\n", key)
// get key again from tikv
val, err = cli.Get(key)
if err != nil {
panic(err)
}
fmt.Printf("found val: %s for key: %s\n", val, key)
}
```
The output looks like this:
```bash
INFO[0000] [pd] create pd client with endpoints [192.168.199.113:2379]
INFO[0000] [pd] leader switches to: http://127.0.0.1:2379, previous:
INFO[0000] [pd] init cluster id 6554145799874853483
cluster ID: 6554145799874853483
Successfully put Company:PingCAP to tikv
found val: PingCAP for key: Company
key: Company deleted
found val: for key: Company
```
RawKVClient is a client of the TiKV server that only supports the GET/PUT/DELETE/SCAN commands. The RawKVClient can be safely and concurrently accessed by multiple goroutines, as long as it is not closed, so one client per process is generally enough.
### Possible errors
- If you see this error:
```bash
build rawkv-demo: cannot load github.com/pingcap/pd/pd-client: cannot find module providing package github.com/pingcap/pd/pd-client
```
You can run `GO111MODULE=on go get -u github.com/pingcap/tidb@master` to fix it.
- If you get this error when you run `go get -u github.com/pingcap/tidb@master`:
```
go: github.com/golang/lint@v0.0.0-20190409202823-959b441ac422: parsing go.mod: unexpected module path "golang.org/x/lint"
```
You can run `go mod edit -replace github.com/golang/lint=golang.org/x/lint@latest` to fix it (see [this comment](https://github.com/golang/lint/issues/446#issuecomment-483638233)).
## Try the Transactional Key-Value API
The Transactional Key-Value API is more complicated than the Raw Key-Value API. Some transaction related concepts are listed as follows. For more details, see the [KV package](https://github.com/pingcap/tidb/tree/master/kv).
- Storage
Like the RawKVClient, a Storage represents an abstract TiKV cluster.
- Snapshot
A Snapshot is the state of a Storage at a particular point in time, and it provides some read-only methods. Multiple reads from the same Snapshot are guaranteed to be consistent.
- Transaction
Like the transactions in SQL, a Transaction symbolizes a series of read and write operations performed within the Storage. Internally, a Transaction consists of a Snapshot for reads, and a MemBuffer for all writes. The default isolation level of a Transaction is Snapshot Isolation.
To use the Transactional Key-Value API in applications developed in Go, take the following steps:
1. Install the necessary packages.
```bash
export GO111MODULE=on
go mod init txnkv-demo
go get github.com/pingcap/tidb@master
```
2. Import the dependency packages.
```go
import (
"flag"
"fmt"
"os"
"github.com/juju/errors"
"github.com/pingcap/tidb/kv"
"github.com/pingcap/tidb/store/tikv"
"github.com/pingcap/tidb/terror"
goctx "golang.org/x/net/context"
)
```
3. Create a Storage using a URL scheme.
```go
driver := tikv.Driver{}
storage, err := driver.Open("tikv://192.168.199.113:2379")
```
4. (Optional) Modify the Storage using a Transaction.
The lifecycle of a Transaction is: _begin → {get, set, delete, scan} → {commit, rollback}_.
5. Call the Transactional Key-Value API's methods to access the data on TiKV. The Transactional Key-Value API contains the following methods:
```go
Begin() -> Txn
Txn.Get(key []byte) -> (value []byte)
Txn.Set(key []byte, value []byte)
Txn.Iter(begin, end []byte) -> Iterator
Txn.Delete(key []byte)
Txn.Commit()
```
### Usage example of the Transactional Key-Value API
```go
package main
import (
"flag"
"fmt"
"os"
"github.com/juju/errors"
"github.com/pingcap/tidb/kv"
"github.com/pingcap/tidb/store/tikv"
"github.com/pingcap/tidb/terror"
goctx "golang.org/x/net/context"
)
type KV struct {
K, V []byte
}
func (kv KV) String() string {
return fmt.Sprintf("%s => %s (%v)", kv.K, kv.V, kv.V)
}
var (
store kv.Storage
pdAddr = flag.String("pd", "192.168.199.113:2379", "pd address:192.168.199.113:2379")
)
// Init initializes information.
func initStore() {
driver := tikv.Driver{}
var err error
store, err = driver.Open(fmt.Sprintf("tikv://%s", *pdAddr))
terror.MustNil(err)
}
// key1 val1 key2 val2 ...
func puts(args ...[]byte) error {
tx, err := store.Begin()
if err != nil {
return errors.Trace(err)
}
for i := 0; i < len(args); i += 2 {
key, val := args[i], args[i+1]
err := tx.Set(key, val)
if err != nil {
return errors.Trace(err)
}
}
err = tx.Commit(goctx.Background())
if err != nil {
return errors.Trace(err)
}
return nil
}
func get(k []byte) (KV, error) {
tx, err := store.Begin()
if err != nil {
return KV{}, errors.Trace(err)
}
v, err := tx.Get(k)
if err != nil {
return KV{}, errors.Trace(err)
}
return KV{K: k, V: v}, nil
}
func dels(keys ...[]byte) error {
tx, err := store.Begin()
if err != nil {
return errors.Trace(err)
}
for _, key := range keys {
err := tx.Delete(key)
if err != nil {
return errors.Trace(err)
}
}
err = tx.Commit(goctx.Background())
if err != nil {
return errors.Trace(err)
}
return nil
}
func scan(keyPrefix []byte, limit int) ([]KV, error) {
tx, err := store.Begin()
if err != nil {
return nil, errors.Trace(err)
}
it, err := tx.Iter(kv.Key(keyPrefix), nil)
if err != nil {
return nil, errors.Trace(err)
}
defer it.Close()
var ret []KV
for it.Valid() && limit > 0 {
ret = append(ret, KV{K: it.Key()[:], V: it.Value()[:]})
limit--
it.Next()
}
return ret, nil
}
func main() {
pdAddr := os.Getenv("PD_ADDR")
if pdAddr != "" {
os.Args = append(os.Args, "-pd", pdAddr)
}
flag.Parse()
initStore()
// set
err := puts([]byte("key1"), []byte("value1"), []byte("key2"), []byte("value2"))
terror.MustNil(err)
// get
kv, err := get([]byte("key1"))
terror.MustNil(err)
fmt.Println(kv)
// scan
ret, err := scan([]byte("key"), 10)
for _, kv := range ret {
fmt.Println(kv)
}
// delete
err = dels([]byte("key1"), []byte("key2"))
terror.MustNil(err)
}
```
The output looks like this:
```bash
INFO[0000] [pd] create pd client with endpoints [192.168.199.113:2379]
INFO[0000] [pd] leader switches to: http://192.168.199.113:2379, previous:
INFO[0000] [pd] init cluster id 6563858376412119197
key1 => value1 ([118 97 108 117 101 49])
key1 => value1 ([118 97 108 117 101 49])
key2 => value2 ([118 97 108 117 101 50])
```

View File

@ -0,0 +1,15 @@
---
title: Clients
description: Interact with TiKV using the raw key-value API or the transactional key-value API
menu:
"3.1-beta":
parent: Reference
weight: 2
---
TiKV has clients for a number of languages:
* [Go](../go) (Stable)
* [Java](../java) (Unstable)
* [Rust](../rust) (Unstable)
* [C](../c) (Early development)

View File

@ -0,0 +1,14 @@
---
title: Java Client
description: Interact with TiKV using Java.
menu:
"3.1-beta":
parent: Clients
weight: 3
---
This document, like our Java API, is still a work in progress. In the meantime, you can track development in the [tikv/client-java](https://github.com/tikv/client-java/) repository.
{{< warning >}}
You should not use the Java client in production until it is released.
{{< /warning >}}

View File

@ -0,0 +1,199 @@
---
title: Rust Client
description: Interact with TiKV using Rust.
menu:
"3.1-beta":
parent: Clients
weight: 2
---
TiKV offers two APIs that you can interact with:
API | Description | Atomicity | Use when...
:---|:------------|:----------|:-----------
[Raw](#raw) | A lower-level key-value API for interacting directly with individual key-value pairs. | Single key | Your application doesn't require distributed transactions or multi-version concurrency control (MVCC)
[Transactional](#transactional) | A higher-level key-value API that provides ACID semantics | Multiple keys | Your application requires distributed transactions and/or MVCC
{{< warning >}}
It is **not recommended or supported** to use both the raw and transactional APIs on the same keyspace.
{{< /warning >}}
There are several clients that connect to TiKV:
* [Rust](https://github.com/tikv/client-rust)
* [Java](https://github.com/tikv/client-java)
* [Go](https://github.com/tikv/client-go)
Below we use the Rust client for some examples, but you should find all clients work similarly.
## Basic Types {#types}
These clients use a few basic types for most of their API:
* `Key`, a wrapper around a `Vec<u8>` symbolizing the 'key' in a key-value pair.
* `Value`, a wrapper around a `Vec<u8>` symbolizing the 'value' in a key-value pair.
* `KvPair`, a tuple of `(Key, Value)` representing a key-value pair.
* `KeyRange`, a trait representing a range of `Key`s from one value to either another value, or the end of the entire dataset.
The `Key` and `Value` types implement `Deref<Target=Vec<u8>>`, so they can generally be used just like their contained values. Where possible, API calls accept `impl Into<T>` instead of the type `T` when it comes to `Key`, `Value`, and `KvPair`.
If you're using your own key or value types, we recommend implementing `Into<Key>` and/or `Into<Value>` for them where appropriate. You can also `impl KeyRange` if you have any range types.
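For example, a conversion for a custom key type might look like the sketch below; `UserId` is a hypothetical application type, and this assumes `Key` implements `From<Vec<u8>>`, consistent with the wrapper described above:
```rust
use tikv_client::Key;

// A hypothetical application-side identifier.
struct UserId(u64);

impl From<UserId> for Key {
    fn from(id: UserId) -> Key {
        // Big-endian encoding keeps numeric ordering consistent with byte
        // ordering, which matters for range scans.
        let mut buf = Vec::with_capacity(8);
        for shift in (0..8).rev() {
            buf.push((id.0 >> (shift * 8)) as u8);
        }
        Key::from(buf)
    }
}
```
With this in place, any call that accepts `impl Into<Key>` can take a `UserId` directly.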
## Add the dependency {#dependency}
This guide assumes you are using Rust 1.31 or above. You will also need an already deployed TiKV and PD cluster, since TiKV is not an embedded database.
To start, open the `Cargo.toml` of your project, and add `tikv-client` and `futures` as dependencies.
<!-- TODO: Use crates.io once published -->
```toml
[dependencies]
tikv-client = { git = "https://github.com/tikv/client-rust" }
futures = "0.1"
```
## Connect a client {#connect}
In your `src/main.rs`, import the raw API as well as the functionality of the `Future` trait.
**Note:** In this example we used `raw`, but you can also use `transaction`. The process is the same.
```rust
use tikv_client::{Config, raw::Client};
use futures::Future;
```
Build an instance of `Config`, then use it to build an instance of a `Client`.
```rust
let config = Config::new(vec![ // Always use more than one PD endpoint!
"192.168.0.100:2379",
"192.168.0.101:2379",
"192.168.0.102:2379",
]).with_security( // Configure TLS if used.
"root.ca",
"internal.cert",
"internal.key",
);
let unconnected_client = Client::new(config);
let client = unconnected_client.wait()?; // Block and resolve the future.
```
The value returned by `Client::new` is a `Future`. Futures need to be resolved in order to obtain the output. During the resolution of the future the client must create a connection with the cluster.
{{< info >}}
If your application is synchronous you can call `.wait()` to block the current task until the future is resolved. If your application is asynchronous you might have better ways (e.g. a Tokio reactor) of dealing with this.
{{< /info >}}
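As a minimal asynchronous sketch, assuming a futures 0.1-compatible runtime such as tokio 0.1, you can hand the future to the reactor instead of blocking on it:
```rust
use futures::Future;
use tikv_client::{Config, raw::Client};

fn main() {
    let config = Config::new(vec!["192.168.0.100:2379"]);
    // Spawn the connection future on the reactor rather than calling `.wait()`.
    tokio::run(
        Client::new(config)
            .map(|_client| println!("connected to the TiKV cluster"))
            .map_err(|err| eprintln!("failed to connect: {:?}", err)),
    );
}
```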
With a connected client, you'll be able to send requests to TiKV. This client supports both single and batch operations.
## Raw key-value API {#raw}
Using a connected `raw::Client`, you can perform actions such as `put`, `get`, and `delete`:
```rust
let client = Client::new(config).wait()?;
// Data stored in TiKV does not need to be UTF-8.
let key = b"TiKV".to_vec();
let value = b"Astronaut".to_vec();
// This creates a future that must be resolved.
let req = client.put(
    key.clone(),   // Vec<u8> impl Into<Key>
    value.clone()  // Vec<u8> impl Into<Value>
);
req.wait()?;
let req = client.get(key.clone());
let result = req.wait()?;
// `Value` derefs to `Vec<u8>`.
assert_eq!(result, Some(value.into()));
let req = client.delete(key.clone());
req.wait()?;
let result = client.get(key).wait()?;
assert_eq!(result, None);
```
You can also perform `scan`s, giving you all the values for keys in a given range:
```rust
// For stability and reliability, it's good to choose a reasonable limit.
const REASONABLE_LIMIT: u32 = 1000;
// If you are using UTF-8, `Key` and `Value` arguments can be provided as
// `String` or `&'static str` as well.
let (start, end) = ("C", "F");
// Scanning can also work on an open range (e.g. `start..`).
let req = client.scan(start..end, REASONABLE_LIMIT);
let result: Vec<KvPair> = req.wait()?;
```
These functions also have batch variants which accept sets and return `Vec<_>`s of data. These offer considerably reduced network overhead and can result in dramatic performance increases under certain workloads.
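As a sketch of what that looks like (the method names and signatures here are assumptions based on the description above, not a confirmed API):
```rust
// Fetch several keys in one request: one network round trip instead of
// one per key.
let keys = vec!["k1".to_owned(), "k2".to_owned(), "k3".to_owned()];
let req = client.batch_get(keys);
let result: Vec<KvPair> = req.wait()?;
```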
For documented, tested examples of all functionality, check the documentation of `raw::Client` in the generated Rust documentation.
## Transactional key-value API {#transactional}
> The transactional API of the Rust client is incomplete. You can track the progress with [issue #15](https://github.com/tikv/client-rust/issues/15). For a complete implementation, you can try the [Go client](https://github.com/pingcap/tidb/tree/master/store/tikv).
Using a connected `transaction::Client` you can then begin a transaction:
```rust
let client = Client::new(config).wait()?;
let txn = client.begin();
```
Then it's possible to send commands like `get`, `set`, `delete`, or `scan`. Batch variants also exist.
```rust
// `Key` and `Value` wrap around `Vec<u8>` values.
// This means data need not be UTF-8.
let key = b"TiKV".to_vec();
let value = b"Astronaut".to_vec();
// This creates a future that must be resolved.
let req = txn.set(key.clone(), value.clone());
req.wait()?;
let req = txn.get(key.clone());
let result = req.wait()?;
// `Value` and `Key` deref to `Vec<u8>`.
// This means you should find them easy to work with.
assert_eq!(result, Some(value.into()));
let req = txn.delete(key.clone());
req.wait()?;
let result = txn.get(key).wait()?;
assert_eq!(result, None);
// For more detail on scanning, see the raw section above or the documentation.
let req = txn.scan("A".."B", 1000);
let result: Vec<KvPair> = req.wait()?;
```
Commit these changes when you're ready, or roll back if you prefer to abort the operation:
```rust
if all_is_good {
txn.commit()?;
} else {
txn.rollback()?;
}
```
## Beyond the Basics
At this point you're familiar with the basic functionality of TiKV. To begin integrating with TiKV you should explore the documentation of your favorite client from those we listed above.
For the Rust client, you can find the full documentation for the client (and all your dependencies) by running:
```bash
cargo doc --package tikv-client --open
```

View File

@ -0,0 +1,82 @@
---
title: Contribute
description: How to be a part of TiKV
menu:
nav:
parent: Docs
weight: 6
"3.1-beta":
parent: Reference
weight: 6
aliases:
- /docs/3.0/contribute/contribute-to-tikv/
---
As an open source project, TiKV cannot grow without the support and participation of contributors from the community. If you would like to contribute to the TiKV code, our documentation, or even the website, we would appreciate your help, and we are glad to provide support along the way.
## How to be a TiKV Contributor
Once a PR (Pull Request) you submit to one of the TiKV-related projects is approved and merged, you become a TiKV Contributor.
## Pick an area to contribute
You can choose one of the following areas to contribute to:
- [TiKV core](https://github.com/tikv/tikv)
This is where we host the core code base of the TiKV project, which is developed in Rust. See [Contribute to TiKV](https://github.com/tikv/tikv/blob/master/CONTRIBUTING.md) for details on how to make contributions to the TiKV core code base.
- [TiKV documentation](https://github.com/tikv/website/tree/master/content/docs)
We host our documentation within the [TiKV website](https://github.com/tikv/website) repository. The documentation is generated via the Hugo framework. See [Contribute to Docs](../docs) for detailed steps of contribution.
- Libraries
The TiKV team actively develops and maintains a number of libraries used in TiKV, which you may also be interested in:
- [rust-prometheus](https://github.com/pingcap/rust-prometheus): The Prometheus client for Rust, our metrics collecting and reporting library
- [rust-rocksdb](https://github.com/pingcap/rust-rocksdb): Our RocksDB binding and wrapper for Rust
- [raft-rs](https://github.com/pingcap/raft-rs): The Raft distributed consensus algorithm implemented in Rust
- [grpc-rs](https://github.com/pingcap/grpc-rs): The gRPC library for Rust built on the gRPC C Core library and Rust Futures
- [fail-rs](https://github.com/pingcap/fail-rs): Fail points for Rust
For details on how to contribute to the above dependent libraries of TiKV, refer to the **README** file in the corresponding repository.
- [TiKV Clients](https://github.com/tikv)
This is where we host TiKV clients in different languages, which are:
- [Go client](https://github.com/tikv/client-go)
- [Rust client](https://github.com/tikv/client-rust)
- [Java client](https://github.com/tikv/client-java)
- [C client](https://github.com/tikv/client-c)
The Go client came out the earliest and has evolved into a stable shape. The Rust and Java clients are not yet stable, and the C client is still in an early development phase, so now is a good time to get involved with the development of TiKV clients. See [Contribute to TiKV](https://github.com/tikv/tikv/blob/master/CONTRIBUTING.md) for details on how to make contributions to the TiKV clients.
- [RFCs](https://github.com/tikv/rfcs)
This is where the design process of a new feature starts. If you wish to contribute something major or substantial to TiKV, we would love to see that and will ask you to submit a Request for Change (RFC) to generate a consensus among the TiKV community. See the [README](https://github.com/tikv/rfcs/blob/master/README.md) file and existing RFCs for references on how to draft and submit an RFC.
- Meetups/Events
As an open source project, we are passionate about meetups and events, where the community gathers and shares. Besides the official events hosted by TiKV, we would love to see you organize or participate in an event or meetup. Showing up, giving a talk, or joining the discussions are all forms of contribution in our eyes, and we appreciate them. Let us know if you have any ideas.
## Find an issue to work on
For beginners, we have prepared many suitable tasks for you. You can check out, for example, our [Help Wanted issues](https://github.com/tikv/tikv/issues?q=is%3Aissue+is%3Aopen+label%3A%22S%3A+HelpWanted%22) in the TiKV repository.
See below for some commonly used labels across major repositories listed in [Pick an area to contribute](#pick-an-area-to-contribute):
- **`bug`**: Something is wrong; the scope can be small or big
- **`good first issue`**: An ideal first issue for beginners to work on, with mentoring available
- **`help wanted`**: Help wanted; contributions are very welcome!
- **`discussion`**: Under discussion, or discussion is needed
- **`enhancement`**: A new feature or request
- **`question`**: Further information is requested, or a question needs answering
## Ask a question
{{< info >}}
If you need any help or mentoring getting started, understanding the codebase, or making a PR (or anything else really), please ask on [Slack](https://join.slack.com/t/tikv-wg/shared_invite/enQtNTUyODE4ODU2MzI0LTgzZDQ3NzZlNDkzMGIyYjU1MTA0NzIwMjFjODFiZjA0YjFmYmQyOTZiNzNkNzg1N2U1MDdlZTIxNTU5NWNhNjk), or [WeChat](https://github.com/tikv/tikv#wechat).
{{< /info >}}

View File

@ -0,0 +1,169 @@
---
title: FAQ
description: FAQs about TiKV
menu:
nav:
parent: Docs
weight: 5
"3.1-beta":
parent: Reference
weight: 4
aliases:
- /docs/3.0/faq/faq/
---
## What is TiKV?
TiKV is a distributed key-value database that features geo-replication, horizontal scalability, consistent distributed transactions, and coprocessor support.
## How do TiDB and TiKV work together? What is the relationship between the two?
TiDB works as the SQL layer and TiKV works as the key-value layer. TiDB brings SQL to TiKV, turning the pair into a NewSQL database. Working together, TiDB and TiKV are as scalable as a NoSQL database while maintaining the ACID transactions of a relational database.
## Why do you have separate layers?
Inspired by Google F1 and Spanner, TiDB and TiKV adopt a highly layered architecture. This architecture supports pluggable storage drivers and engines, which lets you customize your database solution based on your own business requirements. Meanwhile, this architecture makes the system easy to debug, update, tune, and maintain: you won't have to go through the whole system just to find and fix a bug in one module.
## How do I run TiKV?
See [Quick Start](../../tasks/try) to deploy a TiKV testing cluster, or [Deploy TiKV](../../tasks/deploy/introduction) to deploy TiKV in production.
## When to use TiKV?
TiKV is at your service if your applications require:
* Horizontal scalability (including writes)
* Strong consistency
* Support for distributed ACID transactions
## When is TiKV inappropriate?
TiKV is not yet suited to workloads that require very low-latency reads and writes.
## How does TiKV scale?
Grow TiKV as your business grows. You can increase the capacity simply by adding more machines. You can run TiKV across physical, virtual, container, and cloud environments.
PD ([Placement Driver](https://github.com/pingcap/pd)) periodically checks replication constraints and balances the load, and it handles data movement automatically. When PD notices that the load is too high, it will rebalance data.
## How is TiKV highly available?
TiKV is self-healing. With its strong consistency guarantee, your data can be recovered automatically, whether it's a machine failure or even downtime of an entire data center.
## How is TiKV strongly-consistent?
Strong consistency means all replicas return the same value when queried for the attribute of an object. TiKV uses the [Raft consensus algorithm](https://raft.github.io/) to ensure consistency among multiple replicas. TiKV allows a collection of machines to work as a coherent group that can survive the failures of some of its members.
## Does TiKV support distributed transactions?
Yes. The transaction model in TiKV is inspired by Google's Percolator, a paper published in 2010. It's mainly a two-phase commit protocol with some practical optimizations. This model relies on a timestamp allocator to assign a monotonically increasing timestamp to each transaction, so that conflicts can be detected.
## Does TiKV have ACID semantics?
Yes. ACID semantics are guaranteed in TiKV:
* Atomicity: Each transaction in TiKV is "all or nothing": if one part of the transaction fails, then the entire transaction fails, and the database state is left unchanged. TiKV guarantees atomicity in each and every situation, including power failures, errors, and crashes.
* Consistency: TiKV ensures that any transaction brings the database from one valid state to another. Any data written to the TiKV database must be valid according to all defined rules.
* Isolation: TiKV provides snapshot isolation (SI), snapshot isolation with lock, and externally consistent reads and writes in distributed transactions.
* Durability: TiKV allows a collection of machines to work as a coherent group that can survive the failures of some of its members. So in TiKV, once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
## How are transactions in TiKV lock-free?
TiKV has an optimistic transaction model, which means the client buffers all the writes within a transaction, and when the client calls the commit function, the writes are packed and sent to the servers. If there is no conflict, the writes (key-value pairs with a specific version) are written into the database and can be read by other transactions.
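Using the Rust client's transactional API from the [Clients](../clients/rust) reference, the flow looks roughly like the sketch below, which assumes a connected `transaction::Client` named `client`:
```rust
let txn = client.begin();
// Writes land in the transaction's local buffer; the servers see nothing yet.
txn.set("account:1", "90").wait()?;
txn.set("account:2", "110").wait()?;
// On commit, the buffered writes are packed and sent together. If a
// conflicting write committed first, the commit fails rather than blocking.
txn.commit().wait()?;
```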
## Can I use TiKV as a key-value store?
Yes. That's what TiKV is.
## How does TiKV compare to NoSQL databases like Cassandra, HBase, or MongoDB?
TiKV is as scalable as NoSQL databases like to advertise, while including features like externally consistent distributed transactions and good support for stateless query layers such as TiSpark (Spark), Titan (Redis), or TiDB (MySQL).
## What is the recommended number of replicas in the TiKV cluster? Is it better to keep the minimum number for high availability?
3 replicas for each Region are sufficient for a testing environment. However, you should never operate a TiKV cluster with fewer than 3 nodes in a production scenario. Depending on infrastructure, workload, and resiliency needs, you may wish to increase this number.
## If a node is down, will the service be affected? How long?
TiKV uses Raft to synchronize data among multiple replicas (by default, 3 replicas for each Region). If one replica goes wrong, the other replicas can guarantee data safety. Based on the Raft protocol, if the leader fails when its node goes down, a follower on another node is soon elected as the Region leader, after a maximum of 2 * lease time (the lease time is 10 seconds).
## Is a table's key Range divided before data access?
No. This differs from the table-splitting rules of MySQL. In TiKV, a table's Range is split dynamically based on the size of the Region.
## How does Region split?
A Region is not divided in advance; it follows a Region split mechanism. When a Region's size exceeds the value of the `region-split-size` or `region-split-keys` parameter, a split is triggered. After the split, the information is reported to PD.
## What are the features of TiKV block cache?
TiKV implements the Column Family (CF) feature of RocksDB. By default, the KV data is eventually stored in the 3 CFs (default, write and lock) within RocksDB.
- The default CF stores real data and the corresponding parameter is in `[rocksdb.defaultcf]`. The write CF stores the data version information (MVCC) and index-related data, and the corresponding parameter is in `[rocksdb.writecf]`. The lock CF stores the lock information and the system uses the default parameter.
- The Raft RocksDB instance stores Raft logs. The default CF mainly stores Raft logs and the corresponding parameter is in `[raftdb.defaultcf]`.
- All CFs share a block-cache that caches data blocks and improves RocksDB read speed. The size of the block-cache is controlled by the `block-cache-size` parameter. A larger value means more hot data can be cached, which favors read operations, but it also consumes more system memory.
- Each CF has an individual write-buffer and the size is controlled by the `write-buffer-size` parameter.
## Which TiKV scenarios take up high I/O, memory, and CPU, or exceed the configured resource limits?
Writing or reading a large volume of data in TiKV takes up high I/O, memory, and CPU. Executing very complex queries, such as ones that generate large intermediate result sets, costs a lot of memory and CPU resources.
## Does TiKV have the `innodb_flush_log_trx_commit` parameter like MySQL, to guarantee the security of data?
Yes. Currently, the standalone storage engine uses two RocksDB instances. One instance is used to store the raft-log. When the `sync-log` parameter in TiKV is set to true, each commit is mandatorily flushed to the raft-log. If a crash occurs, you can restore the KV data using the raft-log.
## What is the recommended server configuration for WAL storage, such as SSD, RAID level, cache strategy of RAID card, NUMA configuration, file system, I/O scheduling strategy of the operating system?
WAL writes are sequential, and currently we don't have a separate configuration for it. The recommended configuration is as follows:
- SSD
- RAID 10 preferred
- Cache strategy of RAID card and I/O scheduling strategy of the operating system: currently no specific best practices; you can use the default configuration in Linux 7 or later
- NUMA: no specific suggestion; for memory allocation strategy, you can use `interleave = all`
- File system: ext4
## Can Raft + multiple replicas in the TiKV architecture achieve absolute data safety? Is it necessary to apply the most strict mode (`sync-log = true`) to a standalone storage?
Data is redundantly replicated between TiKV nodes using the [Raft consensus algorithm](https://raft.github.io/) to ensure recoverability should a node failure occur. Only when the data has been written into more than 50% of the replicas will the application return ACK (two out of three nodes). However, theoretically, two nodes might crash. Therefore, except for scenarios with less strict requirement on data security but extreme requirement on performance, it is strongly recommended that you enable the sync-log mode.
As an alternative to using `sync-log`, you may also consider having five replicas instead of three in your Raft group. This would allow for the failure of two replicas, while still providing data safety.
For a standalone TiKV node, it is still recommended to enable the sync-log mode. Otherwise, the last write might be lost in case of a node failure.
## Why does TiKV frequently switch Region leader?
- The leader cannot reach its followers, e.g., due to a network problem or node failure.
- PD is balancing leaders, e.g., transferring leaders away from a hotspot node to other nodes.
## The `cluster ID mismatch` message is displayed when starting TiKV.
This is because the cluster ID stored in local TiKV is different from the cluster ID specified by PD. When a new PD cluster is deployed, PD generates random cluster IDs. TiKV gets the cluster ID from PD and stores the cluster ID locally when it is initialized. The next time when TiKV is started, it checks the local cluster ID with the cluster ID in PD. If the cluster IDs don't match, the `cluster ID mismatch` message is displayed and TiKV exits.
If you previously deployed a PD cluster, then removed the PD data and deployed a new PD cluster, this error occurs because TiKV uses the old data to connect to the new PD cluster.
## The `duplicated store address` message is displayed when starting TiKV.
This is because the address in the startup parameter has been registered in the PD cluster by another TiKV instance. This error occurs when there is no data folder under the directory that TiKV's `--store` specifies, but you restart TiKV with the previous parameters.
To solve this problem, use the [`store delete`](https://github.com/pingcap/pd/tree/55db505e8f35e8ab4e00efd202beb27a8ecc40fb/tools/pd-ctl#store-delete--label--weight-store_id----jqquery-string) function to delete the previous store and then restart TiKV.
## TiKV master and slave use the same compression algorithm. Why are the results different?
Currently, some files of TiKV master have a higher compression rate, which depends on the underlying data distribution and the RocksDB implementation. It is normal for the data size to fluctuate occasionally. The underlying storage engine adjusts data as needed.
## What causes the "TiKV channel full" error?
- The Raftstore thread is too slow or blocked by I/O. You can view the CPU usage status of Raftstore.
- TiKV is too busy (CPU, disk I/O, etc.) and cannot keep up with requests.
## What is the write performance in the strictest data-safety mode (`sync-log = true`)?
Generally, enabling `sync-log` reduces about 30% of the performance. For write performance when `sync-log` is set to `false`, see [Performance test result for TiDB using Sysbench](https://github.com/pingcap/docs/blob/master/v3.0/benchmark/sysbench-v4.md).
## Why does `IO error: No space left on device While appending to file` occur?
This is because the disk space is not enough. You need to add nodes or enlarge the disk space.
## Why does the OOM (Out of Memory) error occur frequently in TiKV?
The memory usage of TiKV mainly comes from the block-cache of RocksDB, which is 40% of the system memory size by default. When the OOM error occurs frequently in TiKV, you should check whether the value of `block-cache-size` is set too high. In addition, when multiple TiKV instances are deployed on a single machine, you need to explicitly configure the parameter to prevent multiple instances from using too much system memory that results in the OOM error.

View File

@ -0,0 +1,13 @@
---
title: Reference
description: Details about TiKV
menu:
"3.1-beta":
name: Reference
weight: 3
nav:
parent: Docs
weight: 3
---
This section includes instructions on using TiKV clients and tools.

View File

@ -0,0 +1,16 @@
---
title: Query Layers
description: Extend TiKV using stateless query layers
menu:
"3.1-beta":
parent: Reference
weight: 3
---
There are several projects which harness TiKV to power their storage:
* [TiDB](https://github.com/pingcap/tidb) (MySQL)
* [TiPrometheus](https://github.com/bragfoo/TiPrometheus) (Prometheus)
* [Titan](https://github.com/distributedio/titan) (Redis)
* [Tidis](https://github.com/yongman/tidis) (Redis)
* [Titea](https://github.com/gengmei-tech/titea) (Redis)

View File

@ -0,0 +1,18 @@
---
title: Tools
description: Tools which can be used to administrate TiKV
menu:
"3.1-beta":
parent: Reference
weight: 3
---
There are a number of components and tools involved in maintaining a TiKV deployment.
You can browse documentation on:
* [`tikv-server`](../tikv-server): The TiKV service stores data and serves client requests.
* [`tikv-ctl`](../tikv-ctl): The control plane tool for managing TiKV, both online or offline.
* [`pd-server`](../pd-server): The PD service manages cluster metadata and transaction timestamps.
* [`pd-ctl`](../pd-ctl): The control plane tool for managing PD.
* [`pd-recover`](../pd-recover): A disaster recovery tool for PD.

View File

@ -0,0 +1,932 @@
---
title: pd-ctl
description: Learn about interacting with pd-ctl
menu:
"3.1-beta":
parent: Tools
weight: 4
---
PD Control is a command-line tool for PD that obtains the state information of the cluster and tunes the cluster.
## Source code compiling
1. Install [Go](https://golang.org/) version 1.11 or later.
2. In the root directory of the [PD project](https://github.com/pingcap/pd), use the `make` command to compile and generate `bin/pd-ctl`.
> **Note:** Generally, you do not need to compile the source code, because the PD Control tool is already included in the released binaries and Docker images. Developers can use the `make` command to compile the source code.
## Usage
Single-command mode:
```bash
./pd-ctl store -u http://127.0.0.1:2379
```
Interactive mode:
```bash
./pd-ctl -u http://127.0.0.1:2379
```
Use environment variables:
```bash
export PD_ADDR=http://127.0.0.1:2379
./pd-ctl
```
Use TLS to encrypt:
```bash
./pd-ctl -u https://127.0.0.1:2379 --cacert="path/to/ca" --cert="path/to/cert" --key="path/to/key"
```
## Command line flags
### \-\-pd,-u
+ PD address
+ Default address: http://127.0.0.1:2379
+ Environment variable: PD_ADDR
### \-\-detach,-d
+ Use single command line mode (not entering readline)
+ Default: true
### --cacert
+ Specify the path to the certificate file of the trusted CA in PEM format
+ Default: ""
### --cert
+ Specify the path to the certificate of SSL in PEM format
+ Default: ""
### --key
+ Specify the path to the certificate key file of SSL in PEM format, which is the private key of the certificate specified by `--cert`
+ Default: ""
### --version,-V
+ Print the version information and exit
+ Default: false
## Command
### `cluster`
Use this command to view the basic information of the cluster.
Usage:
```bash
>> cluster // To show the cluster information
{
"id": 6493707687106161130,
"max_peer_count": 3
}
```
### `config [show | set <option> <value>]`
Use this command to view or modify the configuration information.
Usage:
```bash
>> ./bin/pd-ctl config show all // Display all config information of the scheduler
{
"client-urls": "http://127.0.0.1:2379",
"peer-urls": "http://127.0.0.1:2380",
"advertise-client-urls": "http://127.0.0.1:2379",
"advertise-peer-urls": "http://127.0.0.1:2380",
"name": "pd-chenshunings-MacBook-Pro.local",
"data-dir": "default.pd-chenshunings-MacBook-Pro.local",
"force-new-cluster": false,
"initial-cluster": "pd-chenshunings-MacBook-Pro.local=http://127.0.0.1:2380",
"initial-cluster-state": "new",
"join": "",
"lease": 3,
"log": {
"level": "",
"format": "text",
"disable-timestamp": false,
"file": {
"filename": "",
"log-rotate": true,
"max-size": 0,
"max-days": 0,
"max-backups": 0
},
"development": false,
"disable-caller": false,
"disable-stacktrace": false,
"sampling": null
},
"log-file": "",
"log-level": "",
"tso-save-interval": "3s",
"metric": {
"job": "pd-chenshunings-MacBook-Pro.local",
"address": "",
"interval": "15s"
},
"schedule": {
"max-snapshot-count": 3,
"max-pending-peer-count": 16,
"max-merge-region-size": 20,
"max-merge-region-keys": 200000,
"split-merge-interval": "1h0m0s",
"enable-one-way-merge": false,
"patrol-region-interval": "100ms",
"max-store-down-time": "30m0s",
"leader-schedule-limit": 8,
"region-schedule-limit": 1024,
"replica-schedule-limit": 1024,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 2,
"hot-region-cache-hits-threshold": 3,
"store-balance-rate": 1,
"tolerant-size-ratio": 0,
"low-space-ratio": 0.8,
"high-space-ratio": 0.6,
"disable-raft-learner": "false",
"disable-remove-down-replica": "false",
"disable-replace-offline-replica": "false",
"disable-make-up-replica": "false",
"disable-remove-extra-replica": "false",
"disable-location-replacement": "false",
"disable-namespace-relocation": "false",
"schedulers-v2": [
{
"type": "balance-region",
"args": null,
"disable": false
},
{
"type": "balance-leader",
"args": null,
"disable": false
},
{
"type": "hot-region",
"args": null,
"disable": false
},
{
"type": "label",
"args": null,
"disable": false
}
]
},
"replication": {
"max-replicas": 3,
"location-labels": "",
"strictly-match-label": "false"
},
"namespace": {},
"pd-server": {
"use-region-storage": "true"
},
"cluster-version": "0.0.0",
"quota-backend-bytes": "0 B",
"auto-compaction-mode": "periodic",
"auto-compaction-retention-v2": "1h",
"TickInterval": "500ms",
"ElectionInterval": "3s",
"PreVote": true,
"security": {
"cacert-path": "",
"cert-path": "",
"key-path": ""
},
"label-property": {},
"WarningMsgs": null,
"namespace-classifier": "table",
"LeaderPriorityCheckInterval": "1m0s"
}
>> config show // Display default config information
>> config show namespace ts1 // Display the config information of the namespace named ts1
{
"leader-schedule-limit": 4,
"region-schedule-limit": 4,
"replica-schedule-limit": 8,
"merge-schedule-limit": 8,
"max-replicas": 3,
}
>> config show replication // Display the config information of replication
{
"max-replicas": 3,
"location-labels": "zone,host"
}
>> config show cluster-version // Display the current version of the cluster, which is the current minimum version of TiKV nodes in the cluster and does not correspond to the binary version.
"2.0.0"
```
- `max-snapshot-count` controls the maximum number of snapshots that a single store receives or sends out at the same time. The scheduler is restricted by this configuration to avoid taking up normal application resources. When you need to improve the speed of adding replicas or balancing, increase this value.
```bash
>> config set max-snapshot-count 16 // Set the maximum number of snapshots to 16
```
- `max-pending-peer-count` controls the maximum number of pending peers in a single store. The scheduler is restricted by this configuration to avoid producing a large number of Regions without the latest log in some nodes. When you need to improve the speed of adding replicas or balancing, increase this value. Setting it to 0 indicates no limit.
```bash
>> config set max-pending-peer-count 64 // Set the maximum number of pending peers to 64
```
- `max-merge-region-size` controls the upper limit on the size of Region Merge (the unit is M). When `regionSize` exceeds the specified value, PD does not merge it with the adjacent Region. Setting it to 0 indicates disabling Region Merge. The default value is 20.
```bash
>> config set max-merge-region-size 16 // Set the upper limit on the size of Region Merge to 16M
```
- `max-merge-region-keys` controls the upper limit on the key count of Region Merge. When `regionKeyCount` exceeds the specified value, PD does not merge it with the adjacent Region. The default value is 200000.
```bash
>> config set max-merge-region-keys 50000 // Set the upper limit on KeyCount to 50000
```
- `split-merge-interval` controls the interval between the `split` and `merge` operations on the same Region. This means:
- Newly split Regions won't be merged within the specified period of time.
- Region Merge won't happen within the specified period of time after PD starts or restarts.
The default value is 1h.
```bash
>> config set split-merge-interval 24h // Set the interval between `split` and `merge` to one day
```
- `patrol-region-interval` controls the execution frequency that `replicaChecker` checks the health status of Regions. A shorter interval indicates a higher execution frequency. Generally, you do not need to adjust it.
```bash
>> config set patrol-region-interval 10ms // Set the execution frequency of replicaChecker to 10ms
```
- `max-store-down-time` controls the time after which PD decides that a disconnected store can no longer be recovered. If PD does not receive heartbeats from a store within the specified period of time, PD adds replicas on other nodes.
```bash
>> config set max-store-down-time 30m // Set the time within which PD receives no heartbeats and after which PD starts to add replicas to 30 minutes
```
- `leader-schedule-limit` controls the number of tasks scheduling the leader at the same time. This value affects the speed of leader balance: a larger value means a higher speed, and setting the value to 0 disables this scheduling. Leader scheduling usually has a small load, so you can increase the value when needed.
```bash
>> config set leader-schedule-limit 4 // 4 tasks of leader scheduling at the same time at most
```
- `region-schedule-limit` controls the number of tasks scheduling the Region at the same time. This value affects the speed of Region balance: a larger value means a higher speed, and setting the value to 0 disables this scheduling. Region scheduling usually has a large load, so do not set the value too high.
```bash
>> config set region-schedule-limit 2 // 2 tasks of Region scheduling at the same time at most
```
- `replica-schedule-limit` controls the number of replica scheduling tasks running at the same time. This value affects the scheduling speed when a node is down or removed. A larger value means a higher speed, and setting the value to 0 disables this scheduling. Usually replica scheduling has a large load, so do not set an overly large value.
```bash
>> config set replica-schedule-limit 4 // 4 tasks of replica scheduling at the same time at most
```
- `merge-schedule-limit` controls the number of Region Merge scheduling tasks running at the same time. Setting the value to 0 disables Region Merge. Usually Merge scheduling has a large load, so do not set an overly large value. The default value is 8.
```bash
>> config set merge-schedule-limit 16 // 16 tasks of Merge scheduling at the same time at most
```
The configuration above is global. You can also tune the configuration by configuring different namespaces. The global configuration is used if the corresponding configuration of the namespace is not set.
> **Note:** The configuration of the namespace only supports editing `leader-schedule-limit`, `region-schedule-limit`, `replica-schedule-limit` and `max-replicas`.
```bash
>> config set namespace ts1 leader-schedule-limit 4 // 4 tasks of leader scheduling at the same time at most for the namespace named ts1
>> config set namespace ts2 region-schedule-limit 2 // 2 tasks of region scheduling at the same time at most for the namespace named ts2
```
- `tolerant-size-ratio` controls the size of the balance buffer area. When the difference between the leader or Region scores of two stores is less than the specified multiple of the Region size, PD considers the two stores balanced.
```bash
>> config set tolerant-size-ratio 20 // Set the size of the buffer area to about 20 times of the average regionSize
```
- `low-space-ratio` controls the threshold at which a store's space is considered insufficient. When the ratio of the space occupied by the node exceeds the specified value, PD avoids migrating data to that node as much as possible. At the same time, PD schedules based mainly on the remaining space to avoid exhausting the disk space of the node.
```bash
config set low-space-ratio 0.9 // Set the threshold value of insufficient space to 0.9
```
- `high-space-ratio` controls the threshold below which a store's space is considered sufficient. When the ratio of the space occupied by the node is less than the specified value, PD ignores the remaining space and schedules based mainly on the actual data volume.
```bash
config set high-space-ratio 0.5 // Set the threshold value of sufficient space to 0.5
```
- `disable-raft-learner` is used to disable Raft learner. By default, PD uses Raft learner when adding replicas to reduce the risk of unavailability due to downtime or network failure.
```bash
config set disable-raft-learner true // Disable Raft learner
```
- `cluster-version` is the version of the cluster, which is used to enable or disable some features and to deal with the compatibility issues. By default, it is the minimum version of all normally running TiKV nodes in the cluster. You can set it manually only when you need to roll it back to an earlier version.
```bash
config set cluster-version 1.0.8 // Set the version of the cluster to 1.0.8
```
- `disable-remove-down-replica` is used to disable the feature of automatically deleting DownReplica. When you set it to `true`, PD does not automatically clean up the downtime replicas.
- `disable-replace-offline-replica` is used to disable the feature of migrating OfflineReplica. When you set it to `true`, PD does not migrate the offline replicas.
- `disable-make-up-replica` is used to disable the feature of making up replicas. When you set it to `true`, PD does not add replicas for Regions without sufficient replicas.
- `disable-remove-extra-replica` is used to disable the feature of removing extra replicas. When you set it to `true`, PD does not remove extra replicas for Regions with redundant replicas.
- `disable-location-replacement` is used to disable the isolation level check. When you set it to `true`, PD does not improve the isolation level of Region replicas by scheduling.
- `disable-namespace-relocation` is used to disable Region relocation to the stores of its namespace. When you set it to `true`, PD does not move Regions to the stores of the namespace they belong to.
- `use-region-storage` is used to enable or disable Region metadata storage in PD. When you set it to `true`, the metadata of PD Regions is saved to the Region Meta-Storage, a separate storage engine. This solves the potential performance issue with BoltDB (the backend of etcd) that may occur when the stored metadata reaches the GB level. Metadata is synchronized across multiple PD servers, and eventual consistency is guaranteed through Raft. The default value is `true`. These boolean switches all follow the same `config set` pattern, as shown in the example after the note below.
> **Note:**
>
> This feature is introduced in TiKV 3.0.
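Like `disable-raft-learner` above, these boolean switches are set with `config set`. For example:
```bash
config set disable-remove-down-replica true // Disable automatic cleanup of Down replicas
config set use-region-storage false // Fall back to storing Region metadata in etcd
```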
### `config delete namespace <name> [<option>]`
Use this command to delete the configuration of a namespace.
Usage:
After you configure the namespace, if you want it to continue to use global configuration, delete the configuration information of the namespace using the following command:
```bash
>> config delete namespace ts1 // Delete the configuration of the namespace named ts1
```
If you want to use global configuration only for a certain configuration of the namespace, use the following command:
```bash
>> config delete namespace region-schedule-limit ts2 // Delete the region-schedule-limit configuration of the namespace named ts2
```
### `health`
Use this command to view the health information of the cluster.
Usage:
```bash
>> health // Display the health information
[
{
"name": "pd",
"member_id": 13195394291058371180,
"client_urls": [
"http://127.0.0.1:2379"
],
"health": true
},
...
]
```
### `hot [read | write | store]`
Use this command to view the hot spot information of the cluster.
Usage:
```bash
>> hot read // Display hot spot for the read operation
>> hot write // Display hot spot for the write operation
>> hot store // Display hot spot for all the read and write operations
```
### `label [store <name> <value>]`
Use this command to view the label information of the cluster.
Usage:
```bash
>> label // Display all labels
>> label store zone cn // Display all stores with the "zone":"cn" label
```
### `member [delete | leader_priority | leader [show | resign | transfer <member_name>]]`
Use this command to view the PD members, remove a specified member, or configure the leader priority.
Usage:
```bash
>> member // Display the information of all members
{
"header": {
"cluster_id": 6493707687106161130
},
"members": [
{
"name": "pd1",
"member_id": 9724873857558226554,
"peer_urls": [
"http://127.0.0.1:2380"
],
"client_urls": [
"http://127.0.0.1:2379"
]
},
...
],
"leader": {...},
"etcd_leader": {...}
}
>> member delete name pd2 // Delete "pd2"
Success!
>> member delete id 1319539429105371180 // Delete a node using id
Success!
>> member leader show // Display the leader information
{
"name": "pd",
"addr": "http://192.168.199.229:2379",
"id": 9724873857558226554
}
>> member leader resign // Move leader away from the current member
Success!
>> member leader transfer pd3 // Migrate leader to a specified member
Success!
```
### `operator [show | add | remove]`
Use this command to view and control the scheduling operation, split a Region, or merge Regions.
Usage:
```bash
>> operator show // Display all operators
>> operator show admin // Display all admin operators
>> operator show leader // Display all leader operators
>> operator show region // Display all Region operators
>> operator add add-peer 1 2 // Add a replica of Region 1 on store 2
>> operator add remove-peer 1 2 // Remove a replica of Region 1 on store 2
>> operator add transfer-leader 1 2 // Schedule the leader of Region 1 to store 2
>> operator add transfer-region 1 2 3 4 // Schedule Region 1 to stores 2,3,4
>> operator add transfer-peer 1 2 3 // Schedule the replica of Region 1 on store 2 to store 3
>> operator add merge-region 1 2 // Merge Region 1 with Region 2
>> operator add split-region 1 --policy=approximate // Split Region 1 into two halves, based on an approximate estimate
>> operator add split-region 1 --policy=scan // Split Region 1 into two halves, based on an accurate scan
>> operator remove 1 // Remove the scheduling operation of Region 1
```
The splitting of Regions starts from the position as close as possible to the middle. You can locate this position using two strategies, namely "scan" and "approximate". The difference between them is that the former determines the middle key by scanning the Region, and the latter obtains the approximate position by checking the statistics recorded in the SST file. Generally, the former is more accurate, while the latter consumes less I/O and can be completed faster.
### `ping`
Use this command to view how long a `ping` to PD takes.
Usage:
```bash
>> ping
time: 43.12698ms
```
### `region <region_id> [--jq="<query string>"]`
Use this command to view the region information. For a jq formatted output, see [jq-formatted-json-output-usage](#jq-formatted-json-output-usage).
Usage:
```bash
>> region // Display the information of all regions
{
"count": 10,
"regions": [
{
"id": 4,
"start_key": "",
"end_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
"epoch": {...},
"peers": [...],
"leader": {...},
"written_bytes": 251302,
"read_bytes": 60472,
"approximate_size": 1,
"approximate_keys": 367
},
...
]
}
>> region 2 // Display the information of the region with the id of 2
{
"id": 2,
"start_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x1d\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
"end_key": "",
"epoch": {...},
"peers": [...],
"leader": {...},
"written_bytes": 251302,
"read_bytes": 60472,
"approximate_size": 96,
"approximate_keys": 524442
}
```
### `region key [--format=raw|encode] <key>`
Use this command to query the Region that a specific key resides in. It supports the raw and encoding formats. When the key is in the encoding format, you need to enclose it in single quotes.
Raw format usage (default):
```bash
>> region key abc
>> region key xyz
{
"id": 2,
"start_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x1d\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
"end_key": "",
"epoch": {...},
"peers": [...],
"leader": {...},
"written_bytes": 251302,
"read_bytes": 60472,
"approximate_size": 96,
"approximate_keys": 524442
}
```
Encoding format usage:
```bash
>> region key --format=encode 't\200\000\000\000\000\000\000\377\035_r\200\000\000\000\000\377\017U\320\000\000\000\000\000\372'
{
"region": {
"id": 2,
...
}
}
```
### `region sibling <region_id>`
Use this command to check the adjacent Regions of a specific Region.
Usage:
```bash
>> region sibling 28
[
{
"id": 26,
"start_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x19\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
"end_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x1b\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
...
},
{
"id": 2,
"start_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\x1d\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
"end_key": "",
...
}
]
```
### `region store <store_id>`
Use this command to list all Regions of a specific store.
Usage:
```bash
>> region store 1
{
"count": 10,
"regions": [
{
"id": 8,
"start_key": "t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf8",
...
},
{
...
},
...
]
}
```
### `region topread [limit]`
Use this command to list Regions with top read flow. The default value of the limit is 16.
Usage:
```bash
>> region topread
{
"count": 16,
"regions": [...]
}
```
### `region topwrite [limit]`
Use this command to list Regions with top write flow. The default value of the limit is 16.
Usage:
```bash
>> region topwrite
{
"count": 16,
"regions": [...]
}
```
### `region topconfver [limit]`
Use this command to list Regions with top conf version. The default value of the limit is 16.
Usage:
```bash
>> region topconfver
{
"count": 16,
"regions": [...]
}
```
### `region topversion [limit]`
Use this command to list Regions with top version. The default value of the limit is 16.
Usage:
```bash
>> region topversion
{
"count": 16,
"regions": [...]
}
```
### `region topsize [limit]`
Use this command to list Regions with top approximate size. The default value of the limit is 16.
Usage:
```bash
>> region topsize
{
"count": 16,
"regions": [...]
}
```
### `region check [miss-peer | extra-peer | down-peer | pending-peer | incorrect-ns]`
Use this command to check the Regions in abnormal conditions.
Description of various types:
- miss-peer: the Region without enough replicas
- extra-peer: the Region with extra replicas
- down-peer: the Region in which some replicas are Down
- pending-peer: the Region in which some replicas are Pending
- incorrect-ns: the Region in which some replicas deviate from the namespace constraints
Usage:
```bash
>> region check miss-peer
{
"count": 2,
"regions": [...]
}
```
### `scheduler [show | add | remove]`
Use this command to view and control the scheduling strategy.
Usage:
```bash
>> scheduler show // Display all schedulers
>> scheduler add grant-leader-scheduler 1 // Schedule all the leaders of the regions on store 1 to store 1
>> scheduler add evict-leader-scheduler 1 // Move all the region leaders on store 1 out
>> scheduler add shuffle-leader-scheduler // Randomly exchange the leader on different stores
>> scheduler add shuffle-region-scheduler // Randomly schedule Regions between different stores
>> scheduler remove grant-leader-scheduler-1 // Remove the corresponding scheduler
```
### `store [delete | label | weight] <store_id> [--jq="<query string>"]`
Use this command to view the store information or remove a specified store. For a jq formatted output, see [jq-formatted-json-output-usage](#jq-formatted-json-output-usage).
Usage:
```bash
>> store // Display information of all stores
{
"count": 3,
"stores": [...]
}
>> store 1 // Get the store with the store id of 1
{
"store": {
"id": 1,
"address": "127.0.0.1:20160",
"version": "2.1.0-rc.5",
"state_name": "Up"
},
"status": {
"capacity": "10 GiB",
"available": "10 GiB",
"leader_count": 14,
"leader_weight": 1,
"leader_score": 14,
"leader_size": 14,
"region_count": 14,
"region_weight": 1,
"region_score": 14,
"region_size": 14,
"start_ts": "2018-11-26T18:59:05+08:00",
"last_heartbeat_ts": "2018-11-26T19:38:41.335120555+08:00",
"uptime": "39m36.335120555s"
}
}
>> store delete 1 // Delete the store with the store id of 1
Success!
>> store label 1 zone cn // Set the value of the label with the "zone" key to "cn" for the store with the store id of 1
>> store weight 1 5 10 // Set the leader weight to 5 and region weight to 10 for the store with the store id of 1
```
### `store limit <store_id> <rate>`
Use this command to modify the upper limit of scheduling rate for a specific store. The tasks include operations such as adding a peer or a learner.
Example:
```bash
>> store limit 2 10 // Set the upper limit of scheduling speed for store 2 to be 10 scheduling tasks per minute.
```
### `stores show limit`
Use this command to view the upper limit of the scheduling rate for all stores, in the unit of tasks per minute.
Example:
```bash
» stores show limit // If store-balance-rate is set to 15, the corresponding rate for all stores should be 15.
{
"4": {
"rate": 15
},
"5": {
"rate": 15
},
...
}
```
### `stores set limit <rate>`
Use this command to set the maximum number of scheduling tasks for all stores, in the unit of tasks per minute. The tasks include operations such as adding a peer or a learner.
Example:
```bash
>> stores set limit 20 // Set the upper limit of scheduling speed for all stores to be 20 scheduling tasks per minute.
```
### `table_ns [create | add | remove | set_store | rm_store | set_meta | rm_meta]`
Use this command to view and manage the namespace information of tables.
Usage:
```bash
>> table_ns add ts1 1 // Add the table with the table id of 1 to the namespace named ts1
>> table_ns create ts1 // Add the namespace named ts1
>> table_ns remove ts1 1 // Remove the table with the table id of 1 from the namespace named ts1
>> table_ns rm_meta ts1 // Remove the metadata from the namespace named ts1
>> table_ns rm_store 1 ts1 // Remove the store with the store id of 1 from the namespace named ts1
>> table_ns set_meta ts1 // Add the metadata to namespace named ts1
>> table_ns set_store 1 ts1 // Add the store with the store id of 1 to the namespace named ts1
```
### `tso`
Use this command to parse the physical and logical time of TSO.
Usage:
```bash
>> tso 395181938313123110 // Parse TSO
system: 2017-10-09 05:50:59.507 +0800 CST
logic: 120102
```
## Jq formatted JSON output usage
### Simplify the output of `store`
```bash
» store --jq=".stores[].store | { id, address, state_name}"
{"id":1,"address":"127.0.0.1:20161","state_name":"Up"}
{"id":30,"address":"127.0.0.1:20162","state_name":"Up"}
...
```
### Query the remaining space of the node
```bash
» store --jq=".stores[] | {id: .store.id, available: .status.available}"
{"id":1,"available":"10 GiB"}
{"id":30,"available":"10 GiB"}
...
```
### Query the distribution status of the Region replicas
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id]}"
{"id":2,"peer_stores":[1,30,31]}
{"id":4,"peer_stores":[1,31,34]}
...
```
### Filter Regions according to the number of replicas
For example, to filter out all Regions whose number of replicas is not 3:
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}"
{"id":12,"peer_stores":[30,32]}
{"id":2,"peer_stores":[1,30,31,32]}
```
### Filter Regions according to the store ID of replicas
For example, to filter out all Regions that have a replica on store30:
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(any(.==30))}"
{"id":6,"peer_stores":[1,30,31]}
{"id":22,"peer_stores":[1,30,32]}
...
```
You can also find out all Regions that have a replica on store30 or store31 in the same way:
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(any(.==(30,31)))}"
{"id":16,"peer_stores":[1,30,34]}
{"id":28,"peer_stores":[1,30,32]}
{"id":12,"peer_stores":[30,32]}
...
```
### Look for relevant Regions when restoring data
For example, when [store1, store30, store31] are down and unavailable, you can find all Regions whose Down replicas outnumber their normal replicas:
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,30,31) then . else empty end) | length>=$total-length) }"
{"id":2,"peer_stores":[1,30,31,32]}
{"id":12,"peer_stores":[30,32]}
{"id":14,"peer_stores":[1,30,32]}
...
```
Or, when [store1, store30, store31] fail to start, you can find the Regions whose data can be safely removed manually on store1. To do so, filter out all Regions that have a replica on store1 but have no other DownPeers:
```bash
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length>1 and any(.==1) and all(.!=(30,31)))}"
{"id":24,"peer_stores":[1,32,33]}
```
When [store30, store31] are down, find all Regions that can be safely processed by creating the `remove-peer` operator, that is, Regions with one and only one DownPeer:
```bash
» region --jq=".regions[] | {id: .id, remove_peer: [.peers[].store_id] | select(length>1) | map(if .==(30,31) then . else empty end) | select(length==1)}"
{"id":12,"remove_peer":[30]}
{"id":4,"remove_peer":[31]}
{"id":22,"remove_peer":[30]}
...
```

View File

@ -0,0 +1,49 @@
---
title: pd-recover
description: Learn about interacting with pd-recover
menu:
"3.1-beta":
parent: Tools
weight: 5
---
PD Recover is a disaster recovery tool of PD, used to recover the PD cluster which cannot start or provide services normally.
## Source code compiling
1. Install [Go](https://golang.org/) version 1.11 or later
2. In the root directory of the [PD project](https://github.com/pingcap/pd), use the `make` command to compile and generate `bin/pd-recover`
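For example, a minimal sketch of those steps, assuming `git` and a suitable Go toolchain are installed:
```bash
git clone https://github.com/pingcap/pd.git
cd pd
make
ls bin/pd-recover # The compiled binary
```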
## Usage
This section describes how to recover a PD cluster which cannot start or provide services normally.
### Flags description
```
-alloc-id uint
Specify a number larger than the allocated ID of the original cluster
-cacert string
Specify the path to the trusted CA certificate file in PEM format
-cert string
Specify the path to the SSL certificate file in PEM format
-key string
Specify the path to the SSL certificate key file in PEM format, which is the private key of the certificate specified by `--cert`
-cluster-id uint
Specify the Cluster ID of the original cluster
-endpoints string
Specify the PD address (default: "http://127.0.0.1:2379")
```
### Recovery flow
1. Obtain the Cluster ID and the Alloc ID from the current cluster.
- Obtain the Cluster ID from the PD and TiKV logs.
- Obtain the allocated Alloc ID from either the PD log or the `Metadata Information` in the PD monitoring panel.
The number you specify for `alloc-id` must be larger than the current largest Alloc ID. If you fail to obtain the Alloc ID, you can estimate a larger number based on the number of Regions and stores in the cluster. Generally, you can specify a number that is several orders of magnitude larger.
2. Stop the whole cluster, clear the PD data directory, and restart the PD cluster.
3. Use PD Recover to recover and make sure that you use the correct `cluster-id` and appropriate `alloc-id`.
4. When the recovery success information is prompted, restart the whole cluster.
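For example, a sketch of step 3 using the flags described above; the cluster ID and alloc ID below are placeholders for the values obtained in step 1:
```bash
pd-recover -endpoints http://127.0.0.1:2379 -cluster-id <cluster_id> -alloc-id <alloc_id>
```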

View File

@ -0,0 +1,10 @@
---
title: pd-server
description: Learn about interacting with pd-server
menu:
"3.1-beta":
parent: Tools
weight: 3
---
You can explore `pd-server --help` and try `pd-server your sub command --help` to dig into functionality.

View File

@ -0,0 +1,296 @@
---
title: tikv-ctl
description: Learn about interacting with tikv-ctl
menu:
"3.1-beta":
parent: Tools
weight: 2
---
TiKV Control (`tikv-ctl`) is a command line tool of TiKV, used to manage the cluster.
When you compile TiKV, the `tikv-ctl` command is also compiled at the same time. If the cluster is deployed using Ansible, the `tikv-ctl` binary file exists in the corresponding `tidb-ansible/resources/bin` directory. If the cluster is deployed using the binary, the `tikv-ctl` file is in the `bin` directory together with other files such as `tidb-server`, `pd-server`, `tikv-server`, etc.
## General options
`tikv-ctl` provides two operation modes:
- Remote mode: use the `--host` option to accept the service address of TiKV as the argument
For this mode, if SSL is enabled in TiKV, `tikv-ctl` also needs to specify the related certificate file. For example:
```
$ tikv-ctl --ca-path ca.pem --cert-path client.pem --key-path client-key.pem --host 127.0.0.1:21060 <subcommands>
```
However, sometimes `tikv-ctl` communicates with PD instead of TiKV. In this case, you need to use the `--pd` option instead of `--host`. Here is an example:
```
$ tikv-ctl --pd 127.0.0.1:2379 compact-cluster
store:"127.0.0.1:20160" compact db:KV cf:default range:([], []) success!
```
- Local mode: Use the `--db` option to specify the local TiKV data directory path. In this mode, you need to stop the running TiKV instance.
Unless otherwise noted, all commands support both the remote mode and the local mode.
Additionally, `tikv-ctl` has two simple commands `--to-hex` and `--to-escaped`, which are used to make simple changes to the form of the key.
Generally, use the `escaped` form of the key. For example:
```bash
$ tikv-ctl --to-escaped 0xaaff
\252\377
$ tikv-ctl --to-hex "\252\377"
AAFF
```
> **Note:** When you specify the `escaped` form of the key in a command line, it is required to enclose it in double quotes. Otherwise, bash eats the backslash and a wrong result is returned.
## Subcommands, some options and flags
This section describes the subcommands that `tikv-ctl` supports in detail. Some subcommands support a lot of options. For all details, run `tikv-ctl --help <subcommand>`.
### View information of the Raft state machine
Use the `raft` subcommand to view the status of the Raft state machine at a specific moment. The status information includes two parts: three structs (**RegionLocalState**, **RaftLocalState**, and **RaftApplyState**) and the corresponding Entries of a certain piece of log.
Use the `region` and `log` subcommands to obtain the above information respectively. The two subcommands both support the remote mode and the local mode at the same time. Their usage and output are as follows:
```bash
$ tikv-ctl --host 127.0.0.1:21060 raft region -r 2
region id: 2
region state key: \001\003\000\000\000\000\000\000\000\002\001
region state: Some(region {id: 2 region_epoch {conf_ver: 3 version: 1} peers {id: 3 store_id: 1} peers {id: 5 store_id: 4} peers {id: 7 store_id: 6}})
raft state key: \001\002\000\000\000\000\000\000\000\002\002
raft state: Some(hard_state {term: 307 vote: 5 commit: 314617} last_index: 314617)
apply state key: \001\002\000\000\000\000\000\000\000\002\003
apply state: Some(applied_index: 314617 truncated_state {index: 313474 term: 151})
```
### View the Region size
Use the `size` command to view the Region size:
```bash
$ tikv-ctl --db /path/to/tikv/db size -r 2
region id: 2
cf default region size: 799.703 MB
cf write region size: 41.250 MB
cf lock region size: 27616
```
### Scan to view MVCC of a specific range
The `--from` and `--to` options of the `scan` command accept two escaped forms of raw key, and use the `--show-cf` flag to specify the column families that you need to view.
```bash
$ tikv-ctl --db /path/to/tikv/db scan --from 'zm' --limit 2 --show-cf lock,default,write
key: zmBootstr\377a\377pKey\000\000\377\000\000\373\000\000\000\000\000\377\000\000s\000\000\000\000\000\372
write cf value: start_ts: 399650102814441473 commit_ts: 399650102814441475 short_value: "20"
key: zmDB:29\000\000\377\000\374\000\000\000\000\000\000\377\000H\000\000\000\000\000\000\371
write cf value: start_ts: 399650105239273474 commit_ts: 399650105239273475 short_value: "\000\000\000\000\000\000\000\002"
write cf value: start_ts: 399650105199951882 commit_ts: 399650105213059076 short_value: "\000\000\000\000\000\000\000\001"
```
### View MVCC of a given key
Similar to the `scan` command, the `mvcc` command can be used to view MVCC of a given key.
```bash
$ tikv-ctl --db /path/to/tikv/db mvcc -k "zmDB:29\000\000\377\000\374\000\000\000\000\000\000\377\000H\000\000\000\000\000\000\371" --show-cf=lock,write,default
key: zmDB:29\000\000\377\000\374\000\000\000\000\000\000\377\000H\000\000\000\000\000\000\371
write cf value: start_ts: 399650105239273474 commit_ts: 399650105239273475 short_value: "\000\000\000\000\000\000\000\002"
write cf value: start_ts: 399650105199951882 commit_ts: 399650105213059076 short_value: "\000\000\000\000\000\000\000\001"
```
In this command, the key is also the escaped form of raw key.
### Scan raw keys
The `raw-scan` command scans directly from the RocksDB. Note that to scan data keys you need to add a `'z'` prefix to keys.
Use `--from` and `--to` options to specify the range to scan (unbounded by default). Use `--limit` to limit at most how
many keys to print out (30 by default). Use `--cf` to specify which cf to scan (can be `default`, `write` or `lock`).
```bash
$ ./tikv-ctl --db /var/lib/tikv/db/ raw-scan --from 'zt' --limit 2 --cf default
key: "zt\200\000\000\000\000\000\000\377\005_r\200\000\000\000\000\377\000\000\001\000\000\000\000\000\372\372b2,^\033\377\364", value: "\010\002\002\002%\010\004\002\010root\010\006\002\000\010\010\t\002\010\n\t\002\010\014\t\002\010\016\t\002\010\020\t\002\010\022\t\002\010\024\t\002\010\026\t\002\010\030\t\002\010\032\t\002\010\034\t\002\010\036\t\002\010 \t\002\010\\"\t\002\010$\t\002\010&\t\002\010(\t\002\010*\t\002\010,\t\002\010.\t\002\0100\t\002\0102\t\002\0104\t\002"
key: "zt\200\000\000\000\000\000\000\377\025_r\200\000\000\000\000\377\000\000\023\000\000\000\000\000\372\372b2,^\033\377\364", value: "\010\002\002&slow_query_log_file\010\004\002P/usr/local/mysql/data/localhost-slow.log"
Total scanned keys: 2
```
### Print a specific key value
To print the value of a key, use the `print` command.
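A hedged sketch of its usage; the exact flags can vary between versions, and it is assumed here that `-c` selects the column family and `-k` takes the escaped form of the key, as with the `mvcc` command in the section above:
```bash
$ tikv-ctl --db /path/to/tikv/db print -c default -k "zmDB:29\000\000\377\000\374\000\000\000\000\000\000\377\000H\000\000\000\000\000\000\371"
```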
### Print some properties about Region
In order to record Region state details, TiKV writes some statistics into the SST files of Regions. To view these properties, run `tikv-ctl` with the `region-properties` sub-command:
```bash
$ tikv-ctl --host localhost:20160 region-properties -r 2
num_files: 0
num_entries: 0
num_deletes: 0
mvcc.min_ts: 18446744073709551615
mvcc.max_ts: 0
mvcc.num_rows: 0
mvcc.num_puts: 0
mvcc.num_versions: 0
mvcc.max_row_versions: 0
middle_key_by_approximate_size:
```
The properties can be used to check whether the Region is healthy. If it is not, you can use them to fix the Region, for example, by splitting the Region manually at `middle_key_by_approximate_size`.
### Compact data of each TiKV manually
Use the `compact` command to manually compact data of each TiKV. If you specify the `--from` and `--to` options, their flags are also in the form of the escaped raw key. You can use the `-d` option to specify the RocksDB to compact; the optional values are `kv` and `raft`. The `--threads` option allows you to specify the concurrency of the compaction, and its default value is 8. Generally, a higher concurrency results in a faster compaction, but it might also affect the service. You need to choose an appropriate concurrency based on your scenario.
```bash
$ tikv-ctl --db /path/to/tikv/db compact -d kv
success!
```
### Compact data of the whole TiKV cluster manually
Use the `compact-cluster` command to manually compact data of the whole TiKV cluster. The flags of this command have the same meanings and usage as those of the `compact` command.
### Set a Region to tombstone
The `tombstone` command is usually used in circumstances where sync-log is not enabled and some data written to the Raft state machine is lost due to a power failure.
In a TiKV instance, you can use this command to set the status of some Regions to Tombstone. Then when you restart the instance, those Regions are skipped. Those Regions need to have enough healthy replicas in other TiKV instances to be able to continue writing and reading through the Raft mechanism.
```bash
pd-ctl>> operator add remove-peer <region_id> <peer_id>
$ tikv-ctl --db /path/to/tikv/db tombstone -p 127.0.0.1:2379 -r 2
success!
```
> **Note:**
>
> - This command only supports the local mode.
> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. The PD endpoints are queried to check whether the Region can safely be switched to Tombstone. Therefore, before setting a Region to Tombstone, you need to remove the corresponding Peer of this Region on the machine in `pd-ctl`.
### Send a `consistency-check` request to TiKV
Use the `consistency-check` command to execute a consistency check among replicas in the corresponding Raft of a specific Region. If the check fails, TiKV itself panics. If the TiKV instance specified by `--host` is not the Region leader, an error is reported.
```bash
$ tikv-ctl --host 127.0.0.1:21060 consistency-check -r 2
success!
$ tikv-ctl --host 127.0.0.1:21061 consistency-check -r 2
DebugClient::check_region_consistency: RpcFailure(RpcStatus { status: Unknown, details: Some("StringError(\"Leader is on store 1\")") })
```
> **Note:**
>
> - This command only supports the remote mode.
> - Even if this command returns `success!`, you need to check whether TiKV panics. This is because this command is only a proposal that requests a consistency check for the leader, and you cannot know from the client whether the whole check process is successful or not.
### Dump snapshot meta
This sub-command is used to parse a snapshot meta file at a given path and print the result.
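A hedged sketch; the `--file` flag name and the path below are assumptions for illustration:
```bash
$ tikv-ctl dump-snap-meta --file /path/to/snap.meta
```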
### Print the Regions where the Raft state machine corrupts
To avoid checking the Regions while TiKV is started, you can use the `tombstone` command to set the Regions where the Raft state machine reports an error to Tombstone. Before running this command, use the `bad-regions` command to find out the Regions with errors, so as to combine multiple tools for automated processing.
```bash
$ tikv-ctl --db /path/to/tikv/db bad-regions
all regions are healthy
```
If the command is successfully executed, it prints the above information. If the command fails, it prints a list of bad Regions. Currently, the errors that can be detected include mismatches between `last index`, `commit index`, and `apply index`, and the loss of Raft log. Other conditions, such as damaged snapshot files, still need further support.
### View Region properties
- To view in local the properties of Region 2 on the TiKV instance that is deployed in `/path/to/tikv`:
```bash
$ tikv-ctl --db /path/to/tikv/data/db region-properties -r 2
```
- To view online the properties of Region 2 on the TiKV instance that is running on `127.0.0.1:20160`:
```bash
$ tikv-ctl --host 127.0.0.1:20160 region-properties -r 2
```
### Modify the RocksDB configuration of TiKV dynamically
You can use the `modify-tikv-config` command to dynamically modify the configuration arguments. Currently, it only supports dynamically modifying RocksDB related arguments.
- `-m` is used to specify the target RocksDB. You can set it to `kvdb` or `raftdb`.
- `-n` is used to specify the configuration name.
You can refer to the arguments of `[rocksdb]` and `[raftdb]` (corresponding to `kvdb` and `raftdb`) in the [TiKV configuration template](https://github.com/tikv/tikv/blob/master/etc/config-template.toml#L213-L500).
You can use `default|write|lock + . + argument name` to specify the configuration of different CFs. For `kvdb`, you can set it to `default`, `write`, or `lock`; for `raftdb`, you can only set it to `default`.
- `-v` is used to specify the configuration value.
```bash
$ tikv-ctl modify-tikv-config -m kvdb -n max_background_jobs -v 8
success
$ tikv-ctl modify-tikv-config -m kvdb -n write.block-cache-size -v 256MB
success!
$ tikv-ctl modify-tikv-config -m raftdb -n default.disable_auto_compactions -v true
success!
```
### Force Region to recover the service from failure of multiple replicas
Use the `unsafe-recover remove-fail-stores` command to remove the failed machines from the peer list of Regions. Then after you restart TiKV, these Regions can continue to provide services using the other healthy replicas. This command is usually used in circumstances where multiple TiKV stores are damaged or deleted.
The `-s` option accepts multiple `store_id`s separated by commas, and the `-r` flag specifies the involved Regions. Otherwise, all Region peers located on these stores are removed by default.
```bash
$ tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores -s 3 -r 1001,1002
success!
```
> **Note:**
>
> - This command only supports the local mode. It prints `success!` when successfully run.
> - You must run this command for all stores where specified Regions' peers are located. If `-r` is not set, all Regions are involved, and you need to run this command for all stores.
### Recover from MVCC data corruption
Use the `recover-mvcc` command in circumstances where TiKV cannot run normally caused by MVCC data corruption. It cross-checks 3 CFs ("default", "write", "lock") to recover from various kinds of inconsistency.
Use the `-r` option to specify involved Regions by `region_id`. Use the `-p` option to specify PD endpoints.
```bash
$ tikv-ctl --db /path/to/tikv/db recover-mvcc -r 1001,1002 -p 127.0.0.1:2379
success!
```
> **Note**:
>
> - This command only supports the local mode. It prints `success!` when successfully run.
> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not.
> - You need to run this command for all stores where specified Regions' peers are located.
### Ldb Command
The ldb command line tool offers multiple data access and database administration commands. Some examples are listed below.
For more information, refer to the help message displayed when running `tikv-ctl ldb` or check the documents from RocksDB.
Examples of data access sequence:
To dump an existing RocksDB in HEX:
```bash
$ tikv-ctl ldb --hex --db=/tmp/db dump
```
To dump the manifest of an existing RocksDB:
```bash
$ tikv-ctl ldb --hex manifest_dump --path=/tmp/db/MANIFEST-000001
```
You can specify the column family that your query is against using the `--column_family=<string>` command line option.
`--try_load_options` loads the database options file to open the database. It is recommended to always keep this option on. If you open the database with default options, the LSM-tree might be corrupted in a way that cannot be recovered automatically.

View File

@ -0,0 +1,10 @@
---
title: tikv-server
description: Learn about interacting with tikv-server
menu:
"3.1-beta":
parent: Tools
weight: 1
---
You can explore `tikv-server --help` and try `tikv-server your sub command --help` to dig into this functionality.

View File

@ -0,0 +1,8 @@
---
title: Backup
description: Backup TiKV
draft: true
menu:
"3.1-beta":
parent: Tasks
---

View File

@ -0,0 +1,23 @@
---
title: Configure
description: Configure a wide range of TiKV facets, including RocksDB, gRPC, the Placement Driver, and more
menu:
"3.1-beta":
parent: Tasks
weight: 3
---
TiKV features a large number of configuration options you can use to tweak TiKV's behavior. When getting started with TiKV, it's usually safe to start with the defaults, configuring only the `--pd` (`pd.endpoints`) configuration.
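For example, a minimal local startup sketch; the addresses and data directory below are placeholders:
```bash
tikv-server --pd "127.0.0.1:2379" \
    --addr "127.0.0.1:20160" \
    --data-dir /tmp/tikv
```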
There are several guides that you can use to inform your configuration:
* [**Security**](../security): Use TLS security and review security procedures.
* [**Topology**](../topology): Use location awareness to improve resiliency and performance.
* [**Namespace**](../namespace): Use namespacing to configure resource isolation.
* [**Limit**](../limit): Tweak rate limiting.
* [**Region Merge**](../region-merge): Tweak region merging.
* [**RocksDB**](../rocksdb): Tweak RocksDB configuration options.
* [**Titan**](../titan): Enable titan to improve performance with large values.
You can find an exhaustive list of all options, as well as what they do, in the documented [**full configuration template**](https://github.com/tikv/tikv/blob/release-3.0/etc/config-template.toml).

View File

@ -0,0 +1,100 @@
---
title: Limit Config
description: Learn how to configure scheduling rate limit on stores
menu:
"3.1-beta":
parent: Configure
weight: 4
---
This section describes how to configure scheduling rate limit, specifically, at the store level.
In TiKV, PD generates different scheduling operators based on the information gathered from TiKV and on its scheduling strategies. The operators are then sent to TiKV to perform scheduling on Regions. You can use `*-schedule-limit` to set speed limits on different operators, but this may cause performance bottlenecks in certain scenarios because these parameters apply globally to the entire cluster. Rate limiting at the store level allows you to control scheduling more flexibly and at a finer granularity.
## How to configure scheduling rate limits on stores
PD provides the following two methods to configure scheduling rate limits on stores:
- Configure the rate limit using **`store-balance-rate`**.
{{< info >}}
The modification only takes effect on stores added after this configuration change, and is applied to all stores in the cluster after you restart TiKV. If you want the change to take effect immediately on all stores, or on individual stores added before the change, without restarting, combine this configuration with the `pd-ctl` method below. See [Sample usages](#sample-usages) for more details.
{{< /info >}}
`store-balance-rate` specifies the maximum number of scheduling tasks allowed for each store per minute. The scheduling steps include adding peers or learners. Set this parameter in the PD configuration file; the default value is 15, and the setting is persistent.
You can also use the `pd-ctl` tool to modify `store-balance-rate`; the change is persisted as well.
Example:
```bash
» config set store-balance-rate 20
```
- Use the `pd-ctl` tool to view or modify the upper limit of the scheduling rate. The commands are:
{{< info >}}
This method is not persistent, and the configuration will revert after restarting TiKV.
{{< /info >}}
- **`stores show limit`**
Example:
```bash
# If store-balance-rate is set to 15, the corresponding rate for all stores should be 15.
» stores show limit
{
"4": {
"rate": 15
},
"5": {
"rate": 15
},
# ...
}
```
- **`stores set limit <rate>`**
Example:
```bash
# Set the upper limit of scheduling rate for all stores to be 20 scheduling tasks per minute.
» stores set limit 20
```
- **`store limit <store_id> <rate>`**
Example:
```bash
# Set the upper limit of scheduling speed for store 2 to be 10 scheduling tasks per minute.
» store limit 2 10
```
See [PD Control](../../reference/tools/pd-ctl/) for more detailed description of these commands.
## Sample usages
- The following example modifies the rate limit to 20 and applies immediately to all stores. The configuration is still valid after restart.
```bash
» config set store-balance-rate 20
» stores set limit 20
```
- The following example modifies the rate limit for all stores to 20 and applies immediately. After restart, the configuration becomes invalid, and the rate limit for all stores specified by `store-balance-rate` takes over.
```bash
» stores set limit 20
```
- The following example modifies the rate limit for store 2 to 20 and applies immediately. After restart, the configuration becomes invalid, and the rate limit for store 2 becomes the value specified by `store-balance-rate`.
```bash
» store limit 2 20
```

View File

@ -0,0 +1,126 @@
---
title: Namespace Config
description: Learn how to configure namespace in TiKV.
menu:
"3.1-beta":
parent: Configure
weight: 3
---
Namespacing can be used to meet the requirements of resource isolation. In this mechanism, TiKV supports dividing all the TiKV nodes in the cluster among multiple separate namespaces and classifying Regions into the corresponding namespace by using a custom namespace classifier.
In this case, there is actually a constraint for the scheduling policy: the namespace that a Region belongs to should match the namespace of TiKV where each replica of this Region resides. PD continuously performs the constraint check during runtime. When it finds unmatched namespaces, it will schedule the Regions to make the replica distribution conform to the namespace configuration.
In a typical cluster (which includes [TiDB](https://github.com/pingcap/tidb)) the most common resource isolation requirement is resource isolation based on the SQL table schema -- for example, using non-overlapping hosts to carry data for different services. Therefore, PD provides `tableNamespaceClassifier` based on the SQL table schema by default. You can also adjust the PD server's `--namespace-classifier` parameter to use another custom classifier.
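For example, a sketch of making the default classifier explicit when starting PD; the data directory below is a placeholder:
```bash
pd-server --namespace-classifier=table --data-dir=/tmp/pd
```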
## Configure namespace
You can use `pd-ctl` to configure the namespace based on the table schema. All related operations are integrated in the `table_ns` subcommand. Here is an example.
1. Create 2 namespaces:
```bash
./bin/pd-ctl
» table_ns
{
"count": 0,
"namespaces": []
}
» table_ns create ns1
» table_ns create ns2
» table_ns
{
"count": 2,
"namespaces": [
{
"ID": 30,
"Name": "ns1"
},
{
"ID": 31,
"Name": "ns2"
}
]
}
```
Two namespaces, `ns1` and `ns2`, are now created, but they do not yet take effect because they are not bound to any TiKV nodes or tables.
2. Assign some TiKV nodes to the two namespaces:
```bash
» table_ns set_store 1 ns1
» table_ns set_store 2 ns1
» table_ns set_store 3 ns1
» table_ns set_store 4 ns2
» table_ns set_store 5 ns2
» table_ns set_store 6 ns2
» table_ns
{
"count": 2,
"namespaces": [
{
"ID": 30,
"Name": "ns1",
"store_ids": {
"1": true,
"2": true,
"3": true
}
},
{
"ID": 31,
"Name": "ns2",
"store_ids": {
"4": true,
"5": true,
"6": true
}
}
]
}
```
3. Assign some tables to the corresponding namespaces (the table ID information can be obtained through TiDB's API):
```bash
» table_ns add ns1 1001
» table_ns add ns2 1002
» table_ns
{
"count": 2,
"namespaces": [
{
"ID": 30,
"Name": "ns1",
"table_ids": {
"1001": true
},
"store_ids": {
"1": true,
"2": true,
"3": true
}
},
{
"ID": 31,
"Name": "ns2",
"table_ids": {
"1002": true
},
"store_ids": {
"4": true,
"5": true,
"6": true
}
}
]
}
```
The namespace configuration is finished. PD will schedule the replicas of table 1001 to TiKV nodes 1, 2, and 3, and the replicas of table 1002 to TiKV nodes 4, 5, and 6.
In addition, PD supports some other `table_ns` subcommands, such as the `remove` and `rm_store` commands which remove the table and TiKV node from the specified namespace respectively. PD also supports setting different scheduling configurations within the namespace. For more details, see [PD Control User Guide](../../reference/tools/pd-ctl/).
When the namespace configuration is updated, the namespace constraint may be violated. It will take a while for PD to complete the scheduling process. You can view all Regions that violate the constraint using the `pd-ctl` command `region check incorrect-ns`.
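For example, following the output format of the `region check` command in the pd-ctl reference; an empty result means no Region currently violates the constraint:
```bash
» region check incorrect-ns
{
  "count": 0,
  "regions": []
}
```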

View File

@ -0,0 +1,40 @@
---
title: Region Merge Config
description: Learn how to configure Region Merge in TiKV.
menu:
"3.1-beta":
parent: Configure
weight: 5
---
TiKV replicates a segment of data in Regions via the Raft state machine. As data writes increase, a Region Split happens when the size of the Region or the number of its keys reaches a threshold. Conversely, if the size of the Region and the number of its keys shrink because of data deletion, we can use Region Merge to merge adjacent smaller Regions. This relieves some stress on Raftstore.
## Merge process
Region Merge is initiated by the Placement Driver (PD). The steps are:
1. PD polls the meta information of Regions constantly.
2. If the Region size is less than `max-merge-region-size` and the number of keys in the Region is less than `max-merge-region-keys`, PD merges the Region with the smaller of its two adjacent Regions.
**Note:**
- All replicas of the two Regions to be merged must be located on the same set of TiKV nodes (this is ensured by the PD scheduler).
- Newly split Regions won't be merged within the period of time specified by `split-merge-interval`.
- Region Merge won't happen within the period of time specified by `split-merge-interval` after PD starts or restarts.
- Region Merge won't happen for two Regions that belong to different tables if `namespace-classifier = table` (default).
## Configure Region Merge
Region Merge is enabled by default. You can use `pd-ctl` or the PD configuration file to configure Region Merge.
To enable Region Merge, set the following parameters to a non-zero value:
- `max-merge-region-size`
- `max-merge-region-keys`
- `merge-schedule-limit`
You can use `split-merge-interval` to control the interval between the `split` and `merge` operations.
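For example, a pd-ctl sketch using the default values documented in [PD Control](../../reference/tools/pd-ctl/):
```bash
» config set max-merge-region-size 20 // Merge Regions smaller than 20 MB
» config set max-merge-region-keys 200000 // Merge Regions with fewer than 200000 keys
» config set merge-schedule-limit 8 // Run at most 8 Merge scheduling tasks at the same time
» config set split-merge-interval 1h // Do not merge a Region within 1h after it splits
```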
For detailed descriptions on the above parameters, refer to [PD Control](../../reference/tools/pd-ctl/).

View File

@ -0,0 +1,37 @@
---
title: RocksDB Config
description: Learn how to configure RocksDB in TiKV.
menu:
"3.1-beta":
parent: Configure
weight: 6
---
TiKV uses [RocksDB](https://rocksdb.org/) as its underlying storage engine for storing both [Raft logs](architecture#raft) and KV (key-value) pairs.
{{< info >}}
RocksDB was chosen for TiKV because it provides a highly customizable persistent key-value store that can be tuned to run in a variety of production environments, including pure memory, Flash, hard disks, or HDFS. It supports various compression algorithms and provides solid tools for production support and debugging.
{{< /info >}}
TiKV creates two RocksDB instances on each Node:
* A `rocksdb` instance that stores most TiKV data
* A `raftdb` that stores [Raft logs](architecture#raft) and has a single column family called `raftdb.defaultcf`
The `rocksdb` instance has three column families:
Column family | Purpose
:-------------|:-------
`rocksdb.defaultcf` | Stores actual KV pairs for TiKV
`rocksdb.writecf` | Stores commit information in the MVCC model
`rocksdb.lockcf` | Stores lock information in the MVCC model
RocksDB can be configured on a per-column-family basis. Here's an example that sets a DB-wide option and a `defaultcf`-specific option:
```toml
[rocksdb]
max-background-jobs = 8

[rocksdb.defaultcf]
block-cache-size = "1GB"
```
### RocksDB configuration options
{{< config "rocksdb" >}}

View File

@ -0,0 +1,186 @@
---
title: Security Config
description: Keeping your TiKV deployment secure
menu:
"3.1-beta":
parent: Configure
weight: 1
---
This page discusses how to secure your TiKV deployment. Learn how to:
* Use [Transport Layer Security](#transport-layer-security-tls) to encrypt connections between TiKV nodes
* Use [On-Disk Encryption](#on-disk-encryption) to encrypt the data that TiKV reads and writes to disk
* [Report vulnerabilities](#reporting-vulnerabilities)
## Transport Layer Security (TLS)
Transport Layer Security is a standard protocol for protecting network communications from tampering or inspection. TiKV uses OpenSSL, an industry standard, to implement its TLS encryption.
It's often necessary to use TLS in situations where TiKV is being deployed or accessed from outside of a secure virtual local area network (VLAN). This includes deployments which cross the WAN (the public internet), which are part of an untrusted data center network, or where other untrustworthy users or services are active.
## Before you get started
Before you get started, review your infrastructure. Your organization may already use something like the [Kubernetes certificates API](https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/) to issue certificates. You will need the following for your deployment:
- A **Certificate Authority** (CA) certificate
- Individual unique **certificates** and **keys** for each TiKV or PD service
- One or many **certificates** and **keys** for TiKV clients depending on your needs.
If you have these, you can skip the optional section below.
If your organization doesn't yet have a public key infrastructure (PKI), you can create a simple Certificate Authority to issue certificates for the services in your deployment. The instructions below show you how to do this in a few quick steps:
### Optional: Generate a test certificate chain
Prepare certificates for each TiKV and PD node to be involved with the cluster.
It is recommended to prepare a separate server certificate for TiKV and the Placement Driver (PD), and make sure that they can authenticate each other. The clients of TiKV and PD can share one client certificate.
You can use multiple tools to generate self-signed certificates, such as `openssl`, `easy-rsa`, and `cfssl`.
Here is an example of generating self-signed certificates using [`easyrsa`](https://github.com/OpenVPN/easy-rsa/):
```bash
#! /bin/bash
set +e
mkdir -p easyrsa
cd easyrsa
curl -L https://github.com/OpenVPN/easy-rsa/releases/download/v3.0.6/EasyRSA-unix-v3.0.6.tgz \
| tar xzv --strip-components=1
./easyrsa init-pki \
&& ./easyrsa build-ca nopass
NUM_PD_NODES=3
for i in $(seq 1 $NUM_PD_NODES); do
./easyrsa gen-req pd$i nopass
./easyrsa sign-req server pd$i
done
NUM_TIKV_NODES=3
for i in $(seq 1 $NUM_TIKV_NODES); do
./easyrsa gen-req tikv$i nopass
./easyrsa sign-req server tikv$i
done
./easyrsa gen-req client nopass
./easyrsa sign-req server client
```
If you run this script, you'll need to interactively answer some questions and make some confirmations. You can answer with anything for the CA common name. For the PD and TiKV nodes, use the hostnames.
{{< info >}}
You can explore the `easyrsa/vars.example` file if you are hoping to write an unattended script.
{{< /info >}}
If the script runs successfully, you should have something like this:
```bash
$ ls easyrsa/pki/{ca.crt,issued,private}
easyrsa/pki/ca.crt
easyrsa/pki/issued:
client.crt pd1.crt pd2.crt pd3.crt tikv1.crt tikv2.crt tikv3.crt
easyrsa/pki/private:
ca.key client.key pd1.key pd2.key pd3.key tikv1.key tikv2.key tikv3.key
```
### Configure the TiKV Server Certificates
Specify the TLS options for TiKV certificate with the configuration file options:
```toml
# Using empty strings here means disabling secure connections.
[security]
# The path to the file that contains the PEM encoding of the server's CA certificates.
ca-path = "/path/to/ca.pem"
# The path to the file that contains the PEM encoding of the server's certificate chain.
cert-path = "/path/to/tikv-server-cert.pem"
# The path to the file that contains the PEM encoding of the server's private key.
key-path = "/path/to/tikv-server-key.pem"
```
You'll also need to **change the connection URL to `https://`**.
### Configure the PD Certificates
Specify the TLS options for PD certificate with the configuration file options:
```toml
[security]
# The path to the file that contains the PEM encoding of the server's CA certificates.
cacert-path = "/path/to/ca.pem"
# The path to the file that contains the PEM encoding of the server's certificate chain.
cert-path = "/path/to/pd-server-cert.pem"
# The path to the file that contains the PEM encoding of the server's private key.
key-path = "/path/to/pd-server-key.pem"
```
You'll also need to **change the connection URL to `https://`**.
### Configure the Client
When connecting your TiKV Client, you'll need to specify the TLS options. In this example, we build a configuration for the [Rust Client](https://github.com/tikv/client-rust):
```rust
let config = Config::new(/* ... */).with_security(
// The path to the file that contains the PEM encoding of the server's CA certificates.
"/path/to/ca.pem",
// The path to the file that contains the PEM encoding of the client's certificate chain.
"/path/to/client-cert.pem",
// The path to the file that contains the PEM encoding of the client's private key.
"/path/to/client-key.pem"
);
```
You'll also need to **change the connection URL to `https://`**.
### Connecting with `tikv-ctl` and `pd-ctl`
When using `pd-ctl` and `tikv-ctl`, the relevant options need to be specified:
```bash
pd-ctl \
--pd "https://127.0.0.1:2379" \
# The path to the file that contains the PEM encoding of the server's CA certificates.
--cacert "/path/to/ca.pem" \
# The path to the file that contains the PEM encoding of the client's certificate chain.
--cert "/path/to/client.pem" \
# The path to the file that contains the PEM encoding of the client's private key.
--key "/path/to/client-key.pem"
tikv-ctl \
--host "127.0.0.1:20160" \
# The path to the file that contains the PEM encoding of the server's CA certificates.
--ca-path "/path/to/ca.pem" \
# The path to the file that contains the PEM encoding of the client's certificate chain.
--cert-path "/path/to/client.pem" \
# The path to the file that contains the PEM encoding of the client's private key.
--key-path "/path/to/client-key.pem"
```
## On-Disk Encryption
TiKV currently does not offer built-in on-disk encryption.
This means an actor with access to the data directory could extract TiKV data from it. If TiKV offered built-in on-disk encryption, such an actor would not be able to access the data.
This feature is part of the planned [roadmap](https://github.com/tikv/tikv/blob/master/docs/ROADMAP.md#engine) under 'Pluggable Engine Interface'. *(See [Issue #3680](https://github.com/tikv/tikv/issues/3680) if you want to help.)*
If your use case only requires that the data be encrypted at the partition level, it is advised to use [`dm-crypt`](https://en.wikipedia.org/wiki/Dm-crypt). This will protect data if, for example, the disk is incorrectly disposed of or stolen.
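For example, a minimal `dm-crypt` sketch using `cryptsetup`; the device name and mount point below are placeholders:
```bash
# Format the partition as a LUKS container (prompts for a passphrase)
cryptsetup luksFormat /dev/sdb1
# Open the container under /dev/mapper/tikv-data
cryptsetup luksOpen /dev/sdb1 tikv-data
# Create a filesystem and mount it where TiKV keeps its data
mkfs.ext4 /dev/mapper/tikv-data
mount /dev/mapper/tikv-data /var/lib/tikv
```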
## Reporting Vulnerabilities
For most vulnerabilities, you are invited to open a ['Bug Report'](https://github.com/tikv/tikv/issues/new?template=bug-report.md) on our issue tracker.
For situations where the vulnerability must be kept secret to maintain data security or integrity, you should contact a [maintainer](https://github.com/tikv/tikv/blob/master/MAINTAINERS.md), who is best equipped to handle these critical situations.
Examples of critical situations:
* You have discovered a bug in the TLS implementation of TiKV which could leak data.
* You have discovered a way to retrieve more data than expected from TiKV.
*Please do not disclose critical vulnerabilities publicly if you are unsure.*

View File

@ -0,0 +1,55 @@
---
title: Titan Config
description: Learn how to enable Titan in TiKV.
menu:
"3.1-beta":
parent: Configure
weight: 7
---
Titan is a plugin of RocksDB developed by PingCAP to provide key-value separation. The goal of Titan is to reduce write amplification of RocksDB when using large values.
## How Titan works
{{< figure
src="/img/docs/titan-architecture.png"
caption="Titan Architecture"
number="" >}}
Titan separates values from the LSM-tree during flush and compaction. While the actual value is stored in a blob file, the value in the LSM tree functions as the position index of the actual value. When a GET operation is performed, Titan obtains the blob index for the corresponding key from the LSM tree. Using the index, Titan locates the actual value in the blob file and returns it. For more details on the design and implementation of Titan, see [Titan: A RocksDB Plugin to Reduce Write Amplification](https://pingcap.com/blog/titan-storage-engine-design-and-implementation/).
{{< info >}}
**Caveat:** Titan's improved write performance comes at the cost of storage space and range query performance. It is recommended mainly for scenarios with large values (>= 1 KB).
{{< /info >}}
## How to enable Titan
As Titan has not reached the maturity required for production use, it is disabled in TiKV by default. Before enabling it, make sure you understand the caveat mentioned above and have evaluated your scenarios and needs.
To enable Titan, specify the following in the TiKV configuration file:
```toml
[rocksdb.titan]
# Enables or disables `Titan`. Note that Titan is still an experimental feature.
# default: false
enabled = true
```
## How to fall back to RocksDB
If you find Titan does not help or is causing read or other performance issues, you can take the following steps to fall back to RocksDB:
1. Enter the fallback mode by specifying:
```bash
tikv-ctl --host 127.0.0.1:20160 modify-tikv-config -m kvdb -n default.blob_run_mode -v "kFallback"
```
{{< info >}}
Make sure you have already enabled Titan.
{{< /info >}}
2. Wait until the number of blob files is reduced to 0. Alternatively, you can speed this up with `tikv-ctl compact-cluster`.
3. In the TiKV configuration file, specify `rocksdb.titan.enabled=false`, and restart TiKV.

View File

@ -0,0 +1,97 @@
---
title: Topology Config
description: Learn how to configure labels.
menu:
"3.1-beta":
parent: Configure
weight: 2
---
TiKV uses labels to declare its location information, and PD schedules replicas according to the topology of the cluster to maximize TiKV's capability of disaster recovery. This document describes how to configure labels.
## TiKV reports the topological information
In order for PD to get the topology of the cluster, TiKV reports its topological information to PD according to the startup parameter or configuration of TiKV. Assume that the topology has three levels: zone > rack > host. Use labels to specify the following information for each TiKV instance:
- Startup parameter:
```bash
tikv-server --labels zone=<zone>,rack=<rack>,host=<host>
```
- Configuration:
```toml
[server]
labels = "zone=<zone>,rack=<rack>,host=<host>"
```
## PD understands the TiKV topology
After getting the topology of the TiKV cluster, PD also needs to know the hierarchical relationship of the topology. You can configure it through the PD configuration or `pd-ctl`:
- PD configuration:
```toml
[replication]
max-replicas = 3
location-labels = ["zone", "rack", "host"]
```
- `pd-ctl`:
```bash
pd-ctl >> config set location-labels zone,rack,host
```
To make PD understand that the labels represent the TiKV topology, keep `location-labels` consistent with the TiKV `labels` names. See the following example.
### Example
PD schedules replicas optimally according to the topological information, so you only need to decide which topology achieves the desired effect.
If you use 3 replicas and want the cluster to remain highly available even when a data zone goes down, you need at least 4 data zones.
Assume that you have 4 data zones, each zone has 2 racks, and each rack has 2 hosts. You can start one TiKV instance on each host as follows:
Start TiKV:
```bash
# zone=z1
tikv-server --labels zone=z1,rack=r1,host=h1
tikv-server --labels zone=z1,rack=r1,host=h2
tikv-server --labels zone=z1,rack=r2,host=h1
tikv-server --labels zone=z1,rack=r2,host=h2
# zone=z2
tikv-server --labels zone=z2,rack=r1,host=h1
tikv-server --labels zone=z2,rack=r1,host=h2
tikv-server --labels zone=z2,rack=r2,host=h1
tikv-server --labels zone=z2,rack=r2,host=h2
# zone=z3
tikv-server --labels zone=z3,rack=r1,host=h1
tikv-server --labels zone=z3,rack=r1,host=h2
tikv-server --labels zone=z3,rack=r2,host=h1
tikv-server --labels zone=z3,rack=r2,host=h2
# zone=z4
tikv-server --labels zone=z4,rack=r1,host=h1
tikv-server --labels zone=z4,rack=r1,host=h2
tikv-server --labels zone=z4,rack=r2,host=h1
tikv-server --labels zone=z4,rack=r2,host=h2
```
Configure PD:
```bash
# use `pd-ctl` to connect to PD:
$ pd-ctl
>> config set location-labels zone,rack,host
```
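To confirm that PD has registered the labels, you can, for example, list the stores from the same `pd-ctl` session; each store's `labels` field should show the `zone`/`rack`/`host` values passed at startup:
```bash
$ pd-ctl
>> store
```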
Now the cluster can work well. The 16 TiKV instances are distributed across 4 data zones, 8 racks, and 16 machines. In this case, PD schedules the replicas of each Region to different data zones.
- If one of the data zones goes down, the high availability of the cluster is not affected.
- If the data zone cannot recover within a period of time, PD removes the replicas from this data zone.
PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to reach a certain level of disaster recovery, deploy machines across different sites according to the topology. The number of machines must be greater than the number of `max-replicas`.

View File

@ -0,0 +1,548 @@
---
title: Ansible Deployment
description: Use TiDB-Ansible to deploy a TiKV cluster on multiple nodes.
menu:
"3.1-beta":
parent: Deploy
weight: 2
---
This guide describes how to install and deploy TiKV using Ansible. Ansible is an IT automation tool that can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates.
[TiDB-Ansible](https://github.com/pingcap/tidb-ansible) is a TiDB cluster deployment tool developed by PingCAP, based on Ansible playbook. TiDB-Ansible enables you to quickly deploy a new TiKV cluster which includes PD, TiKV, and the cluster monitoring modules.
## Prepare
Before you start, make sure you have:
1. Several target machines that meet the following requirements:
- 4 or more machines
+ A standard TiKV cluster contains 6 machines. You can use 4 machines for testing.
- CentOS 7.3 (64 bit) or later with Python 2.7 installed, x86_64 architecture (AMD64)
- Network between machines
> **Note:** When you deploy TiKV using Ansible, use SSD disks for the data directory of TiKV and PD nodes. Otherwise, the system will not perform well. For more details, see [Software and Hardware Requirements](../introduction).
2. A Control Machine that meets the following requirements:
> **Note:** The Control Machine can be one of the target machines.
- CentOS 7.3 (64 bit) or later with Python 2.7 installed
- Access to the Internet
- Git installed
## Step 1: Install system dependencies on the Control Machine
Log in to the Control Machine using the `root` user account, and run the corresponding command according to your operating system.
If you use a Control Machine installed with **CentOS 7**, run the following command:
```bash
$ yum -y install epel-release git curl sshpass
$ yum -y install python-pip
```
If you use a Control Machine installed with **Ubuntu**, run the following command:
```bash
$ apt-get -y install git curl sshpass python-pip
```
## Step 2: Create the `tidb` user on the Control Machine and generate the SSH key
Make sure you have logged in to the Control Machine using the `root` user account, and then run the following command.
**Create the `tidb` user.**
```bash
$ useradd -m -d /home/tidb tidb
```
**Set a password for the `tidb` user account:**
```bash
$ passwd tidb
```
Configure sudo without password for the `tidb` user account by adding `tidb ALL=(ALL) NOPASSWD: ALL` to the end of the sudoers file:
```bash
$ visudo
tidb ALL=(ALL) NOPASSWD: ALL
```
**Generate the SSH key:**
Execute the `su` command to switch the user from `root` to `tidb`. Create the SSH key for the `tidb` user account and hit the Enter key when `Enter passphrase` is prompted. After successful execution, the SSH private key file is `/home/tidb/.ssh/id_rsa`, and the SSH public key file is `/home/tidb/.ssh/id_rsa.pub`.
```bash
$ su - tidb
$ ssh-keygen -t rsa
```
## Step 3: Download TiDB-Ansible to the Control Machine
1. Log in to the Control Machine using the `tidb` user account and enter the `/home/tidb` directory.
2. Download the corresponding TiDB-Ansible version from the [TiDB-Ansible project](https://github.com/pingcap/tidb-ansible). The default folder name is `tidb-ansible`.
- Download the 3.0 GA version:
```bash
$ git clone -b release-3.0 https://github.com/pingcap/tidb-ansible.git
```
- Download the master version:
```bash
$ git clone https://github.com/pingcap/tidb-ansible.git
```
> **Note:** It is required to download `tidb-ansible` to the `/home/tidb` directory using the `tidb` user account. If you download it to the `/root` directory, a privilege issue occurs.
If you have questions regarding which version to use, email [info@pingcap.com](mailto:info@pingcap.com) for more information or [file an issue](https://github.com/pingcap/tidb-ansible/issues/new).
## Step 4: Install Ansible and its dependencies on the Control Machine
Make sure you have logged in to the Control Machine using the `tidb` user account.
It is required to use `pip` to install Ansible and its dependencies, otherwise a compatibility issue occurs. Currently, the tidb-ansible release-3.0 branch is compatible with Ansible 2.4 and Ansible 2.5.
**Install Ansible and the dependencies on the Control Machine:**
```bash
$ cd /home/tidb/tidb-ansible
$ sudo pip install -r ./requirements.txt
```
Ansible and the related dependencies are in the `tidb-ansible/requirements.txt` file.
**View the version of Ansible:**
```bash
$ ansible --version
ansible 2.5.0
```
## Step 5: Configure the SSH mutual trust and sudo rules on the Control Machine
Make sure you have logged in to the Control Machine using the `tidb` user account.
Add the IPs of your target machines to the `[servers]` section of the `hosts.ini` file.
```bash
$ cd /home/tidb/tidb-ansible
$ vi hosts.ini
[servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.4
172.16.10.5
172.16.10.6
[all:vars]
username = tidb
ntp_server = pool.ntp.org
```
Run the following command and input the `root` user account password of your target machines.
```bash
$ ansible-playbook -i hosts.ini create_users.yml -u root -k
```
This step creates the `tidb` user account on the target machines, and configures the sudo rules and the SSH mutual trust between the Control Machine and the target machines.
> **Note:** To configure the SSH mutual trust and sudo without password manually, see [How to manually configure the SSH mutual trust and sudo without password](https://github.com/pingcap/docs/blob/master/dev/how-to/deploy/orchestrated/ansible.md#how-to-manually-configure-the-ssh-mutual-trust-and-sudo-without-password).
## Step 6: Install the NTP service on the target machines
> **Note:** If the time and time zone of all your target machines are same, the NTP service is on and is normally synchronizing time, you can ignore this step. See [How to check whether the NTP service is normal](https://github.com/pingcap/docs/blob/master/op-guide/ansible-deployment.md#how-to-check-whether-the-ntp-service-is-normal).
Make sure you have logged in to the Control Machine using the `tidb` user account, and run the following command:
```bash
$ cd /home/tidb/tidb-ansible
$ ansible-playbook -i hosts.ini deploy_ntp.yml -u tidb -b
```
The NTP service is installed and started using the software repository that comes with the system on the target machines. The default NTP server list in the installation package is used. The related `server` parameter is in the `/etc/ntp.conf` configuration file.
To make the NTP service start synchronizing as soon as possible, the system executes the `ntpdate` command to set the local date and time by polling `ntp_server` in the `hosts.ini` file. The default server is `pool.ntp.org`, and you can also replace it with your NTP server.
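To spot-check that a target machine is actually synchronizing, you can run `ntpstat` on it (shipped with most distributions' NTP packages); output along these lines, with an illustrative upstream server, indicates a healthy state:
```bash
$ ntpstat
synchronised to NTP server (85.199.214.101) at stratum 2
   time correct to within 91 ms
   polling server every 1024 s
```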
## Step 7: Configure the CPUfreq governor mode on the target machine
For details about CPUfreq, see [the CPUfreq Governor documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/power_management_guide/cpufreq_governors).
Set the CPUfreq governor mode to `performance` to make full use of CPU performance.
### Check the governor modes supported by the system
You can run the `cpupower frequency-info --governors` command to check the governor modes which the system supports:
```bash
$ cpupower frequency-info --governors
analyzing CPU 0:
available cpufreq governors: performance powersave
```
Taking the above output as an example, the system supports the `performance` and `powersave` modes.
> **Note:** As the following shows, if it returns "Not Available", it means that the current system does not support CPUfreq configuration and you can skip this step.
```bash
$ cpupower frequency-info --governors
analyzing CPU 0:
available cpufreq governors: Not Available
```
### Check the current governor mode
You can run the `cpupower frequency-info --policy` command to check the current CPUfreq governor mode:
```bash
$ cpupower frequency-info --policy
analyzing CPU 0:
current policy: frequency should be within 1.20 GHz and 3.20 GHz.
The governor "powersave" may decide which speed to use
within this range.
```
As the above output shows, the current mode is `powersave` in this example.
### Change the governor mode
* You can run the following command to change the current mode to `performance`:
```bash
$ cpupower frequency-set --governor performance
```
* You can also run the following command to set the mode on the target machine in batches:
```bash
$ ansible -i hosts.ini all -m shell -a "cpupower frequency-set --governor performance" -u tidb -b
```
## Step 8: Mount the data disk ext4 filesystem with options on the target machines
Log in to the target machines using the `root` user account.
Format your data disks to the ext4 filesystem and mount the filesystem with the `nodelalloc` and `noatime` options. It is required to mount with the `nodelalloc` option, or else the Ansible deployment cannot pass the test. The `noatime` option is optional.
> **Note:** If your data disks have already been formatted to ext4 and mounted, you can unmount them by running the `$ umount /dev/nvme0n1` command, then follow the steps below starting from editing the `/etc/fstab` file to remount the filesystem with the options.
Take the `/dev/nvme0n1` data disk as an example:
View the data disk.
```bash
$ fdisk -l
Disk /dev/nvme0n1: 1000 GB
```
Create the partition table.
```bash
$ parted -s -a optimal /dev/nvme0n1 mklabel gpt -- mkpart primary ext4 1 -1
```
Format the data disk to the ext4 filesystem.
```bash
$ mkfs.ext4 /dev/nvme0n1
```
View the partition UUID of the data disk. (In this example, the UUID of `nvme0n1` is `c51eb23b-195c-4061-92a9-3fad812cc12f`.)
```bash
$ lsblk -f
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
├─sda1 ext4 237b634b-a565-477b-8371-6dff0c41f5ab /boot
├─sda2 swap f414c5c0-f823-4bb1-8fdf-e531173a72ed
└─sda3 ext4 547909c1-398d-4696-94c6-03e43e317b60 /
sr0
nvme0n1 ext4 c51eb23b-195c-4061-92a9-3fad812cc12f
```
Edit the `/etc/fstab` file and add the mount options.
```bash
$ vi /etc/fstab
UUID=c51eb23b-195c-4061-92a9-3fad812cc12f /data1 ext4 defaults,nodelalloc,noatime 0 2
```
Mount the data disk.
```bash
$ mkdir /data1
$ mount -a
```
Check using the following command.
```bash
$ mount -t ext4
/dev/nvme0n1 on /data1 type ext4 (rw,noatime,nodelalloc,data=ordered)
```
If the filesystem is ext4 and `nodelalloc` is included in the mount options, you have successfully mounted the data disk ext4 filesystem with options on the target machines.
## Step 9: Edit the `inventory.ini` file to orchestrate the TiKV cluster
Edit the `tidb-ansible/inventory.ini` file to orchestrate the TiKV cluster. The standard TiKV cluster contains 6 machines: 3 PD nodes and 3 TiKV nodes.
- Deploy at least 3 instances for TiKV.
- Do not deploy TiKV together with PD on the same machine.
- Use the first PD machine as the monitoring machine.
> **Note:**
>
> - Leave `[tidb_servers]` in the `inventory.ini` file empty, because this deployment is for the TiKV cluster, not the TiDB cluster.
> - It is required to use the internal IP address to deploy. If the SSH port of the target machines is not the default 22 port, you need to add the `ansible_port` variable. For example, `TiDB1 ansible_host=172.16.10.1 ansible_port=5555`.
You can choose one of the following two types of cluster topology according to your scenario:
- [The cluster topology of a single TiKV instance on each TiKV node](#option-1-use-the-cluster-topology-of-a-single-tikv-instance-on-each-tikv-node)
In most cases, it is recommended to deploy one TiKV instance on each TiKV node for better performance. However, if the CPU and memory of your TiKV machines are much better than required in [Hardware and Software Requirements](../introduction), and you have more than two disks in one node or the capacity of one SSD is larger than 2 TB, you can deploy no more than 2 TiKV instances on a single TiKV node.
- [The cluster topology of multiple TiKV instances on each TiKV node](#option-2-use-the-cluster-topology-of-multiple-tikv-instances-on-each-tikv-node)
### Option 1: Use the cluster topology of a single TiKV instance on each TiKV node
| Name | Host IP | Services |
|-------|-------------|----------|
| node1 | 172.16.10.1 | PD1 |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| node6 | 172.16.10.6 | TiKV3 |
Edit the `inventory.ini` file as follows:
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3
[tikv_servers]
172.16.10.4
172.16.10.5
172.16.10.6
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.4
172.16.10.5
172.16.10.6
```
### Option 2: Use the cluster topology of multiple TiKV instances on each TiKV node
Take two TiKV instances on each TiKV node as an example:
| Name | Host IP | Services |
|-------|-------------|------------------|
| node1 | 172.16.10.1 | PD1 |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1-1, TiKV1-2 |
| node5 | 172.16.10.5 | TiKV2-1, TiKV2-2 |
| node6 | 172.16.10.6 | TiKV3-1, TiKV3-2 |
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3
[tikv_servers]
TiKV1-1 ansible_host=172.16.10.4 deploy_dir=/data1/deploy tikv_port=20171 labels="host=tikv1"
TiKV1-2 ansible_host=172.16.10.4 deploy_dir=/data2/deploy tikv_port=20172 labels="host=tikv1"
TiKV2-1 ansible_host=172.16.10.5 deploy_dir=/data1/deploy tikv_port=20171 labels="host=tikv2"
TiKV2-2 ansible_host=172.16.10.5 deploy_dir=/data2/deploy tikv_port=20172 labels="host=tikv2"
TiKV3-1 ansible_host=172.16.10.6 deploy_dir=/data1/deploy tikv_port=20171 labels="host=tikv3"
TiKV3-2 ansible_host=172.16.10.6 deploy_dir=/data2/deploy tikv_port=20172 labels="host=tikv3"
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.4
172.16.10.5
172.16.10.6
...
[pd_servers:vars]
location_labels = ["host"]
```
Edit the parameters in the service configuration file:
1. For the cluster topology of multiple TiKV instances on each TiKV node, you need to edit the `block-cache-size` parameter in `tidb-ansible/conf/tikv.yml` (a worked example follows this list):
- `rocksdb defaultcf block-cache-size(GB)`: MEM * 80% / number of TiKV instances * 30%
- `rocksdb writecf block-cache-size(GB)`: MEM * 80% / number of TiKV instances * 45%
- `rocksdb lockcf block-cache-size(GB)`: MEM * 80% / number of TiKV instances * 2.5% (128 MB at a minimum)
- `raftdb defaultcf block-cache-size(GB)`: MEM * 80% / number of TiKV instances * 2.5% (128 MB at a minimum)
2. For the cluster topology of multiple TiKV instances on each TiKV node, you need to edit the `high-concurrency`, `normal-concurrency` and `low-concurrency` parameters in the `tidb-ansible/conf/tikv.yml` file:
```
readpool:
coprocessor:
# Notice: if CPU_NUM > 8, default thread pool size for coprocessors
# will be set to CPU_NUM * 0.8.
# high-concurrency: 8
# normal-concurrency: 8
# low-concurrency: 8
```
Recommended configuration: `number of TiKV instances * parameter value = CPU_Vcores * 0.8`.
3. If multiple TiKV instances are deployed on the same physical disk, edit the `capacity` parameter in `conf/tikv.yml`:
- `capacity`: total disk capacity / number of TiKV instances (the unit is GB)
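As a worked example with hypothetical hardware, suppose each TiKV node has 64 GB of RAM, 16 CPU cores, and one shared 4 TB data disk, and runs 2 TiKV instances:
- `rocksdb defaultcf block-cache-size`: 64 GB * 80% / 2 * 30% ≈ 7.7 GB
- `rocksdb writecf block-cache-size`: 64 GB * 80% / 2 * 45% ≈ 11.5 GB
- `rocksdb lockcf` and `raftdb defaultcf block-cache-size`: 64 GB * 80% / 2 * 2.5% ≈ 0.64 GB each (above the 128 MB minimum)
- `high-concurrency`, `normal-concurrency`, `low-concurrency`: 16 * 0.8 / 2 ≈ 6 each
- `capacity`: 4096 GB / 2 = 2048 GB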
## Step 10: Edit variables in the `inventory.ini` file
Edit the `deploy_dir` variable to configure the deployment directory.
The global variable is set to `/home/tidb/deploy` by default, and it applies to all services. If the data disk is mounted on the `/data1` directory, you can set it to `/data1/deploy`. For example:
```bash
## Global variables
[all:vars]
deploy_dir = /data1/deploy
```
**Note:** To set the deployment directory separately for a service, you can configure the host variable while configuring the service host list in the `inventory.ini` file. It is required to add an alias in the first column to avoid confusion in scenarios of mixed services deployment.
```bash
TiKV1-1 ansible_host=172.16.10.4 deploy_dir=/data1/deploy
```
Set the `deploy_without_tidb` variable to `True`.
```bash
deploy_without_tidb = True
```
{{< info >}}
**Note:** If you need to edit other variables, see [the variable description table](https://pingcap.com/docs/dev/how-to/deploy/orchestrated/ansible/#edit-other-variables-optional).
{{< /info >}}
## Step 11: Deploy the TiKV cluster
When `ansible-playbook` executes the Playbook, the default concurrency number is 5. If many target machines are deployed, you can add the `-f` parameter to specify the concurrency, such as `ansible-playbook deploy.yml -f 10`.
The following example uses `tidb` as the user who runs the service.
Check the `tidb-ansible/inventory.ini` file to make sure `ansible_user = tidb`.
```bash
## Connection
# ssh via normal user
ansible_user = tidb
```
Make sure the SSH mutual trust and sudo without password are successfully configured.
* Run the following command and if all servers return `tidb`, then the SSH mutual trust is successfully configured:
```bash
ansible -i inventory.ini all -m shell -a 'whoami'
```
* Run the following command and if all servers return `root`, then sudo without password of the `tidb` user is successfully configured:
```bash
ansible -i inventory.ini all -m shell -a 'whoami' -b
```
**Download the TiKV binary to the Control Machine.**
```bash
ansible-playbook local_prepare.yml
```
**Initialize the system environment and modify the kernel parameters.**
```bash
ansible-playbook bootstrap.yml
```
**Deploy the TiKV cluster.**
```bash
ansible-playbook deploy.yml
```
**Start the TiKV cluster.**
```bash
ansible-playbook start.yml
```
You can check whether the TiKV cluster has been successfully deployed using the following command:
```bash
curl 172.16.10.1:2379/pd/api/v1/stores
```
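If you only want the health summary, you can, for example, filter out the state of each store; every store should report `Up`:
```bash
curl -s 172.16.10.1:2379/pd/api/v1/stores | grep state_name
```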
If you want to try the Go client, see [Try Two Types of APIs](../../../reference/clients/go).
## Stop the TiKV cluster
If you want to stop the TiKV cluster, run the following command:
```bash
ansible-playbook stop.yml
```
## Destroy the TiKV cluster
{{< warning >}}
**Warning:** Before you clean the cluster data or destroy the TiKV cluster, make sure you do not need it any more.
{{< /warning >}}
- If you do not need the data any more, you can clean up the data for test using the following command:
```bash
ansible-playbook unsafe_cleanup_data.yml
```
- If you do not need the TiKV cluster any more, you can destroy it using the following command:
```bash
ansible-playbook unsafe_cleanup.yml
```
> **Note:** If the deployment directory is a mount point, an error might be reported, but the implementation result remains unaffected. You can just ignore the error.

View File

@ -0,0 +1,161 @@
---
title: Binary Deployment
description: Use binary files to deploy a TiKV cluster on a single machine or on multiple nodes for testing.
menu:
"3.1-beta":
parent: Deploy
weight: 4
---
This guide describes how to deploy a TiKV cluster using binary files.
- To quickly understand and try TiKV, see [Deploy the TiKV cluster on a single machine](#deploy-the-tikv-cluster-on-a-single-machine).
- To try TiKV out and explore the features, see [Deploy the TiKV cluster on multiple nodes for testing](#deploy-the-tikv-cluster-on-multiple-nodes-for-testing).
{{< warning >}}
The TiKV team strongly recommends you use the [**Ansible Deployment**](../ansible/) method. It is the method our team uses when we assist with deployment.
Other methods are documented for informational purposes. We strongly recommend [consulting with our contributors](/chat) before depending on a cluster deployed without the Ansible scripts.
{{< /warning >}}
## Deploy the TiKV cluster on a single machine
This section describes how to deploy TiKV on a single machine installed with the Linux system. Take the following steps:
1. Download the official binary package.
```bash
# Download the package.
wget https://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
wget http://download.pingcap.org/tidb-latest-linux-amd64.sha256
# Check the file integrity. If the result is OK, the file is correct.
sha256sum -c tidb-latest-linux-amd64.sha256
# Extract the package.
tar -xzf tidb-latest-linux-amd64.tar.gz
cd tidb-latest-linux-amd64
```
2. Start PD.
```bash
./bin/pd-server --name=pd1 \
--data-dir=pd1 \
--client-urls="http://127.0.0.1:2379" \
--peer-urls="http://127.0.0.1:2380" \
--initial-cluster="pd1=http://127.0.0.1:2380" \
--log-file=pd1.log
```
3. Start TiKV.
To start the 3 TiKV instances, open a new terminal tab or window, go to the `tidb-latest-linux-amd64` directory, and start the instances using the following command:
```bash
./bin/tikv-server --pd-endpoints="127.0.0.1:2379" \
--addr="127.0.0.1:20160" \
--data-dir=tikv1 \
--log-file=tikv1.log
./bin/tikv-server --pd-endpoints="127.0.0.1:2379" \
--addr="127.0.0.1:20161" \
--data-dir=tikv2 \
--log-file=tikv2.log
./bin/tikv-server --pd-endpoints="127.0.0.1:2379" \
--addr="127.0.0.1:20162" \
--data-dir=tikv3 \
--log-file=tikv3.log
```
You can use the [pd-ctl](https://github.com/pingcap/pd/tree/master/tools/pd-ctl) tool to verify whether PD and TiKV are successfully deployed:
```
./bin/pd-ctl store -d -u http://127.0.0.1:2379
```
If the state of all the TiKV instances is "Up", you have successfully deployed a TiKV cluster.
## Deploy the TiKV cluster on multiple nodes for testing
This section describes how to deploy TiKV on multiple nodes. If you want to test TiKV with a limited number of nodes, you can use one PD instance to test the entire cluster.
Assume that you have four nodes. You can deploy 1 PD instance and 3 TiKV instances. For details, see the following table:
| Name | Host IP | Services |
| :-- | :-- | :------------------- |
| Node1 | 192.168.199.113 | PD1 |
| Node2 | 192.168.199.114 | TiKV1 |
| Node3 | 192.168.199.115 | TiKV2 |
| Node4 | 192.168.199.116 | TiKV3 |
To deploy a TiKV cluster with multiple nodes for testing, take the following steps:
1. Download the official binary package on each node.
```bash
# Download the package.
wget https://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
wget http://download.pingcap.org/tidb-latest-linux-amd64.sha256
# Check the file integrity. If the result is OK, the file is correct.
sha256sum -c tidb-latest-linux-amd64.sha256
# Extract the package.
tar -xzf tidb-latest-linux-amd64.tar.gz
cd tidb-latest-linux-amd64
```
2. Start PD on Node1.
```bash
./bin/pd-server --name=pd1 \
--data-dir=pd1 \
--client-urls="http://192.168.199.113:2379" \
--peer-urls="http://192.168.199.113:2380" \
--initial-cluster="pd1=http://192.168.199.113:2380" \
--log-file=pd1.log
```
3. Log in and start TiKV on other nodes: Node2, Node3 and Node4.
Node2:
```bash
./bin/tikv-server --pd-endpoints="192.168.199.113:2379" \
--addr="192.168.199.114:20160" \
--data-dir=tikv1 \
--log-file=tikv1.log
```
Node3:
```bash
./bin/tikv-server --pd-endpoints="192.168.199.113:2379" \
--addr="192.168.199.115:20160" \
--data-dir=tikv2 \
--log-file=tikv2.log
```
Node4:
```bash
./bin/tikv-server --pd-endpoints="192.168.199.113:2379" \
--addr="192.168.199.116:20160" \
--data-dir=tikv3 \
--log-file=tikv3.log
```
You can use the [pd-ctl](https://github.com/pingcap/pd/tree/master/tools/pd-ctl) tool to verify whether PD and TiKV are successfully deployed:
```
./pd-ctl store -d -u http://192.168.199.113:2379
```
The result displays the store count and detailed information regarding each store. If the state of all the TiKV instances is "Up", you have successfully deployed a TiKV cluster.
## What's next?
If you want to try the Go client, see [Try Two Types of APIs](../../reference/clients/go/).

View File

@ -0,0 +1,22 @@
---
title: Docker Compose/Swarm
description: Use Docker Compose or Swarm to quickly deploy a TiKV testing cluster.
menu:
"3.1-beta":
parent: Deploy
weight: 5
---
This guide describes how to deploy a single-node TiKV cluster using Docker Compose, or a multi-node TiKV cluster using Docker Swarm.
{{< warning >}}
The TiKV team strongly recommends you use the [**Ansible Deployment**](../ansible/) method. It is the method our team uses when we assist with deployment.
Other methods are documented for informational purposes. We strongly recommend [consulting with our contributors](/chat) before depending on a cluster deployed without the Ansible scripts.
{{< /warning >}}
PingCAP (the original authors of TiKV, and a maintaining organization of TiKV) develops and maintains the Apache 2 licensed [TiDB docker-compose](https://github.com/pingcap/tidb-docker-compose). TiDB docker-compose is a collection of Docker Compose files which enable you to quickly 'test drive' TiKV as well as TiDB, TiSpark, and the monitoring tools.
TiDB docker-compose is compatible with Linux as well as Mac and Windows through [Docker Desktop](https://www.docker.com/products/docker-desktop).
We recommend the [**Docker Swarm**](https://github.com/pingcap/tidb-docker-compose#docker-swarm) option for using TiKV. If your Docker service is not able to support Swarm, the normal [Docker Compose](https://github.com/pingcap/tidb-docker-compose#quick-start) option requires some [manual configuration to interact with TiKV directly](https://github.com/pingcap/tidb-docker-compose#host-network-mode-linux) on non-Linux systems.

View File

@ -0,0 +1,162 @@
---
title: Docker Deployment
description: Use Docker to deploy a TiKV cluster on multiple nodes.
menu:
"3.1-beta":
parent: Deploy
weight: 3
---
This guide describes how to deploy a multi-node TiKV cluster using Docker.
{{< warning >}}
The TiKV team strongly recommends you use the [**Ansible Deployment**](../ansible/) method. It is the method our team uses when we assist with deployment.
Other methods are documented for informational purposes. We strongly recommend [consulting with our contributors](/chat) before depending on a cluster deployed without the Ansible scripts.
{{< /warning >}}
## Prerequisites
Make sure that Docker is installed on each machine.
For more details about prerequisites, see [Hardware and Software Requirements](../introduction).
## Deploy the TiKV cluster on multiple nodes
Assume that you have 6 machines with the following details:
| Name | Host IP | Services | Data Path |
| --------- | ------------- | ---------- | --------- |
| Node1 | 192.168.1.101 | PD1 | /data |
| Node2 | 192.168.1.102 | PD2 | /data |
| Node3 | 192.168.1.103 | PD3 | /data |
| Node4 | 192.168.1.104 | TiKV1 | /data |
| Node5 | 192.168.1.105 | TiKV2 | /data |
| Node6 | 192.168.1.106 | TiKV3 | /data |
If you want to test TiKV with a limited number of nodes, you can also use one PD instance to test the entire cluster.
### Step 1: Pull the latest images of TiKV and PD from Docker Hub
Start Docker and pull the latest images of TiKV and PD from [Docker Hub](https://hub.docker.com) using the following command:
```bash
docker pull pingcap/tikv:latest
docker pull pingcap/pd:latest
```
### Step 2: Log in and start PD
Log in to the three PD machines and start PD respectively:
**Start PD1 on Node1:**
```bash
docker run -d --name pd1 \
-p 2379:2379 \
-p 2380:2380 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/pd:latest \
--name="pd1" \
--data-dir="/data/pd1" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://192.168.1.101:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://192.168.1.101:2380" \
--initial-cluster="pd1=http://192.168.1.101:2380,pd2=http://192.168.1.102:2380,pd3=http://192.168.1.103:2380"
```
**Start PD2 on Node2:**
```bash
docker run -d --name pd2 \
-p 2379:2379 \
-p 2380:2380 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/pd:latest \
--name="pd2" \
--data-dir="/data/pd2" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://192.168.1.102:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://192.168.1.102:2380" \
--initial-cluster="pd1=http://192.168.1.101:2380,pd2=http://192.168.1.102:2380,pd3=http://192.168.1.103:2380"
```
**Start PD3 on Node3:**
```bash
docker run -d --name pd3 \
-p 2379:2379 \
-p 2380:2380 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/pd:latest \
--name="pd3" \
--data-dir="/data/pd3" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://192.168.1.103:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://192.168.1.103:2380" \
--initial-cluster="pd1=http://192.168.1.101:2380,pd2=http://192.168.1.102:2380,pd3=http://192.168.1.103:2380"
```
### Step 3: Log in and start TiKV
Log in to the three TiKV machines and start TiKV respectively:
**Start TiKV1 on Node4:**
```bash
docker run -d --name tikv1 \
-p 20160:20160 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/tikv:latest \
--addr="0.0.0.0:20160" \
--advertise-addr="192.168.1.104:20160" \
--data-dir="/data/tikv1" \
--pd="192.168.1.101:2379,192.168.1.102:2379,192.168.1.103:2379"
```
**Start TiKV2 on Node5:**
```bash
docker run -d --name tikv2 \
-p 20160:20160 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/tikv:latest \
--addr="0.0.0.0:20160" \
--advertise-addr="192.168.1.105:20160" \
--data-dir="/data/tikv2" \
--pd="192.168.1.101:2379,192.168.1.102:2379,192.168.1.103:2379"
```
**Start TiKV3 on Node6:**
```bash
docker run -d --name tikv3 \
-p 20160:20160 \
-v /etc/localtime:/etc/localtime:ro \
-v /data:/data \
pingcap/tikv:latest \
--addr="0.0.0.0:20160" \
--advertise-addr="192.168.1.106:20160" \
--data-dir="/data/tikv3" \
--pd="192.168.1.101:2379,192.168.1.102:2379,192.168.1.103:2379"
```
You can check whether the TiKV cluster has been successfully deployed using the following command:
```bash
curl 192.168.1.101:2379/pd/api/v1/stores
```
If the state of all the TiKV instances is "Up", you have successfully deployed a TiKV cluster.
## What's next?
If you want to try the Go client, see [Try Two Types of APIs](../../reference/clients/go/).

View File

@ -0,0 +1,103 @@
---
title: Deploy
description: Prerequisites for deploying TiKV
menu:
"3.1-beta":
parent: Tasks
weight: 2
name: Deploy
---
Typical deployments of TiKV include a number of components:
* 3+ TiKV nodes
* 3+ Placement Driver (PD) nodes
* 1 Monitoring node
* 1 or more client applications or query layers (like [TiDB](https://github.com/pingcap/tidb))
{{< info >}}
TiKV is deployed alongside a [Placement Driver](https://github.com/pingcap/pd/) (PD) cluster. PD is the cluster manager of TiKV, which periodically checks replication constraints to balance load and data automatically.
{{< /info >}}
Your **first steps** into TiKV require only the following:
* A modest machine that fulfills the [system requirements](#system-requirements).
* A running [Docker](https://docker.com) service.
After you set up the environment, follow through the [Try](../try) guide to get a test setup of TiKV running on your machine.
**Production** usage is typically done via automation requiring:
* A control machine (it can be one of your target servers) with [Ansible](https://www.ansible.com/) installed.
* Several (6+) machines that fulfill the [system requirements](#system-requirements) and, ideally, the [production specifications](#production-specifications).
* The ability to configure your infrastructure to allow the ports from [network requirements](#network-requirements).
If you have your production environment ready, follow through the [Ansible deployment](../ansible) guide. You may optionally choose unsupported manual [Docker deployment](../docker) or [binary deployment](../binary) strategies.
Finally, if you want to **build your own binary** TiKV you should consult the [README](https://github.com/tikv/tikv/blob/master/README.md) of the repository.
## System requirements
The **minimum** specifications for testing or developing TiKV or PD are:
* 2+ core
* 8+ GB RAM
* An SSD
TiKV hosts must support the x86-64 architecture and the SSE 4.2 instruction set.
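On Linux, you can verify SSE 4.2 support by, for example, searching the CPU flags:
```bash
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 NOT supported"
```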
TiKV works well in VMware, KVM, and Xen virtual machines.
## Production Specifications
The **suggested PD** specifications for production are:
* 3+ nodes
* 4+ cores
* 8+ GB RAM, with no swap space.
* 200+ GB Optane, NVMe, or SSD drive
* 10 Gigabit ethernet (2x preferred)
* A Linux Operating System. PD is most widely tested on CentOS 7.
The **suggested TiKV** specifications for production are:
* 3+ nodes
* 16+ cores
* 32+ GB RAM, with no swap space.
* 200+ GB Optane, NVMe, or SSD drive (Under 1.5 TB capacity is ideal in our tests)
* 10 Gigabit ethernet (2x preferred)
* A Linux Operating System. TiKV is most widely tested on CentOS 7.
## Network requirements
TiKV deployments require **total connectivity of all services**. Every TiKV node, PD node, and client must be able to reach all of the others and advertise the addresses of all other services to new services. This connectivity allows TiKV and PD to replicate and balance data resiliently across the entire deployment.
If the hosts are not already able to reach each other, it is possible to accomplish this through a Virtual Local Area Network (VLAN). Speak to your system administrator to explore your options.
TiKV requires the following network port configuration to run. Based on your actual deployment environment, the administrator can open the relevant ports on the network side and the host side.
| Component | Default Port | Protocol | Description |
| :--:| :--: | :--: | :-- |
| TiKV | 20160 | gRPC | Client (such as Query Layers) port. |
| TiKV | 20180 | Text | Status port, Prometheus metrics at `/metrics`. |
| PD | 2379 | gRPC | The client port, for communication with clients. |
| PD | 2380 | gRPC | The server port, for communication with TiKV. |
{{< info >}}
If you are deploying tools alongside TiKV you may need to open or configure other ports. For example, port 3000 for the Grafana service.
{{< /info >}}
You can ensure your configuration is correct by creating echo servers on the relevant ports/IPs using `ncat` (from the `nmap` package):
```bash
ncat -l $PORT -k -c 'xargs -n1 echo'
```
Then from the other machines, verify that the echo server is reachable with `curl $IP:$PORT`.
## Optional: Configure Monitoring
TiKV can work with Prometheus and Grafana to provide a rich visual monitoring dashboard. This comes preconfigured if you use the [Ansible](../ansible) or [Docker Compose](../docker-compose) deployment methods.
We strongly recommend using an up-to-date version of Mozilla Firefox or Google Chrome when accessing Grafana.

View File

@ -0,0 +1,48 @@
---
title: Tasks
description: How to accomplish common tasks with TiKV
menu:
nav:
parent: Docs
weight: 2
"3.1-beta":
weight: 2
---
Learn to try, deploy, configure, monitor, and scale TiKV as you adopt the service into your project and infrastructure.
## [Try](../try/)
It's not always desirable to deploy a full production cluster of TiKV. If you just want to take TiKV for a spin, or get familiar with how it works, you may find yourself wanting to run TiKV locally.
In the [**try**](../try/) section you'll find out how to get started. [**Try TiKV**](../try/) teaches you how to get a copy of TiKV running on your machine with Docker. Then you'll connect and talk to the TiKV cluster using our [Rust client](../../reference/clients/rust).
## [Deploy](../deploy/introduction)
In the [**deploy**](../deploy/introduction) section you'll find several guides to help you deploy & integrate TiKV into your infrastructure.
Currently the best supported and most comprehensive deployment solution is to [**Deploy TiKV using Ansible**](../deploy/ansible/). In this guide you'll learn to deploy and maintain TiKV using the same scripts PingCAP uses to deploy TiKV for many of our [adopters](/adopters).
If you're determined to strike out on your own, we've done our best to provide you with the tools you need to build your own solution. Start by choosing between the [**Docker**](../deploy/docker) and [**Binary**](../deploy/binary) options.
## [Configure](../configure/introduction)
Learn about how you can configure TiKV to meet your needs in the [**configure**](../configure/introduction) section. There you'll find a number of guides including:
* [**Security**](../configure/security): Use TLS security and review security procedures.
* [**Topology**](../configure/topology): Use location awareness to improve resiliency and performance.
* [**Namespace**](../configure/namespace): Use namespacing to configure resource isolation.
* [**Limit**](../configure/limit): Tweak rate limiting.
* [**Region Merge**](../configure/region-merge): Tweak region merging.
* [**RocksDB**](../configure/rocksdb): Tweak RocksDB configuration options.
* [**Titan**](../configure/titan): Enable Titan to improve performance with large values.
## [Monitor](../monitor/introduction)
Learn how to inspect a TiKV cluster in the [**Monitor**](../monitor/introduction) section. You'll find out how to [**check the component state interface or collect Prometheus metrics**](../monitor/tikv-cluster/), as well as review the [**key metrics**](../monitor/key-metrics/) to be aware of.
## [Scale](../scale/introduction)
As your dataset and workload change you'll eventually need to scale TiKV to meet these new demands. In the [**Scale**](../scale/introduction) section you'll find out how to grow and shrink your TiKV cluster.
If you deployed using Ansible, please check the [**Ansible Scaling**](../scale/ansible) guide.

View File

@ -0,0 +1,28 @@
---
title: Monitor
description: Monitor TiKV
menu:
"3.1-beta":
parent: Tasks
weight: 4
---
The TiKV monitoring framework adopts two open-source projects: [Prometheus](https://github.com/prometheus/prometheus) and [Grafana](https://github.com/grafana/grafana). TiKV uses Prometheus to store the monitoring and performance metrics, and uses Grafana to visualize these metrics.
## About Prometheus in TiKV
As a time series database, Prometheus has a multi-dimensional data model and flexible query language. Prometheus consists of multiple components. Currently, TiKV uses the following components:
- Prometheus Server: to scrape and store time series data
- Client libraries: to customize necessary metrics in the application
- Pushgateway: to receive the data pushed by clients for the Prometheus main server
- AlertManager: for the alerting mechanism
{{< figure
src="/img/docs/prometheus-in-tikv.png"
caption="Prometheus in TiKV"
number="" >}}
## About Grafana in TiKV
[Grafana](https://github.com/grafana/grafana) is an open-source project for analyzing and visualizing metrics. TiKV uses Grafana to display the performance metrics.

View File

@ -0,0 +1,58 @@
---
title: Key Metrics
description: Learn some key metrics displayed on the Grafana Overview dashboard.
menu:
"3.1-beta":
parent: Monitor
weight: 2
---
If your TiKV cluster is deployed using Ansible or Docker Compose, the monitoring system is deployed at the same time. For more details, see [Overview of the TiKV Monitoring Framework](../../how-to/monitor/introduction/).
The Grafana dashboard is divided into a series of sub-dashboards which include Overview, PD, TiKV, and so on. You can use various metrics to help you diagnose the cluster.
For routine operations, you can get an overview of the component (PD, TiKV) status and the entire cluster from the Overview dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.
## Key metrics description
To understand the key metrics displayed on the Overview dashboard, check the following table:
Service | Panel Name | Description | Normal Range
---- | ---------------- | ---------------------------------- | --------------
Services Port Status | Services Online | the number of online nodes for each service |
Services Port Status | Services Offline | the number of offline nodes for each service |
PD | Storage Capacity | the total storage capacity of the TiKV cluster |
PD | Current Storage Size | the occupied storage capacity of the TiKV cluster |
PD | Number of Regions | the total number of Regions of the current cluster |
PD | Leader Balance Ratio | the difference in leader ratio between the node with the highest leader ratio and the node with the lowest | It is less than 5% in a balanced situation and becomes bigger when you restart a node.
PD | Region Balance Ratio | the difference in Region ratio between the node with the highest Region ratio and the node with the lowest | It is less than 5% in a balanced situation and becomes bigger when you add or remove a node.
PD | Store Status -- Up Stores | the number of TiKV nodes that are up |
PD | Store Status -- Disconnect Stores | the number of TiKV nodes that encounter abnormal communication within a short time |
PD | Store Status -- LowSpace Stores | the number of TiKV nodes with an available space of less than 80% |
PD | Store Status -- Down Stores | the number of TiKV nodes that are down | The normal value is `0`. If the number is bigger than `0`, it means some node(s) are abnormal.
PD | Store Status -- Offline Stores | the number of TiKV nodes (still providing service) that are being made offline |
PD | Store Status -- Tombstone Stores | the number of TiKV nodes that are successfully offline |
PD | 99% completed_cmds_duration_seconds | the 99th percentile duration to complete a pd-server request | less than 5ms
PD | handle_requests_duration_seconds | the request duration of a PD request |
TiKV | leader | the number of leaders on each TiKV node |
TiKV | region | the number of Regions on each TiKV node |
TiKV | CPU | the CPU usage ratio on each TiKV node |
TiKV | Memory | the memory usage on each TiKV node |
TiKV | store size | the data amount on each TiKV node |
TiKV | cf size | the data amount on different CFs in the cluster |
TiKV | channel full | `No data points` is displayed in normal conditions. If a value is displayed, it means the corresponding TiKV node fails to handle the messages in time |
TiKV | server report failures | `No data points` is displayed in normal conditions. If `Unreachable` is displayed, it means TiKV encounters a communication issue. |
TiKV | scheduler pending commands | the number of commands waiting in the scheduler queue | Occasional value peaks are normal.
TiKV | coprocessor pending requests | the number of requests waiting in the coprocessor queue | `0` or very small
TiKV | coprocessor executor count | the number of various query operations |
TiKV | coprocessor request duration | the time consumed by TiKV queries |
TiKV | raft store CPU | the CPU usage ratio of the raftstore thread | Currently, it is a single thread. A value of over 80% indicates that the CPU usage ratio is very high.
TiKV | Coprocessor CPU | the CPU usage ratio of the TiKV query thread, related to the application; complex queries consume a great deal of CPU |
System Info | Vcores | the number of CPU cores |
System Info | Memory | the total memory |
System Info | CPU Usage | the CPU usage ratio, 100% at a maximum |
System Info | Load [1m] | the load average over the last 1 minute |
System Info | Memory Available | the size of the available memory |
System Info | Network Traffic | the statistics of the network traffic |
System Info | TCP Retrans | the statistics about network monitoring and TCP |
System Info | IO Util | the disk usage ratio, 100% at a maximum; generally you need to consider adding a new node when the usage ratio reaches 80% to 90% |

View File

@ -0,0 +1,239 @@
---
title: Monitoring a Cluster
description: Learn how to monitor the state of a TiKV cluster.
menu:
"3.1-beta":
parent: Monitor
weight: 1
---
Currently, you can use two types of interfaces to monitor the state of the TiKV cluster:
- [The component state interface](#the-component-state-interface): use the HTTP interface to get the internal information of a component, which is called the component state interface.
- [The metrics interface](#the-metrics-interface): use the Prometheus interface to record the detailed information of various operations in the components, which is called the metrics interface.
## The component state interface
You can use this type of interface to monitor the basic information of components. The interface exposes details of the entire TiKV cluster and can act as a keepalive check.
### The PD server
The API address of the Placement Driver (PD) is `http://${host}:${port}/pd/api/v1/${api_name}`.
The default port number is 2379.
For detailed information about various API names, see [PD API doc](https://download.pingcap.com/pd-api-v1.html).
You can use the interface to get the state of all the TiKV instances and the information about load balancing. It is the most important and frequently-used interface to get the state information of all the TiKV nodes. See the following example for the information about a 3-instance TiKV cluster deployed on a single machine:
```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
{
"count": 3,
"stores": [
{
"store": {
"id": 1,
"address": "127.0.0.1:20161",
"version": "2.1.0-rc.2",
"state_name": "Up"
},
"status": {
"capacity": "937 GiB",
"available": "837 GiB",
"leader_weight": 1,
"region_count": 1,
"region_weight": 1,
"region_score": 1,
"region_size": 1,
"start_ts": "2018-09-29T00:05:47Z",
"last_heartbeat_ts": "2018-09-29T00:23:46.227350716Z",
"uptime": "17m59.227350716s"
}
},
{
"store": {
"id": 2,
"address": "127.0.0.1:20162",
"version": "2.1.0-rc.2",
"state_name": "Up"
},
"status": {
"capacity": "937 GiB",
"available": "837 GiB",
"leader_weight": 1,
"region_count": 1,
"region_weight": 1,
"region_score": 1,
"region_size": 1,
"start_ts": "2018-09-29T00:05:47Z",
"last_heartbeat_ts": "2018-09-29T00:23:45.65292648Z",
"uptime": "17m58.65292648s"
}
},
{
"store": {
"id": 7,
"address": "127.0.0.1:20160",
"version": "2.1.0-rc.2",
"state_name": "Up"
},
"status": {
"capacity": "937 GiB",
"available": "837 GiB",
"leader_count": 1,
"leader_weight": 1,
"leader_score": 1,
"leader_size": 1,
"region_count": 1,
"region_weight": 1,
"region_score": 1,
"region_size": 1,
"start_ts": "2018-09-29T00:05:47Z",
"last_heartbeat_ts": "2018-09-29T00:23:44.853636067Z",
"uptime": "17m57.853636067s"
}
}
]
}
```
## The metrics interface
You can use this type of interface to monitor the state and performance of the entire cluster. The metrics data is displayed in Prometheus and Grafana. See [Use Prometheus and Grafana](#use-prometheus-and-grafana) for how to set up the monitoring system.
You can get the following metrics for each component:
### The PD server
- the total number of times that each command executes
- the total number of times that a certain command fails
- the duration of commands that succeed
- the duration of commands that fail
- the duration for a command to finish and return a result
### The TiKV server
- Garbage Collection (GC) monitoring
- the total number of times that TiKV commands execute
- the duration of Scheduler command execution
- the total number of Raft propose commands
- the duration of Raft command execution
- the total number of times that Raft commands fail
- the total number of times that Raft processes the ready state
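TiKV exposes these metrics in the Prometheus text format. For a quick look, you can fetch them directly from a TiKV node's status port (20180 by default; the address below assumes the multi-node example used elsewhere in these docs):
```bash
curl -s http://192.168.199.116:20180/metrics | head
```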
## Use Prometheus and Grafana
This section introduces the deployment architecture of Prometheus and Grafana in TiKV, and how to set up and configure the monitoring system.
### The deployment architecture
See the following diagram for the deployment architecture:
{{< figure
src="/img/docs/monitor-architecture.png"
caption="Monitor architecture"
number="1" >}}
{{< info >}}
You must add the Prometheus Pushgateway addresses to the startup parameters of the PD and TiKV components.
{{< /info >}}
### Set up the monitoring system
See the following links for your reference:
- [Prometheus Pushgateway](https://github.com/prometheus/pushgateway)
- [Prometheus Server](https://github.com/prometheus/prometheus#install)
- [Grafana](http://docs.grafana.org/)
## Manual configuration
This section describes how to manually configure PD and TiKV, PushServer, Prometheus, and Grafana.
> **Note:** If your TiKV cluster is deployed using Ansible or Docker Compose, the configuration is automatically done, and generally, you do not need to configure it manually again. If your TiKV cluster is deployed using Docker, you can follow the configuration steps below.
### Configure PD and TiKV
+ PD: update the `toml` configuration file with the Pushgateway address and the push frequency:
```toml
[metric]
# prometheus client push interval, set "0s" to disable prometheus.
interval = "15s"
# prometheus pushgateway address; leaving it empty disables prometheus.
address = "host:port"
```
+ TiKV: update the `toml` configuration file with the Pushgateway address and the push frequency. Set the `job` field to `"tikv"`.
```toml
[metric]
# the Prometheus client push interval. Setting the value to 0s stops Prometheus client from pushing.
interval = "15s"
# the Prometheus pushgateway address. Leaving it empty stops Prometheus client from pushing.
address = "host:port"
# the Prometheus client push job name. Note: A node id is automatically appended, e.g., "tikv_1".
job = "tikv"
```
### Configure PushServer
Generally, you can use the default port `9091` and do not need to configure PushServer.
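If you do need to stand a Pushgateway up yourself, a minimal sketch is to run the official Docker image, which listens on `9091` by default (the container name here is arbitrary):
```bash
# Run Prometheus Pushgateway on its default port 9091:
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway
```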
### Configure Prometheus
Add the Pushgateway address to the `yaml` configuration file:
```yaml
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
# scrape_timeout is set to the global default value (10s).
external_labels:
cluster: 'test-cluster'
monitor: "prometheus"
scrape_configs:
- job_name: 'pd'
honor_labels: true # Do not overwrite job & instance labels.
static_configs:
- targets:
- '192.168.199.113:2379'
- '192.168.199.114:2379'
- '192.168.199.115:2379'
- job_name: 'tikv'
honor_labels: true # Do not overwrite job & instance labels.
static_configs:
- targets:
- '192.168.199.116:20180'
- '192.168.199.117:20180'
- '192.168.199.118:20180'
```
### Configure Grafana
#### Create a Prometheus data source
1. Log in to the Grafana Web interface.
- Default address: [http://localhost:3000](http://localhost:3000)
- Default account name: `admin`
- Default password: `admin`
2. Click the Grafana logo to open the sidebar menu.
3. Click "Data Sources" in the sidebar.
4. Click "Add data source".
5. Specify the data source information:
- Specify the name for the data source.
- For Type, select Prometheus.
- For Url, specify the Prometheus address.
- Specify other fields as needed.
6. Click "Add" to save the new data source.
#### Create a Grafana dashboard
1. Click the Grafana logo to open the sidebar menu.
2. On the sidebar menu, click "Dashboards" -> "Import" to open the "Import Dashboard" window.
3. Click "Upload .json File" to upload a JSON file (Examples can be found in [`tidb-ansible`](https://github.com/pingcap/tidb-ansible/tree/master/scripts)).
4. Click "Save & Open".
5. A Prometheus dashboard is created.

View File

@ -0,0 +1,374 @@
---
title: Ansible Scaling
description: Use TiDB-Ansible to scale out or scale in a TiKV cluster.
menu:
"3.1-beta":
parent: Scale
---
This document describes how to use TiDB-Ansible to scale out or scale in a TiKV cluster without affecting the online services.
> **Note:** This document applies to the TiKV deployment using Ansible. If your TiKV cluster is deployed in other ways, see [Scale a TiKV Cluster](../introduction).
Assume that the topology is as follows:
| Name | Host IP | Services |
| ---- | ------- | -------- |
| node1 | 172.16.10.1 | PD1, Monitor |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| node6 | 172.16.10.6 | TiKV3 |
## Scale out a TiKV cluster
This section describes how to increase the capacity of a TiKV cluster by adding a TiKV or PD node.
### Add TiKV nodes
For example, if you want to add two TiKV nodes (node101, node102) with the IP addresses `172.16.10.101` and `172.16.10.102`, take the following steps:
**Edit the `inventory.ini` file** and append the TiKV node information in `tikv_servers`:
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3
[tikv_servers]
172.16.10.4
172.16.10.5
172.16.10.6
172.16.10.101
172.16.10.102
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.4
172.16.10.5
172.16.10.6
172.16.10.101
172.16.10.102
```
Now the topology is as follows:
| Name | Host IP | Services |
| ---- | ------- | -------- |
| node1 | 172.16.10.1 | PD1, Monitor |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| node6 | 172.16.10.6 | TiKV3 |
| **node101** | **172.16.10.101** | **TiKV4** |
| **node102** | **172.16.10.102** | **TiKV5** |
**Initialize the newly added node:**
```bash
ansible-playbook bootstrap.yml -l 172.16.10.101,172.16.10.102
```
> **Note:** If an alias is configured in the `inventory.ini` file, for example, `node101 ansible_host=172.16.10.101`, use `-l` to specify the alias when executing `ansible-playbook`. For example, `ansible-playbook bootstrap.yml -l node101,node102`. This also applies to the following steps.
**Deploy the newly added node:**
```bash
ansible-playbook deploy.yml -l 172.16.10.101,172.16.10.102
```
**Start the newly added node:**
```bash
ansible-playbook start.yml -l 172.16.10.101,172.16.10.102
```
**Update the Prometheus configuration and restart:**
```bash
ansible-playbook rolling_update_monitor.yml --tags=prometheus
```
Monitor the status of the entire cluster and the newly added nodes by opening a browser to access the monitoring platform: `http://172.16.10.1:3000`.
### Add a PD node
To add a PD node (node103) with the IP address `172.16.10.103`, take the following steps:
**Edit the `inventory.ini` file** and append the PD node information in `pd_servers`:
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.103
[tikv_servers]
172.16.10.4
172.16.10.5
172.16.10.6
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.103
172.16.10.4
172.16.10.5
172.16.10.6
```
Now the topology is as follows:
| Name | Host IP | Services |
| ---- | ------- | -------- |
| node1 | 172.16.10.1 | PD1, Monitor |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| **node103** | **172.16.10.103** | **PD4** |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| node6 | 172.16.10.6 | TiKV3 |
**Initialize the newly added node:**
```bash
ansible-playbook bootstrap.yml -l 172.16.10.103
```
**Deploy the newly added node:**
```bash
ansible-playbook deploy.yml -l 172.16.10.103
```
**Log in to the newly added PD node and edit the startup script:**
```bash
{deploy_dir}/scripts/run_pd.sh
```
* Remove the `--initial-cluster="xxxx" \` configuration.
* Add `--join="http://172.16.10.1:2379" \`. The IP address (`172.16.10.1`) can be any of the existing PD IP addresses in the cluster.
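After these edits, the `exec` line in `run_pd.sh` might look like the following sketch (the name `pd4` and the exact flag set are assumptions; your script will contain additional flags such as data and log directories):
```bash
exec bin/pd-server --name="pd4" \
    --client-urls="http://172.16.10.103:2379" \
    --peer-urls="http://172.16.10.103:2380" \
    --join="http://172.16.10.1:2379"
```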
* Manually start the PD service in the newly added PD node:
```bash
{deploy_dir}/scripts/start_pd.sh
```
* Use `pd-ctl` to check whether the new node has been added successfully:
```bash
./pd-ctl -u "http://172.16.10.1:2379"
>> member
```
> **Note:** The `member` command of `pd-ctl` lists the current PD members in the cluster.
* Apply a rolling update to the entire cluster:
```bash
ansible-playbook rolling_update.yml
```
* Update the Prometheus configuration and restart:
```bash
ansible-playbook rolling_update_monitor.yml --tags=prometheus
```
* Monitor the status of the entire cluster and the newly added node by opening a browser to access the monitoring platform: `http://172.16.10.1:3000`.
## Scale in a TiKV cluster
This section describes how to decrease the capacity of a TiKV cluster by removing a TiKV or PD node.
> **Warning:** When decreasing the capacity, do not perform the following procedures if the nodes to be removed have a mixed deployment of other services. The following examples assume that the removed nodes have no mixed deployment of other services.
### Remove a TiKV node
To remove a TiKV node (node6) with the IP address `172.16.10.6`, take the following steps:
**Remove the node from the cluster using `pd-ctl`:**
View the store ID of node6:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d store
```
**Remove node6** from the cluster, assuming that the store ID is 10:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d store delete 10
```
Use Grafana or `pd-ctl` to check whether the node is successfully removed:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d store 10
```
> **Note:** It takes some time to remove the node. If the status of the node you remove becomes Tombstone, then this node is successfully removed.
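Once the data migration finishes, the output of the `store` command might look like the following sketch (most fields omitted):
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d store 10
{
  "store": {
    "id": 10,
    ...
    "state_name": "Tombstone"
  },
  ...
}
```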
After the node is successfully removed, **stop the services on node6**:
```bash
ansible-playbook stop.yml -l 172.16.10.6
```
**Edit the `inventory.ini` file** and remove the node information:
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3
[tikv_servers]
172.16.10.4
172.16.10.5
#172.16.10.6 # the removed node
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
172.16.10.2
172.16.10.3
172.16.10.4
172.16.10.5
#172.16.10.6 # the removed node
```
Now the topology is as follows:
| Name | Host IP | Services |
| ---- | ------- | -------- |
| node1 | 172.16.10.1 | PD1, Monitor |
| node2 | 172.16.10.2 | PD2 |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| **node6** | **172.16.10.6** | **TiKV3 removed** |
**Update the Prometheus configuration and restart:**
```bash
ansible-playbook rolling_update_monitor.yml --tags=prometheus
```
Monitor the status of the entire cluster by opening a browser to access the monitoring platform: `http://172.16.10.1:3000`.
### Remove a PD node
To remove a PD node (node2) with the IP address `172.16.10.2`, take the following steps:
**Remove the node from the cluster using `pd-ctl`:**
View the name of node2:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d member
```
**Remove node2** from the cluster, assuming that the name is pd2:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d member delete name pd2
```
Use Grafana or `pd-ctl` to check whether the node is successfully removed:
```bash
./pd-ctl -u "http://172.16.10.1:2379" -d member
```
After the node is successfully removed, **stop the services on node2**:
```bash
ansible-playbook stop.yml -l 172.16.10.2
```
**Edit the `inventory.ini` file** and remove the node information:
```ini
[tidb_servers]
[pd_servers]
172.16.10.1
#172.16.10.2 # the removed node
172.16.10.3
[tikv_servers]
172.16.10.4
172.16.10.5
172.16.10.6
[monitoring_servers]
172.16.10.1
[grafana_servers]
172.16.10.1
[monitored_servers]
172.16.10.1
#172.16.10.2 # the removed node
172.16.10.3
172.16.10.4
172.16.10.5
172.16.10.6
```
Now the topology is as follows:
| Name | Host IP | Services |
| ---- | ------- | -------- |
| node1 | 172.16.10.1 | PD1, Monitor |
| **node2** | **172.16.10.2** | **PD2 removed** |
| node3 | 172.16.10.3 | PD3 |
| node4 | 172.16.10.4 | TiKV1 |
| node5 | 172.16.10.5 | TiKV2 |
| node6 | 172.16.10.6 | TiKV3 |
**Perform a rolling update** to the entire TiKV cluster:
```bash
ansible-playbook rolling_update.yml
```
**Update the Prometheus configuration and restart**:
```bash
ansible-playbook rolling_update_monitor.yml --tags=prometheus
```
To monitor the status of the entire cluster, open a browser to access the monitoring platform: `http://172.16.10.1:3000`.

View File

@ -0,0 +1,125 @@
---
title: Scale
description: Scale TiKV
menu:
"3.1-beta":
parent: Tasks
weight: 5
---
You can scale out a TiKV cluster by adding nodes to increase the capacity without affecting online services. You can also scale in a TiKV cluster by deleting nodes to decrease the capacity without affecting online services.
> **Note:** If your TiKV cluster is deployed using Ansible, see [Scale the TiKV Cluster Using TiDB-Ansible](../ansible).
## Scale out or scale in PD
Before increasing or decreasing the capacity of PD, you can view details of the current PD cluster. Assume you have three PD servers with the following details:
| Name | ClientUrls | PeerUrls |
|:-----|:------------------|:------------------|
| pd1 | http://host1:2379 | http://host1:2380 |
| pd2 | http://host2:2379 | http://host2:2380 |
| pd3 | http://host3:2379 | http://host3:2380 |
Get the information about the existing PD nodes through `pd-ctl`:
```bash
./pd-ctl -u http://host1:2379
>> member
```
For the usage of `pd-ctl`, see [PD Control User Guide](../../reference/tools/pd-ctl/).
### Add a PD node dynamically
You can add a new PD node to the current PD cluster using the `join` parameter. To add `pd4`, use `--join` to specify the client URL of any PD server in the PD cluster, like:
```bash
./bin/pd-server --name=pd4 \
--client-urls="http://host4:2379" \
--peer-urls="http://host4:2380" \
--join="http://host1:2379"
```
### Remove a PD node dynamically
You can remove `pd4` using `pd-ctl`:
```bash
./pd-ctl -u http://host1:2379
>> member delete pd4
```
### Replace a PD node dynamically
You might want to replace a PD node in the following scenarios:
- You need to replace a faulty PD node with a healthy PD node.
- You need to replace a healthy PD node with a different PD node.
To replace a PD node, first add a new node to the cluster, migrate all the data from the node you want to remove, and then remove the node.
You can only replace one PD node at a time. If you want to replace multiple nodes, repeat the above steps until you have replaced all nodes. After completing each step, you can verify the process by checking the information of all nodes.
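For example, after each step you can verify the membership through PD's HTTP API (a sketch; it assumes `curl` is available and that `host1` is still a member of the cluster):
```bash
# List the current PD members as JSON.
curl http://host1:2379/pd/api/v1/members
```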
## Scale out or scale in TiKV
Before increasing or decreasing the capacity of TiKV, you can view details of the current TiKV cluster. Get the information about the existing TiKV nodes through `pd-ctl`:
```bash
./pd-ctl -u http://host1:2379
>> store
```
### Add a TiKV node dynamically
To add a new TiKV node dynamically, start a TiKV node on a new machine. The newly started TiKV node will automatically register in the existing PD of the cluster.
To reduce the pressure on the existing TiKV nodes, PD balances the load automatically, gradually migrating some data to the new TiKV node. A minimal sketch of starting such a node is shown below.
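In the sketch, the host names, port, and data directory are assumptions to adapt to your environment:
```bash
# Start a TiKV node on the new machine; it registers itself with PD on startup.
./tikv-server --addr="0.0.0.0:20160" \
    --advertise-addr="host4:20160" \
    --data-dir="/data/tikv" \
    --pd-endpoints="host1:2379"
```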
### Remove a TiKV node dynamically
To remove a TiKV node safely, you need to inform PD in advance. After that, PD can migrate the data on this TiKV node to other TiKV nodes, ensuring that the data has enough replicas.
For example, to remove the TiKV node with the store id 1, you can complete this using `pd-ctl`:
```bash
./pd-ctl -u http://host1:2379
>> store delete 1
```
Then you can check the state of this TiKV node:
```bash
./pd-ctl -u http://host1:2379
>> store 1
{
"store": {
"id": 1,
"address": "127.0.0.1:21060",
"state": 1,
"state_name": "Offline"
},
"status": {
...
}
}
```
You can verify the state of this store using `state_name`:
- `state_name=Up`: This store is in service.
- `state_name=Disconnected`: The heartbeats of this store cannot be detected currently, which might be caused by a failure or network interruption.
- `state_name=Down`: PD does not receive heartbeats from the TiKV store for more than an hour (the time can be configured using `max-down-time`). At this time, PD adds a replica for the data on this store.
- `state_name=Offline`: This store is shutting down, but the store is still in service.
- `state_name=Tombstone`: This store is shut down and has no data on it, so the instance can be removed.
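If you are scripting a removal, a small sketch like the following can poll the store until it reaches `Tombstone` (it assumes the store ID is 1 and that the JSON layout matches the example above):
```bash
# Wait until the removed store becomes Tombstone.
while ! ./pd-ctl -u http://host1:2379 -d store 1 | grep -q '"state_name": "Tombstone"'; do
    sleep 10
done
echo "store 1 is now Tombstone"
```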
### Replace a TiKV node dynamically
You might want to replace a TiKV node in the following scenarios:
- You need to replace a faulty TiKV node with a healthy TiKV node.
- You need to replace a healthy TiKV node with a different TiKV node.
To replace a TiKV node, first add a new node to the cluster, migrate all the data from the node you want to remove, and then remove the node.
You can only replace one TiKV node at a time. To verify whether a node has been taken offline, check the state information of the node being removed. After verifying, you can take the next node offline.

View File

@ -0,0 +1,344 @@
---
title: Try
description: Try locally with Docker
menu:
"3.1-beta":
parent: Tasks
weight: 1
---
In this guide, you'll learn how to quickly get a tiny TiKV cluster running locally, use our Rust client to get, set, and scan data in TiKV, and quickly start and stop the cluster to accompany your development environment.
This guide won't worry about details such as security, resiliency, or production-readiness. (Those topics are covered more in detail in the [deploy](../../deploy/introduction) guides.) Nor will this guide cover how to develop TiKV itself (See [`CONTRIBUTING.md`](https://github.com/tikv/tikv/blob/master/CONTRIBUTING.md).) Instead, this guide focuses on an easy development experience and low resource consumption.
## Overview
In order to get a functioning TiKV service you will need to start a TiKV service and a PD service. PD works alongside TiKV to act as a coordinator and timestamp oracle.
Communication between TiKV, PD, and any services which use TiKV is done via gRPC. We provide clients for [several languages](../../../reference/clients/introduction/), and this guide will briefly show you how to use the Rust client.
Using Docker, you'll create a pair of persistent services, `tikv` and `pd`, and learn to manage them easily. Then you'll write a simple Rust client application and run it from your local host. Finally, you'll learn how to quickly tear down and bring up the services, and review some basic limitations of this configuration.
![Architecture](../../../../img/docs/getting-started-docker.svg)
{{< warning >}}
In a production deployment there would be **at least** three TiKV services and three PD services spread among 6 machines. Most deployments also include kernel tuning, sysctl tuning, robust systemd services, firewalls, monitoring with Prometheus, Grafana dashboards, log collection, and more. Even so, to be sure of your resilience and security, consider consulting our [maintainers](https://github.com/tikv/tikv/blob/master/MAINTAINERS.md).
If you are interested in deploying for production we suggest investigating the [deploy](../../deploy/introduction) guides.
{{< /warning >}}
While it's possible to use TiKV through a query layer, like [TiDB](https://github.com/pingcap/tidb) or [Titan](https://github.com/distributedio/titan), you should refer to the user guides of those projects in order to set up test clusters. This guide only deals with TiKV, PD, and TiKV clients.
## Prerequisites
This guide assumes you have the following knowledge and tools at your disposal:
* Working knowledge about Docker (e.g. how to run or stop a container),
* A modern Docker daemon which supports `docker stack` and version 3.7 of the Compose file format, running on a machine with:
+ A modern (circa >2012) x86 64-bit processor (supporting SSE4.2)
+ At least 10 GB of free storage space
+ A modest amount of memory (4+ GB) available
+ A `ulimit.nofile` value higher than 82920 for the docker service
While this guide was written with Linux in mind, you can use any operating system as long as the Docker service is able to run Linux containers. You may need to make small adaptations to commands to suit your operating system. [Let us know if you get stuck](https://github.com/tikv/website/issues/new) so we can fix it!
## Starting the stack
The maintainers from PingCAP publish battle-tested release images of both `pd` and `tikv` on [Docker Hub](https://hub.docker.com/u/pingcap). These are used in their [TiDB Cloud](https://pingcap.com/tidb-cloud/) kubernetes clusters as well as opt-in via their [`tidb-ansible`](https://github.com/pingcap/tidb-ansible) project.
For a TiKV client to interact with a TiKV cluster, it needs to be able to reach each PD and TiKV node. Since TiKV balances and replicates data across all nodes, and any node may be in charge of any particular *Region* of data, your client needs to be able to reach every node involved. (Replicas of your clients do not need to be able to reach each other.)
In the interest of making sure this guide can work for all platforms, it uses `docker stack` to deploy an ultra-minimal cluster that you can quickly tear down and bring back up again. This cluster won't feature security or persistence, and won't have static hostnames.
**Unless you've tried using `docker stack` before**, you may need to run `docker swarm init`. If you're unsure, it's best just to run it and ignore the error if you see one.
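If you would rather run it unconditionally and ignore the error when the node is already part of a swarm, a one-liner like this works:
```bash
# Initialize a swarm if needed; suppress the error when one already exists.
docker swarm init 2>/dev/null || true
```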
To begin, create a `stack.yml`:
```yml
version: "3.7"
x-defaults: &defaults
init: true
volumes:
- ./entrypoints:/entrypoints
environment:
SLOT: "{{.Task.Slot}}"
NAME: "{{.Task.Name}}"
entrypoint: /bin/sh
deploy:
replicas: 1
restart_policy:
condition: on-failure
delay: 5s
services:
pd:
<<: *defaults
image: pingcap/pd
hostname: "{{.Task.Name}}.tikv"
init: true
networks:
tikv:
aliases:
- pd.tikv
ports:
- "2379:2379"
- "2380:2380"
command: /entrypoints/pd.sh
tikv:
<<: *defaults
image: pingcap/tikv
hostname: "{{.Task.Name}}.tikv"
networks:
tikv:
aliases:
- tikv.tikv
ports:
- "20160:20160"
command: /entrypoints/tikv.sh
networks:
tikv:
name: "tikv"
driver: "overlay"
attachable: true
```
Then create `entrypoints/pd.sh` with the following:
```bash
#! /bin/sh
set -e
if [ $SLOT = 1 ]; then
exec ./pd-server \
--name $NAME \
--client-urls http://0.0.0.0:2379 \
--peer-urls http://0.0.0.0:2380 \
--advertise-client-urls http://`cat /etc/hostname`:2379 \
--advertise-peer-urls http://`cat /etc/hostname`:2380
else
exec ./pd-server \
--name $NAME \
--client-urls http://0.0.0.0:2379 \
--peer-urls http://0.0.0.0:2380 \
--advertise-client-urls http://`cat /etc/hostname`:2379 \
--advertise-peer-urls http://`cat /etc/hostname`:2380 \
--join http://pd.tikv:2379
fi
```
Last, an `entrypoints/tikv.sh` with the following:
```bash
#!/bin/sh
set -e
exec ./tikv-server \
--addr 0.0.0.0:20160 \
--status-addr 0.0.0.0:20180 \
--advertise-addr `cat /etc/hostname`:20160 \
--pd-endpoints pd.tikv:2379
```
Next, you can deploy the stack to Docker:
```bash
docker stack deploy --compose-file stack.yml tikv
```
The output should look like this:
```bash
Creating network tikv
Creating service tikv_pd
Creating service tikv_tikv
```
## Managing services
**Check the state of running services:**
```bash
$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
6ia0pefrd811 tikv_pd replicated 1/1 pingcap/pd:latest *:2379-2380->2379-2380/tcp
26u77puqmw4d tikv_tikv replicated 1/1 pingcap/tikv:latest *:20160->20160/tcp
```
**Turn off the services:**
{{< warning >}}
This will delete data!
{{</ warning >}}
```bash
$ docker service scale tikv_pd=0 tikv_tikv=0
tikv_pd scaled to 0
tikv_tikv scaled to 0
overall progress: 0 out of 0 tasks
verify: Service converged
overall progress: 0 out of 0 tasks
verify: Service converged
```
**Turn services back on:**
{{< info >}}
This creates brand new containers!
{{</ info >}}
```bash
$ docker service scale tikv_pd=1 tikv_tikv=1
tikv_pd scaled to 1
tikv_tikv scaled to 1
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
```
**Inquire into the metrics:**
Normally, these metrics would be pulled by Prometheus, but the output is human-readable and functions as a basic liveness test.
```bash
$ docker run --rm -ti --network tikv alpine sh -c "apk add curl; curl http://pd.tikv:2379/metrics"
# A lot of output...
$ docker run --rm -ti --network tikv alpine sh -c "apk add curl; curl http://tikv.tikv:20180/metrics"
# A lot of output...
```
**Inquire into the resource consumption of the containers:**
```bash
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c4360f65ded3 tikv_tikv.1.a8sfm113yotkkv5klqtz5cvrn 0.36% 689MiB / 30.29GiB 2.22% 8.44kB / 7.42kB 0B / 0B 66
3f18cc8f415b tikv_pd.1.r58jn3kolaxgqdbyb8w2mcx8r 1.56% 22.21MiB / 30.29GiB 0.07% 8.11kB / 7.75kB 0B / 0B 21
```
**Remove the stack entirely:**
{{< warning >}}
This will delete data!
{{</ warning >}}
```bash
$ docker stack rm tikv
```
## Creating a project
Below, you'll use the Rust client, but you are welcome to use [any TiKV client](../../../reference/clients/introduction/).
Because you will eventually need to deploy the binary into the same network as the PD and TiKV nodes, you'll build and run it in a Docker container later in this guide.
Create a new example project, then change into the directory:
```bash
cargo new tikv-example
cd tikv-example
```
{{< warning >}}
You will need to use a `nightly` toolchain that supports the `async`/`await` feature in order to use the TiKV client in the guide below. This is expected to become stable in Rust 1.38.0.
For now, please `echo 'nightly-2019-08-25' > ./rust-toolchain` in the project directory.
{{< /warning >}}
Next, you'll need to add the TiKV client as a dependency in the `Cargo.toml` file:
```toml
[dependencies]
tikv-client = { git = "https://github.com/tikv/client-rust.git" }
tokio = "0.2.0-alpha.4"
```
Then you can edit the `src/main.rs` file with the following:
```rust
use tikv_client::{Config, RawClient, Error};
#[tokio::main]
async fn main() -> Result<(), Error> {
let config = Config::new(vec!["http://pd.tikv:2379"]);
let client = RawClient::new(config)?;
let key = "TiKV".as_bytes().to_owned();
let value = "Works!".as_bytes().to_owned();
client.put(key.clone(), value.clone()).await?;
println!(
"Put: {} => {}",
std::str::from_utf8(&key).unwrap(),
std::str::from_utf8(&value).unwrap()
);
let returned: Vec<u8> = client.get(key.clone()).await?
.expect("Value should be present.").into();
assert_eq!(returned, value);
println!(
"Get: {} => {}",
std::str::from_utf8(&key).unwrap(),
std::str::from_utf8(&value).unwrap()
);
Ok(())
}
```
{{< info >}}
TiKV works with binary data, so your project can store arbitrary data such as binaries or non-UTF-8 encoded data if necessary. While the Rust client accepts `String` values as well as `Vec<u8>`, it will only output `Vec<u8>`.
{{< /info >}}
Now, because the client needs to be part of the same network (`tikv`) as the PD and TiKV nodes, you must build this binary into a Docker container. Create a `Dockerfile` in the root of your project directory with the following content:
```dockerfile
FROM ubuntu:latest
# Systemwide setup
RUN apt update
RUN apt install --yes build-essential protobuf-compiler curl cmake golang
# Create the non-root user.
RUN useradd builder -m -b /
USER builder
RUN mkdir -p ~/build/src
# Install Rust
COPY rust-toolchain /builder/build/
RUN curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain `cat /builder/build/rust-toolchain` -y
ENV PATH="/builder/.cargo/bin:${PATH}"
# Fetch, then prebuild all deps
COPY Cargo.toml rust-toolchain /builder/build/
RUN echo "fn main() {}" > /builder/build/src/main.rs
WORKDIR /builder/build
RUN cargo fetch
RUN cargo build --release
COPY src /builder/build/src
RUN rm -rf ./target/release/.fingerprint/tikv-example*
# Actually build the binary
RUN cargo build --release
ENTRYPOINT /builder/build/target/release/tikv-example
```
Next, build the image:
```bash
docker build -t tikv-example .
```
Then start the produced image:
```bash
docker run -ti --rm --network tikv tikv-example
```
At this point, you're ready to start developing against TiKV!
Want to keep reading? You can explore [Deep Dive TiKV](../../../deep-dive/introduction) to learn more about how TiKV works at a technical level.
Want to improve your Rust abilities? Some of our contributors work on creating [Practical Network Applications](https://github.com/pingcap/talent-plan/tree/master/rust), an open self guided study to master Rust while making fun distributed systems.

View File

@ -1,30 +0,0 @@
---
title: Byzantine Failure
menu:
docs:
parent: Consensus algorithm
weight: 2
---
Consensus algorithms are typically either *Byzantine Fault Tolerant*, or not. Succinctly, systems which can withstand Byzantine faults are able to withstand misbehaving peers. Most distributed systems you would use inside of a VLAN, such as Kafka, TiKV, and etcd, are not Byzantine Fault Tolerant.
In order to withstand Byzantine faults, the system must tolerate peers:
* actively spreading incorrect information,
* deliberately not spreading correct information,
* modifying information that would otherwise be correct.
This extends far beyond situations where the network link degrades and starts corrupting packets at the TCP layer. Those kinds of issues are easily tractable compared to a system being able to withstand active internal subversion.
In order to better understand Byzantine Fault Tolerance it helps to imagine the Byzantine Generals Problem:
> Several Byzantine generals and their armies have surrounded an enemy army inside a deep forest. Separate, they are not strong enough to defeat the enemy, but if they attack in a coordinated fashion they will succeed. They must all agree on a time to attack the enemy.
>
> In order to communicate, the generals can send messengers through the forest. These messages may or may not reach their destination. They could be kidnapped and replaced with imposters, converted to the enemy cause, or outright killed.
>
> How can the generals confidently coordinate a time to attack?
After thinking on the problem for a time, you can conclude that tackling such a problem introduces a tremendous amount of complexity and overhead to a system.
Separating Byzantine tolerant systems from non-tolerant systems helps with evaluation of systems. A non-tolerant system will almost always outperform a tolerant system.

View File

@ -1,21 +0,0 @@
---
title: CAP Theorem
menu:
docs:
parent: Consensus algorithm
weight: 1
---
In 2000, Eric Brewer presented [“Towards Robust Distributed Systems”](http://awoc.wolski.fi/dlib/big-data/Brewer_podc_keynote_2000.pdf) which detailed the CAP Theorem. Succinctly, the theorem declares that a distributed system may only choose two of the following three attributes:
* **Consistency:** Every read receives the most recent write or an error
* **Availability:** Every request receives a (non-error) response without necessarily considering the most recent write
* **Partition tolerance:** The system continues to operate despite an arbitrary number of messages being dropped (or delayed) between nodes
Traditional RDBMS, like [PostgreSQL](https://www.postgresql.org/), that provide [ACID](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf) guarantees, favor consistency over availability. BASE (Basic Availability, Soft-state, Eventual consistency) systems, like MongoDB and other NoSQL systems, favor availability over consistency.
In 2012, Daniel Abadi proposed that CAP was not sufficient to describe the trade-offs which occur when choosing the attributes of a distributed system, and described an expanded [PACELC](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf) Theorem:
> In the case of a partition (**P**), one has to choose between availability (**A**) and consistency (**C**) (as per the CAP theorem), but else (**E**), even when the system is running normally in the absence of partitions, one has to choose between latency (**L**) and consistency (**C**).
In order to support greater availability of data, most systems will replicate data between multiple peers. The system may also replicate a write-ahead log offsite. In order to fulfill these availability guarantees, the system must ensure that a certain number of replications have occurred before confirming an action. **More replication means more consistency but also more latency.**

View File

@ -1,16 +0,0 @@
---
title: Consensus algorithm
menu:
docs:
parent: Deep Dive
weight: 2
---
When building a distributed system, one principal goal is often to build in fault-tolerance. That is, if one particular node in a network goes down, or if there is a network partition, the system continues to operate. The cluster of nodes taking part in a distributed consensus protocol must come to agreement regarding values, and once that decision is reached, that choice is final, even if some nodes were in a faulty state at the time.
Distributed consensus algorithms often take the form of a replicated state machine and log. Each state machine accepts inputs from its log, and represents the value(s) to be replicated, for example, a change to a hash table. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members.
Two well-known distributed consensus algorithms are [Paxos](https://lamport.azurewebsites.net/pubs/paxos-simple.pdf) and [Raft](https://raft.github.io/raft.pdf). Paxos is used in systems like [Chubby](http://research.google.com/archive/chubby.html) by Google, and Raft is used in systems like [TiKV](https://github.com/tikv/tikv) or [etcd](https://github.com/coreos/etcd/tree/master/raft). Raft is generally seen as more understandable and simpler to implement than Paxos.
In TiKV we harness Raft for distributed consensus. We found it much easier to understand both the algorithm and how it will behave, even in truly perverse scenarios.

View File

@ -1,28 +0,0 @@
---
title: Paxos
menu:
docs:
parent: Consensus algorithm
weight: 3
---
Paxos is a protocol that Leslie Lamport and others have written extensively about. The most succinct paper describing Paxos is ["Paxos Made Simple"](https://lamport.azurewebsites.net/pubs/paxos-simple.pdf), published by Lamport in 2001. The original paper, ["The Part-Time Parliament"](http://lamport.azurewebsites.net/pubs/pubs.html#lamport-paxos), was written in 1989 but not published until 1998.
Paxos defines several roles, and each node in a cluster may perform one or many roles. Each cluster has a **single eventually-chosen leader**, and then some number of **learners** (which take action on the agreed-upon request), **acceptors** (which form quorums and act as "memory"), and **proposers** (which advocate for client requests and coordinate).
Unlike Raft, which represents a relatively concrete protocol, Paxos represents a *family* of protocols. Each variant has different tradeoffs.
A few variants of Paxos:
* Basic Paxos: The basic protocol, allowing consensus about a single value.
* Multi Paxos: Allow the protocol to handle a stream of messages with less overhead than Basic Paxos.
* [**Cheap Paxos**](https://lamport.azurewebsites.net/pubs/web-dsn-submission.pdf): Reduce the number of nodes needed via dynamic reconfiguration in exchange for reduced burst fault tolerance.
* [**Fast Paxos**](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-112.pdf): Reduce the number of round trips in exchange for reduced fault tolerance.
* [**Byzantine Paxos**](http://pmg.csail.mit.edu/papers/osdi99.pdf): Withstanding Byzantine failure scenarios.
* [**Raft**](https://raft.github.io/raft.pdf): Described in the next chapter.
It has been noted in the industry that Paxos is notoriously difficult to learn. Algorithms such as Raft are deliberately designed to be easier to understand.
Due to the complexity of the protocol and the range of variants, it can be difficult to ascertain the state of a Paxos system when things go wrong (or are going right).
The basic Paxos algorithm itself does not define things like how to handle leader failures or membership changes.

View File

@ -1,23 +0,0 @@
---
title: Raft
menu:
docs:
parent: Consensus algorithm
weight: 4
---
In 2014, Diego Ongaro and John Ousterhout presented the Raft algorithm. It is explained succinctly in a [paper](https://raft.github.io/raft.pdf) and detailed at length in a [thesis](https://ramcloud.stanford.edu/~ongaro/thesis.pdf).
Raft defines a strong, single leader and a number of followers in a group of peers. The group represents a **replicated state machine**. Only the leader may service client requests. The leader replicates actions to the followers.
Each peer has a **durable write-ahead log**. All peers append each action as an entry in the log as soon as they receive it. When a quorum (the majority) of peers has confirmed that the entry exists in their log, the leader commits the log; each peer can then apply the action to its state machine.
Raft guarantees **strong consistency** by having only one leader of the group which services all requests. All requests are then replicated to a quorum before being acted on, then confirmed with the requester. From the perspective of the cluster, the leader always has an up to date state machine.
The group is **available** when a majority of peers are able to coordinate. If the group is partitioned, only a partition containing the majority of the group can recover and resume servicing requests. If the cluster is split into three equal subgroups, for example, none of the subgroups will recover and service requests.
Raft supports **leader elections**. If the group leader fails, one of the followers will be elected the new leader. It's not possible for a stale leader to be elected. If a leader candidate is aware of requests which the other peers of a particular subgroup are not, it will be elected over those peers. Since only a majority of peers can form a quorum, a peer must be up to date in order to be elected.
Because of how leader elections work, Raft is not Byzantine fault tolerant. Any node is able to lie and subvert the cluster by starting an election and claiming to have log entries it didn't have.
It's possible to support actions such as changing the peer group's membership, or replicating log entries to non-voting followers (learners). In addition, several optimizations can be applied to Raft. Prevote can be used to introduce a pre-election by possible leaders, allowing them to gauge their ability to become a leader before potentially disrupting a cluster. Joint Consensus can support arbitrary group changes, allowing for better scaling. Batching and pipelining can help high-throughput systems perform better.

View File

@ -1,104 +0,0 @@
---
title: Distributed SQL
menu:
docs:
parent: Distributed SQL over TiKV
weight: 2
---
By now we already know how [TiDB]'s relational structure is encoded into the versioned Key-Value form. In this section, we will focus on the following questions:
* What happens when [TiDB] receives a SQL query?
* How does [TiDB] execute SQL queries in a distributed way?
## What happens when [TiDB] receives a SQL query?
Firstly, let's have a look at the following example:
```sql
select count(*) from t where a + b > 5;
```
{{< figure
src="/img/deep-dive/select-from-tidb.png"
caption="SQL query diagram"
number="1" >}}
As described in the above figure, when [TiDB] receives a SQL query from the client, it will process with the following steps:
1. [TiDB] receives a new SQL from the client.
2. [TiDB] prepares the processing plans for this request, meanwhile [TiDB] gets a timestamp from [PD] as the `start_ts` of this transaction.
3. [TiDB] tries to get the information schema (metadata of the table) from TiKV.
4. [TiDB] prepares Regions for each related key according to the information schema and the SQL query. Then [TiDB] gets information for the related Regions from [PD].
5. [TiDB] groups the related keys by Region.
6. [TiDB] dispatches the tasks to the related TiKV concurrently.
7. [TiDB] reassembles the data and returns the data to the client.
## How does [TiDB] execute SQL queries in a distributed way?
In short, [TiDB] splits the task by Regions and sends them to TiKV concurrently.
For the above example, we assume the rows with the primary key of table `t` are distributed in three Regions:
* Rows with the primary key in [0,100) are in Region 1.
* Rows with the primary key in [100,1000) are in Region 2.
* Rows with the primary key in [1000,+∞) are in Region 3.
Now we can do `count` and sum the result from the above three Regions.
{{< figure
src="/img/deep-dive/coprocessor-select.png"
caption="Coprocessor diagram"
number="2" >}}
### Executors
Now we know [TiDB] splits a read task by Regions, but how does TiKV know which tasks it should handle?
Here [TiDB] will send a Directed Acyclic Graph (DAG) to TiKV with each node as an executor.
{{< figure
src="/img/deep-dive/executors.jpg"
caption="Executors"
number="3" >}}
Supported executors:
* TableScan: Scans the rows with the primary key from the KV store.
* IndexScan: Scans the index data from the KV store.
* Selection: Performs a filter (mostly for `where`). The input is `TableScan` or `IndexScan`.
* Aggregation: Performs an aggregation (e.g. `count(*)`, `sum(xxx)`). The input is `TableScan`, `IndexScan`, or `Selection`.
* TopN: Sorts the data and returns the top n matches, for example, `order by xxx limit 10`. The input is `TableScan`, `IndexScan`, or `Selection`.
{{< figure
src="/img/deep-dive/executors-example.jpg"
caption="Executors example"
number="4" >}}
For the above example, we have the following executors on Region 1:
* Aggregation: `count(*)`.
* Selection: `a + b > 5`
* TableScan: `range:[0,100)`.
### Expression
We have executors as nodes in the DAG, but how do we describe columns, constants, and functions in an `Aggregation` or a `Selection`?
There are three types of expressions:
* Column: a column in the table.
* Constant: a constant, which could be a string, int, duration, and so on.
* Scalar function: describes a function.
{{< figure
src="/img/deep-dive/expression.jpg"
caption="Expression"
number="5" >}}
For the above example `select count(*) from t where a + b > 5`, we have:
* Columns: `a`, `b`.
* Scalar functions: `+`, `>`.
* Constant: `5`.
[TiDB]: https://github.com/pingcap/tidb
[PD]: https://github.com/pingcap/pd

View File

@ -1,47 +0,0 @@
---
title: Distributed SQL over TiKV
menu:
docs:
parent: Deep Dive
weight: 8
---
TiKV is the storage layer for [TiDB], a distributed HTAP SQL database. So far,
we have only explained how a distributed transactional Key-Value database is
implemented. However, this is still far from serving a SQL database. We will
explore and cover the following things in this chapter:
* **Storage**
In this section we will see how the TiDB relational structure (i.e. SQL table
records and indexes) is encoded into the Key-Value form in the latest
version. We will also explore a new Key-Value format that is going to be
implemented soon, and some insights into even better Key-Value formats in the future.
* **Distributed SQL (DistSQL)**
Storing data in a distributed manner using TiKV only utilizes distributed I/O
resources, while the TiDB node that receives the SQL query is still in charge of
processing all rows. We can go a step further by delegating some processing
tasks to TiKV nodes. This way, we can utilize distributed CPU resources! In
this section, we will take a look at the physical plan executors supported in
TiKV so far and see how they enable TiDB to execute SQL queries in a
distributed way.
* **TiKV Query Execution Engine**
When talking about executors we cannot ignore the execution engine.
Although executors running on TiKV are highly simplified and limited, we still
need to carefully design the execution engine, as it is critical to the
performance of the system. This section will cover the traditional Volcano
model execution engine used before TiKV 3.0: how it works, its pros
and cons, and its architecture.
* **Vectorization**
Vectorization is a technique that performs computing over a batch of values.
By introducing vectorization into the execution engine, we will achieve higher
performance. This section will introduce its theory and the architecture of
the vectorized execution engine introduced in TiKV 3.0.
[TiDB]: https://github.com/pingcap/tidb

View File

@ -1,63 +0,0 @@
---
title: Distributed Algorithms
menu:
docs:
parent: Distributed transaction
weight: 2
---
## Two-Phase Commit
In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction, determining whether to commit or abort (roll back) the transaction. It is a specialized type of consensus protocol.

The protocol achieves its goal even in many cases of temporary system failure (involving process, network node, or communication failures, among others), and is thus widely used. However, it is not resilient to all possible failure scenarios, and in rare cases user (i.e. a system administrator) intervention is needed to resolve failures. To aid in recovery from failure, the protocol's participants log the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures. Many protocol variants exist that primarily differ in logging strategies and recovery mechanisms. Though expected to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to the many possible failure scenarios that must be considered and supported.
### Basic Algorithm of 2PC
#### `prepare` phase
The coordinator sends a `prepare` message to all cohorts and waits until it has received a reply from all cohorts.
#### `commit` phase
If the coordinator received an agreement message from all cohorts during the `prepare` phase, the coordinator sends a `commit` message to all the cohorts.
If any cohort votes `No` during the `prepare` phase (or the coordinator's timeout expires), the coordinator sends a `rollback` message to all the cohorts.
### Disadvantages of 2PC
The greatest disadvantage of the two-phase commit protocol is that it is a blocking protocol. If the coordinator fails permanently, some cohorts will never resolve their transactions: after a cohort has sent an agreement message to the coordinator, it will block until a `commit` or `rollback` is received.
For example, consider a transaction involving a coordinator `A` and the cohort `C1`. If `C1` receives a `prepare` message and responds to `A`, but `A` fails before sending `C1` either a `commit` or `rollback` message, then `C1` will block forever.
### 2PC Practice in TiKV
In TiKV we adopt the [Percolator transaction model](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36726.pdf), a variant of two-phase commit. To address the disadvantage of coordinator failures, Percolator doesn't use any node as the coordinator; instead, it uses one of the keys involved in each transaction as the coordinator. We call the coordinating key the primary key, and the other keys secondary keys. Since each key has multiple replicas, and data is kept consistent between these replicas by a consensus protocol (Raft in TiKV), one node's failure doesn't affect the accessibility of the data. Percolator can therefore tolerate a node failing permanently.
## Three-Phase Commit
Unlike the two-phase commit protocol (2PC), 3PC is non-blocking. Specifically, 3PC places an upper bound on the amount of time required before a transaction either commits or aborts. This property ensures that if a given transaction is attempting to commit via 3PC and holds some resource locks, it will release the locks after the timeout.
#### 1st phase
The coordinator receives a transaction request. If there is a failure at this point, the coordinator aborts the transaction. Otherwise, the coordinator sends a `canCommit?` message to the cohorts and moves to the waiting state.
#### 2nd phase
If there is a failure, timeout, or if the coordinator receives a `No` message in the waiting state, the coordinator aborts the transaction and sends an `abort` message to all cohorts. Otherwise the coordinator will receive `Yes` messages from all cohorts within the time window, so it sends `preCommit` messages to all cohorts and moves to the prepared state.
#### 3rd phase
If the coordinator succeeds in the prepared state, it will move to the commit state. However if the coordinator times out while waiting for an acknowledgement from a cohort, it will abort the transaction. In the case where an acknowledgement is received from the majority of cohorts, the coordinator moves to the commit state as well.
A two-phase commit protocol cannot dependably recover from a failure of both the coordinator and a cohort member during the Commit phase. If only the coordinator had failed, and no cohort members had received a commit message, it could safely be inferred that no commit had happened. If, however, both the coordinator and a cohort member failed, it is possible that the failed cohort member was the first to be notified, and had actually done the commit. Even if a new coordinator is selected, it cannot confidently proceed with the operation until it has received an agreement from all cohort members, and hence must block until all cohort members respond.
The three-phase commit protocol eliminates this problem by introducing the `Prepared-to-commit` state. If the coordinator fails before sending `preCommit` messages, the cohort will unanimously agree that the operation was aborted. The coordinator will not send out a `doCommit` message until all cohort members have acknowledged that they are `Prepared-to-commit`. This eliminates the possibility that any cohort member actually completed the transaction before all cohort members were aware of the decision to do so (an ambiguity that necessitated indefinite blocking in the two-phase commit protocol).
### Disadvantages of 3PC
The main disadvantage to this algorithm is that it cannot recover in the event the network is segmented in any manner. The original 3PC algorithm assumes a fail-stop model, where processes fail by crashing and crashes can be accurately detected, and does not work with network partitions or asynchronous communication.
The protocol requires at least three round trips to complete. This potentially causes a long latency in order to complete each transaction.
## Paxos Commit
The Paxos Commit algorithm runs a Paxos consensus algorithm on the commit/abort decision of each participant to achieve a transaction commit protocol that uses 2F + 1 coordinators and makes progress if at least F + 1 of them are working properly. Paxos Commit has the same stable-storage write delay, and can be implemented to have the same message delay in the fault-free case, as Two-Phase Commit, but it uses more messages. The classic Two-Phase Commit algorithm is obtained as the special F = 0 case of the Paxos Commit algorithm.
In the Two-Phase Commit protocol, the coordinator decides whether to abort or commit, records that decision in stable storage, and informs the cohorts of its decision. We could make that fault-tolerant by simply using a consensus algorithm to choose the committed/aborted decision, letting the cohorts be the clients that propose the consensus value. This approach was apparently first proposed by Mohan, Strong, and Finkelstein, who used a synchronous consensus protocol. However, in the normal case, the leader must learn that each cohort has prepared before it can try to get the value "committed" chosen. Having the cohorts tell the leader that they have prepared requires at least one message delay.

View File

@ -1,19 +0,0 @@
---
title: Distributed transaction
menu:
docs:
parent: Deep Dive
weight: 4
---
As TiKV is a distributed transactional key-value database, transaction is a core feature of TiKV. In this chapter we will talk about general implementations of distributed transaction and some implementation details in TiKV.
A database transaction, by definition, must be atomic, consistent, isolated and durable. Database practitioners often refer to these properties of database transactions using the acronym ACID.
Transactions provide an "all-or-nothing" proposition: each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Furthermore, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.
A distributed transaction is a database transaction in which two or more network hosts are involved. Usually, hosts provide transactional resources, while the transaction manager is responsible for creating and managing a global transaction that encompasses all operations against such resources. Distributed transactions, as any other transactions, must have all four ACID properties.
A common algorithm for ensuring correct completion of a distributed transaction is the two-phase commit (2PC).
TiKV adopts Google's Percolator transaction model, a variant of 2PC.

View File

@ -1,69 +0,0 @@
---
title: Isolation Level
menu:
docs:
parent: Distributed transaction
weight: 1
---
Isolation is one of the ACID (Atomicity, Consistency, Isolation, Durability) properties. It determines how transaction integrity is visible to other users and systems. For example, when a user is creating a Purchase Order and has created the header, but not the Purchase Order lines, is the header available for other systems/users (carrying out concurrent operations, such as a report on Purchase Orders) to see?
A lower isolation level increases the ability of many users to access the same data at the same time, but increases the number of concurrency effects (such as dirty reads or lost updates) users might encounter. Conversely, a higher isolation level reduces the types of concurrency effects that users may encounter, but requires more system resources and increases the chances that one transaction will block another.
Most DBMSs offer a number of transaction isolation levels, which control the degree of locking that occurs when selecting data. For many database applications, the majority of database transactions can be constructed to avoid requiring high isolation levels (e.g. SERIALIZABLE level), thus reducing the locking overhead for the system. The programmer must carefully analyze database access code to ensure that any relaxation of isolation does not cause software bugs that are difficult to find. Conversely, if higher isolation levels are used, the possibility of deadlock is increased, which also requires careful analysis and programming techniques to avoid.
Since each isolation level is stronger than those below, in that no higher isolation level allows an action forbidden by a lower one, the standard permits a DBMS to run a transaction at an isolation level stronger than that requested (e.g., a "Read committed" transaction may actually be performed at a "Repeatable read" isolation level).
The isolation levels defined by the ANSI/ISO SQL standard are listed as follows.
### Serializable
This is the highest isolation level.
With a lock-based concurrency control DBMS implementation, serializability requires read and write locks (acquired on selected data) to be released at the end of the transaction. Also range-locks must be acquired when a SELECT query uses a ranged WHERE clause, especially to avoid the [phantom reads phenomenon](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Phantom_reads).
When using non-lock based concurrency control, no locks are acquired; however, if the system detects a write collision among several concurrent transactions, only one of them is allowed to commit. See [snapshot isolation](#snapshot-isolation) for more details on this topic.
Per the SQL-92 standard:
> The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.
### Repeatable Read
In this isolation level, a lock-based concurrency control DBMS implementation keeps read and write locks (acquired on selected data) until the end of the transaction. However, range-locks are not managed, so phantom reads can occur.
Write skew is possible at this isolation level, a phenomenon where two writes are allowed to the same column(s) in a table by two different writers (who have previously read the columns they are updating), resulting in the column having data that is a mix of the two transactions.
Repeatable read is the default isolation level for MySQL's InnoDB engine.
### Read Committed
In this isolation level, a lock-based concurrency control DBMS implementation keeps write locks (acquired on selected data) until the end of the transaction, but read locks are released as soon as the SELECT operation is performed (so the non-repeatable reads phenomenon can occur in this isolation level). As in the previous level, range-locks are not managed.
Putting it in simpler words, read committed is an isolation level that guarantees that any data read is committed at the moment it is read. It simply restricts the reader from seeing any intermediate, uncommitted, 'dirty' read. It makes no promise whatsoever that if the transaction re-issues the read, it will find the same data; data is free to change after it is read.
### Read Uncommitted
This is the lowest isolation level. In this level, dirty reads are allowed, so one transaction may see not-yet-committed changes made by other transactions.
### Snapshot Isolation
We mentioned 4 different isolation levels above, but TiDB doesn't adopt any of them. Instead, TiDB uses snapshot isolation as its default isolation level. The main reason is that it allows better concurrency, yet still avoids most of the concurrency anomalies that serializability avoids (but not always all).
TiDB is not alone: snapshot isolation also has been adopted by major database management systems such as InterBase, Firebird, Oracle, MySQL, PostgreSQL, SQL Anywhere, MongoDB and Microsoft SQL Server (2005 and later).
Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database, and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.
In practice snapshot isolation is implemented within multiversion concurrency control (MVCC), where generational values of each data item (versions) are maintained: MVCC is a common way to increase concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions to read the several most recent relevant versions of each object. The prevalence of snapshot isolation has been seen as a refutation of the ANSI SQL-92 standard's definition of isolation levels, as it exhibits none of the "anomalies" that the SQL standard prohibited, yet is not serializable (the anomaly-free isolation level defined by ANSI).
A transaction executing under snapshot isolation appears to operate on a personal snapshot of the database, taken at the start of the transaction. When the transaction concludes, it will successfully commit only if the values updated by the transaction have not been changed externally since the snapshot was taken. Such a write-write conflict will cause the transaction to abort.
In a write skew anomaly, two transactions (T1 and T2) concurrently read an overlapping data set (e.g. values V1 and V2), concurrently make disjoint updates (e.g. T1 updates V1, T2 updates V2), and finally concurrently commit, neither having seen the update performed by the other. Were the system serializable, such an anomaly would be impossible, as either T1 or T2 would have to occur "first", and be visible to the other. In contrast, snapshot isolation permits write skew anomalies.
As a concrete example, imagine V1 and V2 are two balances held by a single person, James. The bank will allow either V1 or V2 to run a deficit, provided the total held in both is never negative (i.e. V1 + V2 ≥ 0). Both balances are currently $100. James initiates two transactions concurrently, T1 withdrawing $200 from V1, and T2 withdrawing $200 from V2.
If the database guaranteed serializable transactions, the simplest way of coding T1 is to deduct $200 from V1, and then verify that V1 + V2 ≥ 0 still holds, aborting if not. T2 similarly deducts $200 from V2 and then verifies V1 + V2 ≥ 0. Since the transactions must serialize, either T1 happens first, leaving V1 = -$100, V2 = $100, and preventing T2 from succeeding (since V1 + (V2 - $200) would be -$200), or T2 happens first and similarly prevents T1 from committing.
If the database is under snapshot isolation (MVCC), however, T1 and T2 operate on private snapshots of the database: each deducts $200 from an account, and then verifies that the new total is zero, using the other account value that held when the snapshot was taken. Since neither update conflicts, both commit successfully, leaving V1 = V2 = -$100, and V1 + V2 = -$200.
In TiDB, you can use the `SELECT … FOR UPDATE` statement to avoid the write skew anomaly. In this case, TiDB uses locks to serialize these writes on top of MVCC, keeping some of MVCC's performance benefits while supporting the stronger "serializable" level of isolation for the affected statements.
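To make the anomaly concrete, here is a minimal, self-contained Rust sketch of the scenario above. The in-memory "snapshot" is purely illustrative: both transactions validate the invariant against the same snapshot, make disjoint writes, and therefore both commit.

```rust
// Write skew under snapshot isolation: T1 and T2 each check the invariant
// V1 + V2 >= 0 against the same snapshot, then make disjoint updates, so
// neither sees a write-write conflict and both commit.
fn main() {
    let snapshot = (100i64, 100i64); // (V1, V2) when both transactions start

    // T1 withdraws $200 from V1 and validates against its snapshot.
    let t1_v1 = snapshot.0 - 200;
    assert!(t1_v1 + snapshot.1 >= 0); // -100 + 100 == 0: T1 passes

    // T2 withdraws $200 from V2 and validates against the same snapshot.
    let t2_v2 = snapshot.1 - 200;
    assert!(snapshot.0 + t2_v2 >= 0); // 100 + -100 == 0: T2 passes

    // The updates are disjoint, so first-committer-wins detects no conflict,
    // and the committed state violates the invariant.
    let (v1, v2) = (t1_v1, t2_v2);
    assert_eq!(v1 + v2, -200);
}
```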

View File

@ -1,25 +0,0 @@
---
title: Locking
menu:
docs:
parent: Distributed transaction
weight: 3
---
To prevent lost updates and dirty reads, locking is employed to manage the actions of multiple concurrent users on a database. The two types of locking are pessimistic locking and optimistic locking.
## Pessimistic Locking
A user who reads a record with the intention of updating it places an exclusive lock on the record to prevent other users from manipulating it. This means no one else can manipulate that record until the user releases the lock. The downside is that users can be locked out for a very long time, thereby slowing the overall system response and causing frustration.
Pessimistic locking is mainly used in environments where write contention is heavy, where the cost of protecting data through locks is less than the cost of rolling back transactions if concurrency conflicts occur. Pessimistic concurrency is best implemented when lock times will be short, as in programmatic processing of records. Pessimistic concurrency requires a persistent connection to the database and is not a scalable option when users are interacting with data, because records might be locked for relatively large periods of time. It is not appropriate for use in web application development.
## Optimistic Locking
This allows multiple concurrent users to access the database while the system keeps a copy of the initial read made by each user. When a user wants to update a record, the application first determines whether another user has changed the record since it was last read. The application does this by comparing the initial read held in memory to the database record, to detect any changes made to the record. Any discrepancy between the initial read and the database record violates concurrency rules and causes the system to disregard the update request; an error message is reported and the user must start the update process again. Optimistic locking improves database performance by reducing the amount of locking required, thereby reducing the load on the database server. It works efficiently with tables that require limited updates, since no users are locked out. However, some updates may fail. The downside is constant update failures when there are high volumes of update requests from multiple concurrent users, which can be frustrating for them.
Optimistic locking is appropriate in environments where there is low write contention for data, or where read-only access to data is required.
## Practice in TiKV
We use the Percolator Transaction model in TiKV, which uses an optimistic locking strategy. It reads database values, writes them tentatively, then checks whether other transactions have modified data that this transaction has used (read or written). This includes transactions that completed after this transaction's start time, and optionally, transactions that are still active at validation time. If there is no conflict, all changes take effect. If there is a conflict it is resolved, typically by aborting the transaction. In TiKV, the whole transaction is restarted with a new timestamp.
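The validate-then-commit loop at the heart of optimistic locking can be sketched with a compare-and-swap over a single in-memory cell. This is a deliberately simplified stand-in for illustration, not TiKV's actual transaction code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Read a value, compute a new one, and apply it only if the value is
// unchanged since the read; otherwise restart, much like TiKV restarting
// a conflicting transaction with a new timestamp.
fn optimistic_update(cell: &AtomicU64, f: impl Fn(u64) -> u64) {
    loop {
        let seen = cell.load(Ordering::Acquire); // the "initial read"
        let new = f(seen);
        if cell
            .compare_exchange(seen, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return; // no conflict: the update takes effect
        }
        // Conflict detected: another writer got there first, so retry.
    }
}

fn main() {
    let cell = AtomicU64::new(10);
    optimistic_update(&cell, |v| v + 1);
    assert_eq!(cell.load(Ordering::Relaxed), 11);
}
```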

View File

@ -1,84 +0,0 @@
---
title: Optimized Percolator
menu:
docs:
parent: Distributed transaction
weight: 6
---
As described in the [previous chapter](../percolator), TiKV makes use of Percolator's transaction algorithm. TiKV's implementation includes several optimizations on top of Percolator, which we will introduce in this chapter.
## Parallel Prewrite
In practice, for a single transaction, we don't want to do prewrites one by one. When there are dozens of TiKV nodes in the cluster, we hope the prewrite can be executed concurrently on these TiKV nodes.
In TiKV's implementation, when committing a transaction, the keys in the transaction will be divided into several batches and each batch will be prewritten in parallel. It doesn't matter whether the primary key is written first.
If a conflict happens during a transaction's prewrite phase, the prewrite process will be canceled and rollback will be performed on all keys affected by the transaction. Doing rollback on a key will leave a `Rollback` record in `CF_WRITE` (Percolator's `write` column), which is not described in Google's Percolator paper. The `Rollback` record is a mark to indicate that the transaction with the `start_ts` in the record has been rolled back, and if a prewrite request arrives later than the rollback request, the prewrite will not succeed. This situation may be caused by network issues. Correctness wouldn't be broken if we allowed the prewrite to succeed; however, the key would then be locked and unavailable until the lock's TTL expired.
## Short Value in Write Column
As mentioned in [Percolator in TiKV](../percolator/#percolator-in-tikv), TiKV uses RocksDB's column families to store Percolator's different columns. Different column families of RocksDB are actually different LSM-Trees. When we access a value, we first need to search `CF_WRITE` to find the `start_ts` of the version we need, and then search for the corresponding record in `CF_DEFAULT`. If a value is very small, it is wasteful to search RocksDB twice.
The optimization in TiKV is to avoid handling `CF_DEFAULT` for short values. If the value is short enough, it will not be put into `CF_DEFAULT` during the prewrite phase. Instead, it will be embedded in the lock and saved in `CF_LOCK`. Then in the commit phase, the value will be moved out of the lock and inlined in the write record. Therefore, we can access and manipulate short values without having to handle `CF_DEFAULT`.
## Point Read Without Timestamp
Timestamps are critical to providing isolation for transactions. We allocate a unique `start_ts` for every transaction, and ensure that a transaction T can only see data committed before T's `start_ts`.
But if transaction T does nothing but read a single key, is it really necessary to allocate it a `start_ts`? The answer is no. We can simply read the newest version directly, because it's equivalent to reading with a `start_ts` that is exactly the instant when the key is read. It's even OK to read a locked key, because it's equivalent to reading with a `start_ts` allocated just before the lock's `start_ts`.
## Calculated Commit Timestamp
{{< warning >}}
This optimization hasn't been finished yet, but will be available in the future. [RFC](https://github.com/tikv/rfcs/pull/25).
{{</ warning >}}
To provide Snapshot Isolation, we must ensure all transactional reads are
repeatable. The `commit_ts` should be large enough so that the transaction will
not be committed before a valid read. Otherwise, Repeatable Read will be broken.
For example:
1. Txn1 gets `start_ts` 100
2. Txn2 gets `start_ts` 200
3. Txn2 reads key `"k1"` and gets value `"1"`
4. Txn1 prewrites `"k1"` with value `"2"`
5. Txn1 commits with `commit_ts` 101
6. Txn2 reads key `"k1"` and gets value `"2"`
Txn2 reads `"k1"` twice but gets two different results. If `commit_ts` is
allocated from PD, this will not happen, because Txn2's first read must happen
before Txn1's prewrite while Txn1's `commit_ts` must be requested after
finishing prewrite. And as a result, Txn2's `commit_ts` must be larger than
Txn1's `start_ts`.
On the other hand, `commit_ts` can't be arbitrarily large. If the `commit_ts` is
ahead of the actual time, the committed data may be unreadable by other new
transactions, which breaks integrity. We are not sure whether a timestamp is
ahead of the actual time if we don't ask PD.
To conclude, in order not to break the Snapshot Isolation and the integrity, a
valid range for `commit_ts` should be:
```text
max{start_ts, max_read_ts_of_written_keys} < commit_ts <= now
```
This yields a method for calculating the `commit_ts`:
```text
commit_ts = max{start_ts, region_1_max_read_ts, region_2_max_read_ts, ...} + 1
```
where `region_N_max_read_ts` is the maximum timestamp of all reads on the
region, for all regions involved in the transaction.
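A direct transcription of this calculation might look like the sketch below; `region_max_read_ts` stands in for the per-region maximum read timestamps, which is hypothetical here since the RFC is not yet implemented:

```rust
// commit_ts = max{start_ts, region_1_max_read_ts, ...} + 1, so it is
// strictly greater than the transaction's start_ts and every read that
// has been served on the written keys.
fn calculated_commit_ts(start_ts: u64, region_max_read_ts: &[u64]) -> u64 {
    region_max_read_ts.iter().copied().fold(start_ts, u64::max) + 1
}

fn main() {
    // start_ts = 100; two involved regions report max read ts 90 and 120.
    assert_eq!(calculated_commit_ts(100, &[90, 120]), 121);
}
```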
## Single Region 1PC
{{< warning >}}
This optimization hasn't been finished yet, but will be available in the future.
{{</ warning >}}
For non-distributed databases, it's easy to provide ACID transactions; but for distributed databases, usually 2PC (two-phase commit) is required to make transactions ACID. Percolator provides such a 2PC algorithm, which is adopted by TiKV.
Considering that write batches are applied atomically within a single Region, we realize that if a transaction affects only one Region, 2PC is actually unnecessary: as long as there is no write conflict, the transaction can be committed directly. Based on [the previous optimization](#calculated-commit-timestamp), the `commit_ts` can be set directly to the Region's `max_read_ts`. In this way, we save an RPC round trip and a write operation (including a Raft commit and a RocksDB write) in TiKV for single-region transactions.

View File

@ -1,160 +0,0 @@
---
title: Percolator
menu:
docs:
parent: Distributed transaction
weight: 5
---
TiKV's support for distributed transactions is inspired by Google's [Percolator](https://ai.google/research/pubs/pub36726.pdf). In this section, we will briefly introduce Percolator and how we make use of it in TiKV.
## What is Percolator?
*Percolator* is a system built by Google for incremental processing on a very large data set. Since this is just a brief introduction, you can view the full paper [here](https://ai.google/research/pubs/pub36726#) for more details. If you are already very familiar with it, you can skip this section and go directly to [Percolator in TiKV](#percolator-in-tikv).
Percolator is built on top of Google's BigTable, a distributed storage system that supports single-row transactions. Percolator implements distributed transactions with ACID snapshot-isolation semantics, which BigTable does not support. A column `c` of Percolator is actually divided into the following internal columns of BigTable:
* `c:lock`
* `c:write`
* `c:data`
* `c:notify`
* `c:ack_O`
Percolator also relies on a service named *timestamp oracle*. The timestamp oracle can produce timestamps in a strictly increasing order. All read and write operations need to apply for timestamps from the timestamp oracle, and a timestamp coming from the timestamp oracle will be used as the time when the read/write operation happens.
Percolator is a multi-version storage system, and a data item's version is represented by the timestamp of the transaction that committed it.
For example,
| key | v:data | v:lock | v:write |
|-----|--------|--------|---------|
|k1 |14:"value2"<br/>12:<br/>10:"value1"|14:primary<br/>12:<br/>10:|14:<br/>12:data@10<br/>10:|
The table shows different versions of data for a single cell. The state shown in the table means that for key `k1`, value `"value1"` was committed at timestamp `12`. Then there is an uncommitted version whose value is `"value2"`, and it's uncommitted because there's a lock. You will understand why it is like this after understanding how transactions work.
The remaining columns, `c:notify` and `c:ack_O`, are used for Percolator's incremental processing. After a modification, the `c:notify` column is used to mark the modified cell as dirty. Users can add some *observers* to Percolator, which perform user-specified operations when they find that data in their observed columns has changed. To find out whether data has changed, the observers continuously scan the `notify` columns to find dirty cells. `c:ack_O` is the "acknowledgment" column of observer `O`, which is used to prevent a row from being incorrectly notified twice. It saves the timestamp of the observer's last execution.
### Writing
Percolator's transactions are committed by a 2-phase commit (2PC) algorithm. Its two phases are `Prewrite` and `Commit`.
In `Prewrite` phase:
1. Get a timestamp from the timestamp oracle, and we call the timestamp `start_ts` of the transaction.
2. For each row involved in the transaction, put a lock in the `lock` column and write the value to the `data` column with the timestamp `start_ts`. One of these locks will be chosen as the *primary* lock, while others are *secondary* locks. Each lock contains the transaction's `start_ts`. Each secondary lock, in addition, contains the location of the primary lock.
* If there's already a lock or a version newer than `start_ts`, the current transaction will be rolled back because of the write conflict.
Then, in the `Commit` phase:
1. Get another timestamp, namely `commit_ts`.
2. Remove the primary lock, and at the same time write a record to the `write` column with timestamp `commit_ts`, whose value records the transaction's `start_ts`. If the primary lock is missing, the commit fails.
3. Repeat the process above for all secondary locks.
Once step 2 (committing the primary) is done, the whole transaction is done. It doesn't matter if committing the secondary locks fails.
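Before walking through the paper's example, here is a condensed sketch of the two phases over a toy in-memory model of the `lock` and `write` columns. It is illustrative only: values, the primary/secondary distinction, and failure handling are all elided.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Default)]
struct Store {
    lock: HashMap<String, u64>,                 // key -> start_ts of lock owner
    write: HashMap<String, BTreeMap<u64, u64>>, // key -> (commit_ts -> start_ts)
}

impl Store {
    // Prewrite: fail on any existing lock or any version newer than start_ts,
    // otherwise lock every key (primary/secondary bookkeeping elided).
    fn prewrite(&mut self, keys: &[&str], start_ts: u64) -> Result<(), &'static str> {
        for &k in keys {
            let newer = self
                .write
                .get(k)
                .and_then(|w| w.keys().next_back())
                .map_or(false, |&commit_ts| commit_ts >= start_ts);
            if self.lock.contains_key(k) || newer {
                return Err("write conflict: roll back");
            }
        }
        for &k in keys {
            self.lock.insert(k.to_string(), start_ts);
        }
        Ok(())
    }

    // Commit: remove each lock and record commit_ts -> start_ts in `write`.
    fn commit(&mut self, keys: &[&str], start_ts: u64, commit_ts: u64) {
        for &k in keys {
            self.lock.remove(k);
            self.write.entry(k.to_string()).or_default().insert(commit_ts, start_ts);
        }
    }
}

fn main() {
    let mut store = Store::default();
    store.prewrite(&["Bob", "Joe"], 7).unwrap(); // start_ts = 7
    store.commit(&["Bob", "Joe"], 7, 8);         // commit_ts = 8
    assert_eq!(store.write["Bob"][&8], 7);
}
```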
Let's see the example from the paper of Percolator. Assume we are writing two rows in a single transaction. At first, the data looks like this:
| key | bal:data | bal:lock | bal:write |
|-----|--------------|-----------|-----------------|
| Bob | 6:<br/>5:$10 | 6:<br/>5: | 6:data@5<br/>5: |
| Joe | 6:<br/>5:$2 | 6:<br/>5: | 6:data@5<br/>5: |
This table shows Bob and Joe's balance. Now Bob wants to transfer his $7 to Joe's account. The first step is `Prewrite`:
1. Get the `start_ts` of the transaction. In our example, it's `7`.
2. For each row involved in this transaction, put a lock in the `lock` column, and write the data to the `data` column. One of the locks will be chosen as the primary lock.
After `Prewrite`, our data looks like this:
| key | bal:data | bal:lock | bal:write |
|-----|----------|----------|-----------|
| Bob | 7:$3<br/>6:<br>5:$10 | 7:primary<br/>6:<br/>5: | 7:<br/>6:data@5<br/>5: |
| Joe | 7:$9<br/>6:<br/>5:$2 | 7:primary@Bob.bal<br/>6:<br/>5: | 7:<br/>6:data@5<br/>5: |
Then `Commit`:
1. Get the `commit_ts`, in our case, `8`.
2. Commit the primary: Remove the primary lock and write the commit record to the `write` column.
| key | bal:data | bal:lock | bal:write |
|-----|----------|----------|-----------|
| Bob | 8:<br/>7:$3<br/>6:<br>5:$10 | 8:<br/>7:<br/>6:<br/>5: | 8:data@7<br/>7:<br/>6:data@5<br/>5: |
| Joe | 7:$9<br/>6:<br/>5:$2 | 7:primary@Bob.bal<br/>6:<br/>5: | 7:<br/>6:data@5<br/>5: |
3. Commit all secondary locks to complete the writing process.
| key | bal:data | bal:lock | bal:write |
|-----|----------|----------|-----------|
| Bob | 8:<br/>7:$3<br/>6:<br>5:$10 | 8:<br/>7:<br/>6:<br/>5: | 8:data@7<br/>7:<br/>6:data@5<br/>5: |
| Joe | 8:<br/>7:$9<br/>6:<br/>5:$2 | 8:<br/>7:<br/>6:<br/>5: | 8:data@7<br/>7:<br/>6:data@5<br/>5: |
### Reading
Reading from Percolator also requires a timestamp. The procedure to perform a read operation is as follows:
1. Get a timestamp `ts`.
2. Check if the row we are going to read is locked with a timestamp in the range `[0, ts]`.
* If there is a lock with a timestamp in the range `[0, ts]`, it means the row was locked by an earlier-started transaction. Since we are not sure whether that transaction will be committed before or after `ts`, the read backs off and retries later.
* If there is no lock or the lock's timestamp is greater than `ts`, the read can continue.
3. Get the latest record in the row's `write` column whose `commit_ts` is in range `[0, ts]`. The record contains the `start_ts` of the transaction when it was committed.
4. Get the row's value in the `data` column whose timestamp is exactly `start_ts`. Then the value is what we want.
For example, consider this table again:
| key | bal:data | bal:lock | bal:write |
|-----|----------|----------|-----------|
| Bob | 8:<br/>7:$3<br/>6:<br>5:$10 | 8:<br/>7:<br/>6:<br/>5: |8:data@7<br/>7:<br/>6:data@5<br/>5: |
| Joe | 7:$9<br/>6:<br/>5:$2 | 7:primary@Bob.bal<br/>6:<br/>5: | 7:<br/>6:data@5<br/>5: |
Let's read Bob's balance.
1. Get a timestamp. Assume it's `9`.
2. Check the lock of the row. The row of Bob is not locked, so we continue.
3. Get the latest record in the `write` column committed before `9`. We get a record with `commit_ts` equal to `8` and `start_ts` equal to `7`, which means its corresponding data is at timestamp `7` in the `data` column.
4. Get the value in the `data` column with timestamp `7`. `$3` is the result of the read.
This algorithm provides us with the abilities of both lock-free read and historical read. In the above example, if we specify that we want to read at time point `7`, then we will see the write record at timestamp `6`, giving us the result `$10` at timestamp `5`.
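The same read procedure can be expressed over a toy model of one row's columns. The numbers in `main` reproduce Bob's row from the table above; this model is illustrative, not TiKV's implementation.

```rust
use std::collections::BTreeMap;

struct Row {
    lock: Option<u64>,           // start_ts of the transaction holding the lock
    write: BTreeMap<u64, u64>,   // commit_ts -> start_ts
    data: BTreeMap<u64, String>, // start_ts -> value
}

fn get(row: &Row, ts: u64) -> Result<Option<&String>, &'static str> {
    // Step 2: a lock in [0, ts] means an earlier transaction might commit
    // before or after ts, so the read must back off and retry.
    if matches!(row.lock, Some(lock_ts) if lock_ts <= ts) {
        return Err("locked: back off and retry");
    }
    // Step 3: the latest write record whose commit_ts is in [0, ts].
    let Some((_commit_ts, start_ts)) = row.write.range(..=ts).next_back() else {
        return Ok(None);
    };
    // Step 4: the value lives in the data column at exactly start_ts.
    Ok(row.data.get(start_ts))
}

fn main() {
    // Bob's row: committed at ts 6 (data@5 = $10) and ts 8 (data@7 = $3).
    let bob = Row {
        lock: None,
        write: BTreeMap::from([(6, 5), (8, 7)]),
        data: BTreeMap::from([(5, "$10".to_string()), (7, "$3".to_string())]),
    };
    assert_eq!(get(&bob, 9).unwrap(), Some(&"$3".to_string()));  // latest read
    assert_eq!(get(&bob, 7).unwrap(), Some(&"$10".to_string())); // historical read
}
```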
### Handling Conflicts
Conflicts are identified by checking the `lock` column. A row can have many versions of data, but it can have at most one lock at any time.
When we are performing a write operation, we try to lock every affected row in the `Prewrite` phase. If we fail to lock some of these rows, the whole transaction is rolled back. Because Percolator uses an optimistic locking algorithm, transactional writes may encounter performance regressions in scenarios where conflicts occur frequently.
To roll back a row, simply remove its lock and its corresponding value in the `data` column.
### Tolerating crashes
Percolator has the ability to survive crashes without breaking data integrity.
First, let's see what will happen after a crash. A crash may happen during `Prewrite`, `Commit` or between these two phases. We can simply divide these conditions into two types: before committing the primary, or after committing the primary.
So, when a transaction `T1` (either reading or writing) finds that a row `R1` has a lock which belongs to an earlier transaction `T0`, `T1` doesn't simply roll back itself immediately. Instead, it checks the state of `T0`'s primary lock.
* If the primary lock has disappeared and there's a record `data @ T0.start_ts` in the `write` column, it means that `T0` has been successfully committed. Then row `R1`'s stale lock can also be committed. Usually we call this `rolling forward`. After this, the new transaction `T1` resumes.
* If the primary lock has disappeared with nothing left, it means the transaction has been rolled back. Then row `R1`'s stale lock should also be rolled back. After this, `T1` resumes.
* If the primary lock exists but is too old (we can determine this by saving the wall time in locks), it indicates that the transaction crashed before being committed or rolled back. Roll back `T0`'s locks, and then `T1` resumes.
* Otherwise, we consider transaction `T0` to be still running. `T1` can roll back itself, or wait for a while to see whether `T0` commits before `T1.start_ts`.
## Percolator in TiKV
TiKV is a distributed transactional key-value storage engine. Each key-value pair can be regarded as a row in Percolator.
TiKV internally uses RocksDB, a key-value storage engine library, to persist data to local disk. RocksDB's atomic write batch and TiKV's transaction scheduler make it atomic to read and write a single user key, which is a requirement of Percolator.
RocksDB provides a feature named *Column Family* (hereafter referred to as *CF*). An instance of RocksDB may have multiple CFs, and each CF is a separate key namespace with its own LSM-Tree. However, different CFs in the same RocksDB instance share a common WAL, providing the ability to write to different CFs atomically.
We divide a RocksDB instance into three CFs: `CF_DEFAULT`, `CF_LOCK` and `CF_WRITE`, which correspond to Percolator's `data`, `lock` and `write` columns respectively. There's an extra CF named `CF_RAFT` which is used to save some Raft metadata, but that's outside the scope of this chapter. The `notify` and `ack_O` columns are not present in TiKV, because for now TiKV doesn't need the ability of incremental processing.
Then, we need to represent different versions of a key. We can simply combine a key and a timestamp into an internal key, which RocksDB can handle. However, since a key can have at most one lock at a time, we don't need to add a timestamp to the key in `CF_LOCK`. Hence, the content of each CF is:
* `CF_DEFAULT`: `(key, start_ts)` -> `value`
* `CF_LOCK`: `key` -> `lock_info`
* `CF_WRITE`: `(key, commit_ts)` -> `write_info`
Our approach to compound user keys and timestamps together is:
1. Encode the user key into the [memcomparable](https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#memcomparable-format) format.
2. Bitwise invert the timestamp (an unsigned int64) and encode it into big-endian bytes.
3. Append the encoded timestamp to the encoded key.
For example, key `"key1"` and timestamp `3` will be encoded as `"key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc"`, where the first 9 bytes are the memcomparable-encoded key and the remaining 8 bytes are the inverted timestamp in big-endian order. In this way, different versions of the same key are always adjacent in RocksDB; and for each key, newer versions are always before older ones.
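The example encoding can be reproduced with a short sketch. This version handles only user keys that fit in a single 8-byte memcomparable group (enough for `"key1"`); longer keys are encoded as multiple groups.

```rust
// Compound a user key and a timestamp: memcomparable-pad the key, append
// a marker byte (0xff minus the number of padding bytes), then append the
// bitwise-inverted timestamp in big-endian order.
fn encode_key(user_key: &[u8], ts: u64) -> Vec<u8> {
    assert!(user_key.len() <= 8, "this sketch encodes one group only");
    let mut buf = Vec::with_capacity(17);
    buf.extend_from_slice(user_key);
    let pad = 8 - user_key.len();
    buf.resize(buf.len() + pad, 0);              // zero-pad the 8-byte group
    buf.push(0xff - pad as u8);                  // significance marker
    buf.extend_from_slice(&(!ts).to_be_bytes()); // inverted ts, big-endian
    buf
}

fn main() {
    assert_eq!(
        encode_key(b"key1", 3),
        b"key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc"
    );
}
```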
There are some differences between TiKV and Percolator's paper. In TiKV, records in `CF_WRITE` have four different types: `Put`, `Delete`, `Rollback` and `Lock`. Only `Put` records need a corresponding value in `CF_DEFAULT`. When rolling back transactions, we don't simply remove the lock but write a `Rollback` record in `CF_WRITE`. Different from Percolator's lock, the `Lock` type of write record in TiKV is produced by queries like `SELECT ... FOR UPDATE` in TiDB. For such a query, the affected keys are not only read: the read itself is also part of a write operation. To guarantee snapshot isolation, we make the read act like a write operation (though it doesn't write anything), ensuring the keys are locked and won't change before the transaction commits.

View File

@ -1,21 +0,0 @@
---
title: Timestamp Oracle
menu:
docs:
parent: Distributed transaction
weight: 4
---
The timestamp oracle plays a significant role in the Percolator transaction model. It is a server that hands out timestamps in strictly increasing order, a property required for the correct operation of the snapshot isolation protocol.
Since every transaction requires contacting the timestamp oracle twice, this service must scale well. The timestamp oracle periodically allocates a range of timestamps by writing the highest allocated timestamp to stable storage; then with that allocated range of timestamps, it can satisfy future requests strictly from memory. If the timestamp oracle restarts, the timestamps will jump forward to the maximum allocated timestamp. Timestamps never go "backwards".
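A toy version of this scheme shows the mechanics: only the window's upper bound is made durable, requests are served from memory, and recovery jumps forward to the persisted bound. `persist_to_stable_storage` is a hypothetical stand-in for a durable write.

```rust
struct Oracle {
    next: u64,      // next timestamp to hand out
    persisted: u64, // highest timestamp durably reserved
}

impl Oracle {
    const WINDOW: u64 = 1_000;

    fn get_ts(&mut self) -> u64 {
        if self.next == self.persisted {
            // Window exhausted: durably reserve another range first.
            self.persisted += Self::WINDOW;
            persist_to_stable_storage(self.persisted);
        }
        self.next += 1;
        self.next
    }

    fn recover(persisted: u64) -> Oracle {
        // After a restart, resume from the persisted bound so timestamps
        // never go backwards.
        Oracle { next: persisted, persisted }
    }
}

fn persist_to_stable_storage(_high_watermark: u64) { /* durable write elided */ }

fn main() {
    let mut oracle = Oracle::recover(0);
    let a = oracle.get_ts();
    let b = oracle.get_ts();
    assert!(b > a); // strictly increasing
}
```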
To save RPC overhead (at the cost of increasing transaction latency) each timestamp requester batches timestamp requests across transactions by maintaining only one pending RPC to the oracle. As the oracle becomes more loaded, the batching naturally increases to compensate. Batching increases the scalability of the oracle but does not affect the timestamp guarantees.
The transaction protocol uses strictly increasing timestamps to guarantee that Get() returns all committed writes before the transaction's start timestamp. To see how it provides this guarantee, consider a transaction R reading at timestamp TR and a transaction W that committed at timestamp TW < TR; we will show that R sees W's writes. Since TW < TR, we know that the timestamp oracle gave out TW before or in the same batch as TR; hence, W requested TW before R received TR. We know that R can't do reads before receiving its start timestamp TR and that W wrote locks before requesting its commit timestamp TW. Therefore, the above property guarantees that W must have at least written all its locks before R did any reads; R's Get() will see either the fully committed write record or the lock, in which case R will block until the lock is released. Either way, W's write is visible to R's Get().
In our system, the timestamp oracle has been embedded into the Placement Driver (PD). PD is the management component with a "God view"; it is responsible for storing metadata and conducting load balancing.
## Practice in TiKV
We use batching and preallocation techniques to increase the timestamp oracle's throughput, and we use a Raft group to tolerate node failure. However, there are still some disadvantages to allocating timestamps from a single node. One is that the timestamp oracle can't be scaled out to multiple nodes. Another is that when the current Raft leader fails, there is a gap during which the system cannot allocate a timestamp, before a new leader is elected. Finally, when the timestamp requester is located in a remote datacenter, it must tolerate the high latency of the network round trip. There are some potential solutions for this final case, such as Google Spanner's TrueTime mechanism and Hybrid Logical Clocks (HLCs).

View File

@ -1,49 +0,0 @@
---
title: Deep Dive
menu:
docs:
weight: 4
nav:
parent: Docs
weight: 4
---
[TiKV](https://github.com/tikv/tikv) is a distributed, transactional key-value database. It has been widely adopted in many critical production environments &mdash; see the [TiKV adopters](https://github.com/tikv/tikv/blob/master/docs/adopters.md). It was also accepted by the [Cloud Native Computing Foundation](https://www.cncf.io) as a [Sandbox project](https://www.cncf.io/blog/2018/08/28/cncf-to-host-tikv-in-the-sandbox/) in August 2018.
TiKV is fully [ACID](https://en.wikipedia.org/wiki/ACID_(computer_science)) compliant and features automatic horizontal scalability, global data consistency, geo-replication, and many other features. It can be used as a building block for other high-level services. For example, we have already used TiKV to support [TiDB](https://github.com/pingcap/tidb) &mdash; a next-generation [HTAP](https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing_(HTAP)) database.
In this book, we will introduce everything about TiKV, including why we built it and how we continue to improve it, what problems we have met, what the core technologies are and why, etc. We hope that through this book, you can develop a deep understanding of TiKV, build your knowledge of distributed programming, or even get inspired to build your own distributed system. 😎
## History
In the middle of 2015, we decided to build a database which solved MySQL's scaling problems. At that time, the most common way to increase MySQL's scalability was to build a proxy on top of MySQL that distributes the load more efficiently, but we didn't think that was the best way.
As far as we knew, proxy-based solutions have the following problems:
+ Building a proxy on top of the MySQL servers cannot guarantee ACID compliance. Notably, the cross-node transactions are not supported natively.
+ It poses great challenges for business flexibility because the users have to worry about the data distribution and design their sharding keys carefully to avoid inefficient queries.
+ The high availability and data consistency of MySQL can't easily be guaranteed with traditional master-slave replication.
Although building a proxy based on MySQL directly might be easy at the beginning, we still decided to choose another way, a more difficult path &mdash; to build a distributed, MySQL-compatible database from scratch.
Fortunately, Google had met the same problem and had already published some papers describing how they built [Spanner](http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf) and [F1](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41344.pdf) to solve it. Spanner is a globally distributed, externally consistent database, and F1 is a distributed SQL database based on Spanner. Inspired by Spanner and F1, we knew we could do the same thing. So we started to build TiDB &mdash; a stateless MySQL layer like F1. After we released TiDB, we knew we needed an underlying Spanner-like database, so we began to develop TiKV.
## Architecture
The diagram below shows the architecture of TiKV:
{{< figure
src="/img/basic-architecture.png"
caption="The architecture of TiKV"
alt="TiKV architecture diagram"
width="70" >}}
In this illustration, there are four TiKV instances in the cluster and each instance uses one [RocksDB](https://github.com/facebook/rocksdb) to save data. On top of RocksDB, TiKV uses the [Raft](https://raft.github.io/) consensus algorithm to replicate the data. In practice, we use at least three replicas to keep data safe and consistent, and these replicas form a Raft group.
We use the traditional [multiversion concurrency control](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) (MVCC) mechanism and have built a distributed transaction layer above the Raft layer. We also provide a Coprocessor framework so that users can push down their computing logic to the storage layer.
All the network communications are through [gRPC](https://grpc.io/) so that contributors can develop their own clients easily.
The whole cluster is managed and scheduled by a central service: the [Placement Driver](https://github.com/pingcap/pd) (PD).
As you can see, the hierarchy of TiKV is clear and easy to understand, and we will give more detailed explanations later.

View File

@ -1,148 +0,0 @@
---
title: B-Tree vs LSM-Tree
menu:
docs:
parent: Key-value engine
weight: 1
---
The [B-tree](https://en.wikipedia.org/wiki/B-tree) and the [Log-Structured Merge-tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) (LSM-tree) are the two most widely used data structures for data-intensive applications to organize and store data. However, each of them has its own advantages and disadvantages. This article aims to use quantitative approaches to compare these two data structures.
## Metrics
In general, there are three critical metrics to measure the performance of a data structure: write amplification, read amplification, and space amplification. This section aims to describe these metrics.
For hard disk drives (HDDs), the cost of disk seek is enormous, such that the performance of random read/write is worse than that of sequential read/write. This article assumes that flash-based storage is used so we can ignore the cost of disk seeks.
### Write Amplification
_Write amplification_ is the ratio of the amount of data written to the storage device versus the amount of data written to the database.
For example, if you are writing 10 MB to the database and you observe 30 MB disk write rate, your write amplification is 3.
Flash-based storage can be written to only a finite number of times, so write amplification will decrease the flash lifetime.
There is another write amplification associated with flash memory and SSDs because flash memory must be erased before it can be rewritten.
### Read Amplification
_Read amplification_ is the number of disk reads per query.
For example, if you need to read 5 pages to answer a query, read amplification is 5.
Note that the units of write amplification and read amplification are different. Write amplification measures how much more data is written than the application thought it was writing, whereas read amplification counts the number of disk reads to perform a query.
Read amplification is defined separately for point queries and range queries. For range queries, the range length matters (the number of rows to be fetched).
Caching is a critical factor for read amplification. For example, with a B-tree in the cold-cache case, a point query requires \\(O(log_BN)\\) disk reads, whereas in the warm-cache case the internal nodes of the B-tree are cached, and so a B-tree requires at most one disk read per query.
### Space Amplification
_Space amplification_ is the ratio of the amount of data on the storage device versus the amount of data in the database.
For example, if you put 10MB in the database and this database uses 100MB on the disk, then the space amplification is 10.
Generally speaking, a data structure can optimize for at most two of read, write, and space amplification. This means no single data structure is likely to be better than another at all three. For example, a B-tree has less read amplification than an LSM-tree, while an LSM-tree has less write amplification than a B-tree.
## Analysis
The B-tree is a generalization of the [binary search tree](https://en.wikipedia.org/wiki/Binary_search_tree), in which a node can have more than two children. There are two kinds of nodes in a B-tree: internal nodes and leaf nodes. A leaf node contains data records and has no children, whereas an internal node can have a variable number of child nodes within some pre-defined range. Internal nodes may be joined or split. An example of a B-tree appears in *Figure 1*.
{{< figure
src="/img/deep-dive/b-tree.png"
caption="B-tree"
number="1" >}}
> Figure 1. The root node is shown at the top of the tree, and in this case happens to contain a single pivot (20), indicating that records with key k ≤ 20 are stored in the first child, and records with key k > 20 are stored in the second child. The first child contains two pivot keys (11 and 15), indicating that records with key k ≤ 11 are stored in the first child, those with 11 < k ≤ 15 are stored in the second child, and those with k > 15 are stored in the third child. The leftmost leaf node contains three values (3, 5, and 7).
The term B-tree may refer to a specific design or a general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The [B+ tree](https://en.wikipedia.org/wiki/B%2B_tree) is one of the most famous variations of B-tree. The idea behind the B+ tree is that internal nodes only contain keys, and an additional level which contains values is added at the bottom with linked leaves.
Like other search trees, an LSM-tree contains key-value pairs. It maintains data in two or more separate components (sometimes called `SSTable`s), each of which is optimized for its respective underlying storage medium; the data in the low level component is efficiently merged with the data in the high level component in batches. An example of LSM-tree appears in *Figure 2*.
{{< figure
src="/img/deep-dive/lsm-tree.png"
caption="LSM-tree"
number="2" >}}
> Figure 2. The LSM-tree contains \\(k\\) components. Data starts in \\(C_0\\), then gets merged into the \\(C_1\\). Eventually the \\(C_1\\) is merged into the \\(C_2\\), and so forth.
An LSM-tree periodically performs _compaction_ to merge several `SSTable`s into one new `SSTable` which contains only the live data from the input `SSTable`s. Compaction helps the LSM-tree recycle space and reduce read amplification. There are two kinds of _compaction strategy_: the size-tiered compaction strategy (STCS) and the level-based compaction strategy (LBCS). The idea behind STCS is to compact small `SSTable`s into medium `SSTable`s when the LSM-tree has enough small `SSTable`s, and to compact medium `SSTable`s into large `SSTable`s when it has enough medium `SSTable`s. The idea of LBCS is to organize data into levels, where each level contains one sorted run. Once a level accumulates enough data, some of the data at this level is compacted into the higher level.
This section discusses the write amplification and read amplification of the B+ tree and the Level-based LSM-tree.
### B+ Tree
In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to increase sequential access performance.
To simplify the analysis, assume that the block size of the tree is \\(B\\) measured in bytes, and keys, pointers, and records are constant size, so that each internal node contains \\(O(B)\\) children and each leaf contains \\(O(B)\\) data records. (The root node is a special case, and can be nearly empty in some situations.) Under all these assumptions, the depth of a B+ tree is
$$
O(log_BN/B)
$$
where \\(N\\) is the size of the database.
#### Write Amplification
For the worst-case insertion workloads, every insertion requires writing the leaf block containing the record, so the write amplification is \\(B\\).
#### Read Amplification
The number of disk reads per query is at most \\(O(log_BN/B)\\), which is the depth of the tree.
### Level-Based LSM-tree
In the Level-based LSM-tree, data is organized into levels. Each level contains one sorted run. Data starts in level 0, then gets merged into the level 1 run. Eventually the level 1 run is merged into the level 2 run, and so forth. Each level is constrained in its size; the growth factor \\(k\\) specifies the ratio between the sizes of adjacent levels:
$$
level_i = level_{i-1} * k
$$
We can analyze the Level-based LSM-tree as follows. If the growth factor is \\(k\\) and the smallest level is a single file of size \\(B\\), then the number of levels is
$$
Θ(log_kN/B)
$$
where \\(N\\) is the size of the database. To simplify the analysis, we assume that the database size is stable and grows slowly over time, so that the size of the database is nearly equal to the size of the last level.
#### Write Amplification
Data must be moved out of each level once, but data from a given level is merged repeatedly with data from the previous level. On average, after being first written into a level, each data item is remerged back into the same level about \\(k/2\\) times. So the total write amplification is
$$
Θ(k*log_kN/B)
$$
#### Read Amplification
To perform a short range query in the cold cache case, we must perform a binary search
on each of the levels.
For the highest level \\(level_i\\), the data size is \\(O(N)\\), so it performs \\(O(log(N/B))\\) disk reads.
For the previous level \\(level_{i-1}\\), the data size is \\(O(N/k)\\), so it performs \\(O(log(N/(kB)))\\) disk reads.
For \\(level_{i-2}\\), the data size is \\(O(N/k^2)\\), so it performs \\(O(log(N/(k^2B)))\\) disk reads.
For \\(level_{i-n}\\), the data size is \\(O(N/k^n)\\), so it performs \\(O(log(N/(k^nB)))\\) disk reads.
So the total number of disk reads is
$$
R = O(log(N/B)) + O(log(N/(kB))) + O(log(N/(k^2B))) + ... + O(log(N/(k^nB))) + 1 = O((log^2(N/B))/log k)
$$
## Summary
The following table shows the summary of various kinds of amplification:
| Data Structure | Write Amplification | Read Amplification |
| :------------------: | :-----------------: | :----------------------: |
| B+ tree | \\(Θ(B)\\) | \\(O(log_BN/B)\\) |
| Level-Based LSM-tree | \\(Θ(klog_kN/B)\\)  | \\(Θ((log^2(N/B))/log k)\\) |
> Table 1. A summary of the write amplification and read amplification for range queries.
Comparing these amplification factors, we can conclude that the Level-based LSM-tree has better write performance than the B+ tree, while its read performance is not as good. The main reason TiKV uses an LSM-tree instead of a B-tree as its underlying storage engine is that improving read performance with caching is much easier than improving write performance.
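To give the bounds in Table 1 some intuition, the sketch below plugs in assumed parameters (4 KB blocks, a 1 TB database, growth factor 10) and drops the constants hidden by the O/Θ notation, so the printed numbers are only orders of magnitude:

```rust
// Back-of-the-envelope evaluation of Table 1. Natural logarithms are used
// throughout; switching log bases only changes constant factors.
fn main() {
    let b: f64 = 4096.0; // block size B in bytes (assumed)
    let n: f64 = 1e12;   // database size N in bytes (assumed)
    let k: f64 = 10.0;   // LSM growth factor (assumed)

    let btree_write = b;                          // Θ(B)
    let btree_read = (n / b).ln() / b.ln();       // O(log_B(N/B))
    let lsm_write = k * (n / b).ln() / k.ln();    // Θ(k·log_k(N/B))
    let lsm_read = (n / b).ln().powi(2) / k.ln(); // Θ(log²(N/B)/log k)

    println!("B+ tree:  write ≈ {btree_write:.0}, read ≈ {btree_read:.1}");
    println!("LSM-tree: write ≈ {lsm_write:.1}, read ≈ {lsm_read:.1}");
}
```

Even with the constants dropped, the trade-off is visible: the B+ tree's write amplification dwarfs the LSM-tree's, while its read amplification is far lower.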

View File

@ -1,35 +0,0 @@
---
title: Key-value engine
menu:
docs:
parent: Deep Dive
weight: 3
---
A key-value engine serves as the bottommost layer in a key-value
database, unless you are going to build your own file system or
operating system. A key-value engine is crucial for a database because
it manages all the persistent data directly.
Most key-value engines provide some common interfaces like `Get`,
`Put` and `Delete`. Some engines also allow you to iterate the
key-values in order efficiently, and most will provide special
features for added efficiency.
Choosing a key-value engine is the first step to building a database. Here are some important things we need to consider:
- _The data structure_. Different data structures are optimized for
different workloads. Some are good for reads and some are good for
writes, etc.
- _Maturity_. We don't need a storage engine to be fancy but we want it
to be reliable. Buggy engines ruin everything you build on top of
them. We recommend using a battle-tested storage engine which has
been adopted by a lot of users.
- _Performance_. The performance of the storage engine limits the
overall performance of the whole database. So make sure the storage
engine meets your expectation and has the potential to improve along
with the database.
In this chapter, we will do a comparison between two well-known data
structures and guide you through the storage engine used in TiKV.

View File

@ -1,163 +0,0 @@
---
title: RocksDB
menu:
docs:
parent: Key-value engine
weight: 2
---
RocksDB is a persistent key-value store for fast storage environments. Here are some of RocksDB's highlighted features:
1. RocksDB uses a log structured database engine, written entirely in
C++, for maximum performance. Keys and values are just
arbitrarily-sized byte streams.
2. RocksDB is optimized for fast, low latency storage such as flash
drives and high-speed disk drives. RocksDB exploits the full
potential of high read/write rates offered by flash or RAM.
3. RocksDB is adaptable to different workloads. From database storage
engines such as MyRocks to application data caching to embedded
workloads, RocksDB can be used for a variety of data needs.
4. RocksDB provides basic operations such as opening and closing a
database, reading and writing to more advanced operations such as
merging and compaction filters.
TiKV uses RocksDB because RocksDB is mature and high-performance. In this section, we will explore how TiKV uses RocksDB. We won't talk about basic features like `Get`, `Put`, `Delete`, and `Iterate` here, because their usage is simple, clear, and works well. Instead, we'll focus on some of the special features TiKV uses, described below.
## [Prefix Bloom Filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter)
A Bloom Filter is a magical data structure that uses few resources but helps a lot. We won't explain the whole algorithm here. If you are not familiar with Bloom Filters, you can think of one as a black box for a dataset that can tell you whether a key *probably* exists or *definitely* does not, without actually searching the dataset. Occasionally a Bloom Filter gives a false-positive answer, although this rarely happens.
TiKV uses Bloom Filters, as well as a variant called the Prefix Bloom Filter (PBF). Instead of telling you whether a whole key exists in a dataset, PBF tells you whether there are keys with the same prefix. Since PBF stores only the unique prefixes instead of all unique whole keys, it saves some memory too, with the downside of a larger false-positive rate.
TiKV supports MVCC, which means that there can be multiple versions
for the same row stored in RocksDB. All versions of the same row share
the same prefix (the row key) but have different timestamps as a suffix. When
we want to read a row, we usually don't know about the exact version
to read, but only want to read the latest version at a specific
timestamp. This is where PBF shines: it can filter out data that cannot contain keys with the same prefix as the row key we provide. Then we just need to search in the data that may contain
different versions of the row key and locate the specific version we
want.
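As a rough illustration, here is how a fixed-length prefix extractor and a memtable prefix bloom might be configured with the community `rocksdb` crate. The 8-byte prefix length and the paths are assumptions for this sketch; TiKV configures its own, more involved extractor.

```rust
use rocksdb::{Options, SliceTransform, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Treat the first 8 bytes of every key (the "row" part) as its prefix.
    opts.set_prefix_extractor(SliceTransform::create_fixed_prefix(8));
    // Keep a bloom filter over prefixes in the memtable as well.
    opts.set_memtable_prefix_bloom_ratio(0.1);

    let db = DB::open(&opts, "/tmp/pbf-demo").unwrap();
    db.put(b"row00001ts1", b"v1").unwrap();
    // Seeks within the prefix "row00001" can now skip SST files whose
    // prefix bloom filter rules that prefix out.
}
```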
## TableProperties
RocksDB allows us to register some table properties collectors. When
RocksDB builds an SST file, it passes the sorted key-values one by one
to the callback of each collector so that we can collect whatever we
want. Then when the SST file is finished, the collected properties
will be stored inside the SST file too.
We use this feature to optimize two functionalities.
The first one is Split Check. Split check is a worker that checks whether regions are large enough to split. In the original implementation, we had to scan all the data within a region to calculate its size, which was resource-consuming. With the `TableProperties` feature, we record the size of small sub-ranges in each SST file, so that we can calculate the approximate size of a region from the table properties without scanning any data at all.
Another one is for MVCC Garbage Collection (GC). GC is a process to
clean up garbage versions (versions older than the configured
lifetime) of each row. If we have no idea whether a region contains
some garbage versions or not, we have to check all regions
periodically. To skip unnecessary garbage collection, we record some
MVCC statistics (e.g. the number of rows and the number of versions)
in each SST file. So before checking every region row by row, we check
the table properties to see if it is necessary to do garbage
collection on a region.
## CompactRange
From time to time, some regions may contain a lot of tombstone entries
because of GC or other delete operations. Tombstone entries are not
good for scan performance and waste disk space as well.
So with the `TableProperties` feature, we can check every region
periodically to see if it contains a lot of tombstones. If it does, we
will compact the region range manually to drop tombstone entries and
release disk space.
We also use `CompactRange` to recover RocksDB from some mistakes like
incompatible table properties across different TiKV versions.
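With the community `rocksdb` crate, manually compacting one region's range looks roughly like the following; the key bounds are placeholders:

```rust
use rocksdb::{Options, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/compact-demo").unwrap();

    // Compact only this region's key range, dropping the tombstones inside
    // it and releasing the disk space they occupied.
    db.compact_range(Some(b"region_start_key"), Some(b"region_end_key"));
}
```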
## [EventListener](https://github.com/facebook/rocksdb/wiki/EventListener)
`EventListener` allows us to listen to some special events, like
flush, compaction or write stall condition change. When a specific
event is triggered or finished, RocksDB will invoke our callbacks with
some information about the event.
TiKV listens to the compaction event to observe the region size
changes. As mentioned above, we calculate the approximate size of each
region from the table properties. The approximate size will be
recorded in the memory so that we don't need to calculate it again and
again if nothing has changed. However, during compactions, some
entries are dropped so the approximate size of some regions should be
updated. That's why we listen to the compaction events and recalculate
the approximate size of some regions when necessary.
## [IngestExternalFile](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files)
RocksDB allows us to generate an SST file outside and then ingest the
file into RocksDB directly. This feature can potentially save a lot of
IO because RocksDB is smart enough to ingest a file to the bottom
level if possible, which can reduce write amplification because the
ingested file doesn't need to be compacted again and again.
We use this feature to handle Raft snapshots. For example, when we want to add a replica on a new server, we can first generate a snapshot file from another server and then send the file to the new server. The new server can then ingest that file into its RocksDB directly, which saves lots of work!
We also use this feature to import a huge amount of data into TiKV. We
have some tools to generate sorted SST files from different data
sources and then ingest those files into different TiKV servers. This
is super fast compared to writing key-values to a TiKV cluster in the
usual way.
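The whole flow of building an SST file outside the database and then ingesting it can be sketched with the community `rocksdb` crate as follows (paths and keys are placeholders):

```rust
use rocksdb::{Options, SstFileWriter, DB};

fn main() {
    // Build a sorted run of key-values into a standalone SST file.
    let opts = Options::default();
    let mut writer = SstFileWriter::create(&opts);
    writer.open("/tmp/bulk.sst").unwrap();
    writer.put(b"k1", b"v1").unwrap(); // keys must be added in sorted order
    writer.put(b"k2", b"v2").unwrap();
    writer.finish().unwrap();

    // Ingest it: RocksDB places the file at the lowest level it can, so the
    // data skips the usual memtable and compaction path.
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    let db = DB::open(&db_opts, "/tmp/ingest-demo").unwrap();
    db.ingest_external_file(vec!["/tmp/bulk.sst"]).unwrap();
    assert_eq!(db.get(b"k1").unwrap().as_deref(), Some(&b"v1"[..]));
}
```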
## DeleteFilesInRange
Previously, TiKV deleted a range of data in the straightforward way: scan all the keys in the range and delete them one by one. However, disk space would not be released until the tombstones had been compacted. Even worse, disk space usage would actually increase temporarily because of the newly written tombstones.
As time goes on, users store more and more data in TiKV until their
disk space is insufficient. Then users will try to drop some tables or
add more stores and expect the disk space usage to decrease in a short
time. But with this deletion method, TiKV didn't meet that expectation. We first tried to solve this by using the `DeleteRange` feature in RocksDB. However, `DeleteRange` turned out to be unstable and could not release disk space fast enough.
A faster way to release disk space is to delete some files directly,
which leads us to the `DeleteFilesInRange` feature. But this feature is not perfect; in fact, it is quite dangerous, because it breaks snapshot consistency. If you acquire a snapshot from RocksDB, use `DeleteFilesInRange` to delete some files, and then try to read that data, you will find that some of it is missing. So we should use this feature carefully.
TiKV uses `DeleteFilesInRange` to destroy tombstone regions and GC
dropped tables. Both cases have a prerequisite that the dropped range
must not be accessed anymore.

View File

@ -1,16 +0,0 @@
---
title: Resource scheduling
menu:
docs:
parent: Deep Dive
weight: 7
---
In a distributed database environment, resource scheduling needs to meet the following requirements:
- Keeping data highly available: The scheduler needs to be able to manage data redundancy to keep the cluster available when some nodes fail.
- Balance server load: The scheduler needs to balance the load to prevent a single node from becoming a performance bottleneck for the entire system.
- Scalability: The scheduler needs to be able to scale to thousands of nodes.
- Fault tolerance: The scheduling process must not be halted by the failure of a single node.
In the TiKV cluster, resource scheduling is done by the Placement Driver (PD). In this chapter, we will first introduce the design of two scheduling systems (Kubernetes and Mesos), followed by the design and implementation of scheduler and placement in PD.

View File

@ -1,36 +0,0 @@
---
title: Kubernetes
menu:
docs:
parent: Resource scheduling
weight: 1
---
Kubernetes is a Docker-based open source container cluster management system initiated and maintained by the Google team. It supports not only common cloud platforms but also internal data centers.
Kubernetes provides a container scheduling service designed to let users manage cloud container clusters without complex setup tasks. The system automatically selects an appropriate worker node to run each container workload.
The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on.
## Scheduling process
The scheduling process is mainly divided into two steps. In the _predicate_ step, the scheduler filters out nodes that do not satisfy the required conditions. In the _priority_ step, the scheduler sorts the nodes that meet all of the fit predicates and then chooses the best one.
### Predicate stage
The scheduler provides some predicate algorithms by default. For instance, the `HostNamePred` predicate checks whether a node's hostname matches the requested hostname; `PodsFitsResourcePred` checks whether a node has sufficient resources, such as CPU, memory, GPU, and opaque integer resources, to run a pod. Relevant code can be found in [kubernetes/pkg/scheduler/algorithm/predicates/](https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/algorithm/predicates).
### Priority stage
In the priority step, the scheduler uses the `PrioritizeNodes` function to rank all nodes by calling each priority function sequentially (a toy version of this scoring appears after the list):
- Each priority function is expected to set a score of 0-10 where 0 is the lowest priority score (least preferred node) and 10 is the highest.
- Add all (weighted) scores for each node to get a total score.
- Select the node with the highest score.
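The toy version below uses invented nodes, scores, and weights, purely to illustrate the weighted-sum selection:

```rust
fn main() {
    // (node name, [score from priority function 1, from priority function 2])
    let nodes = [("node-a", [8u32, 5]), ("node-b", [6, 8])];
    let weights = [2u32, 1]; // weight of each priority function

    // Weighted sum per node; the highest total wins.
    let best = nodes
        .iter()
        .max_by_key(|(_, scores)| {
            scores.iter().zip(&weights).map(|(s, w)| s * w).sum::<u32>()
        })
        .map(|(name, _)| *name);

    assert_eq!(best, Some("node-a")); // 8*2 + 5*1 = 21 beats 6*2 + 8*1 = 20
}
```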
## References
1. [`kube-scheduler` documentation](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/)
2. [Kubernetes introduction (in Chinese)](https://yeasy.gitbooks.io/docker_practice/kubernetes/)
3. [How does Kubernetes' scheduler work](http://carmark.github.io/2015/12/21/How-does-Kubernetes-scheduler-work/)
4. [Kubernetes Scheduling (in Chinese)](https://zhuanlan.zhihu.com/p/27754017)

View File

@ -1,69 +0,0 @@
---
title: Mesos
menu:
docs:
parent: Resource scheduling
weight: 2
---
Mesos was originally launched by UC Berkeley's AMPLab in 2009. It is licensed under Apache and now operated by Mesosphere, Inc.
Mesos can abstract and schedule the resources of the entire data center (including CPU, memory, storage, network, etc.). This allows multiple applications to run in a cluster at the same time without needing to care about the physical distribution of resources.
Mesos has many compelling features, including:
- Support for large scale scenarios with tens of thousands of nodes (adopted by Apple, Twitter, eBay, etc.)
- Support for multiple application frameworks, including Marathon, Singularity, Aurora, etc.
- High Availability (relies on ZooKeeper)
- Support for Docker, LXC and other container techniques
- Providing APIs for several popular languages, including Python, Java, C++, etc.
- A simple and easy-to-use WebUI
## Architecture
It is important to note that Mesos itself is only a resource scheduling framework. It is not a complete application management platform, so Mesos can't work on its own. However, based on Mesos, it is relatively easy to provide distributed operation capabilities for various application management frameworks or middleware platforms. Multiple frameworks can also run in a single Mesos cluster at the same time, improving overall resource utilization efficiency.
{{< figure
src="/img/deep-dive/mesos-architecture.png"
caption="The architecture of Mesos"
number="1" >}}
## Components
Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves.
- Mesos master
The _master_ sees the global information, and is responsible for resource scheduling and logical control between different _frameworks_. The _frameworks_ need to be registered to _master_ in order to be used. It uses Zookeeper to achieve HA.
- Mesos slave
The _slave_ is responsible for reporting the resource status (idle resources, running status, etc.) of the slave node to the _master_, and for isolating local resources to perform the specific tasks assigned by the _master_.
- Frameworks
Each _framework_ consists of two components: a _scheduler_ that registers with the _master_ to be offered resources, and an _executor_ process that is launched on _slave_ nodes to run the _framework_s tasks.
## Resource scheduling
To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called _resource offers_.
Each resource offer is a list of free resources (for example, <1Core CPU, 2GB RAM>) on multiple slaves. While the _master_ decides how many resources to offer to each framework according to an organizational policy, the frameworks schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them.
{{< figure
src="/img/deep-dive/mesos-scheduling.png"
caption="Mesos scheduling"
number="2" >}}
The figure shows an example of how resource scheduling works:
1. Slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation policy module, which tells it that framework 1 should be offered all available resources.
2. The master sends a resource offer describing what is available on slave 1 to framework 1.
3. The frameworks scheduler replies to the master with information about two tasks to run on the slave, using <2 CPUs, 1 GB RAM> for the first task, and <1 CPUs, 2 GB RAM> for the second task.
4. Finally, the master sends the tasks to the slave, which allocates appropriate resources to the frameworks executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2.
To maintain a thin interface and enable frameworks to evolve independently, Mesos does not require frameworks to specify their resource requirements or constraints. Instead, Mesos gives frameworks the ability to reject offers. A framework can reject resources that do not satisfy its constraints in order to wait for ones that do. Thus, the rejection mechanism enables frameworks to support arbitrarily complex resource constraints while keeping Mesos simple and scalable.
## References
1. [Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center](https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf)
2. [Mesos introduction (in Chinese)](https://yeasy.gitbooks.io/docker_practice/mesos/intro.html)

Some files were not shown because too many files have changed in this diff.