---
title: Distributed SQL
aliases: ['/docs/deep-dive/distributed-sql/dist-sql']
menu:
    nav:
        parent: Distributed SQL over TiKV
        weight: 2
---

By now we already know how [TiDB]'s relational structure is encoded into the Key-Value form with version.  In this section, we will focus on the following questions:

* What happens when [TiDB] receives a SQL query?
* How does [TiDB] execute SQL queries in a distributed way?
  
## What happens when [TiDB] receives a SQL query?

Firstly, let's have a look at the following example:

```sql
select count(*) from t where a + b  > 5;
```

{{< figure
    src="/img/deep-dive/select-from-tidb.png"
    caption="SQL query diagram"
    number="1" >}}

As described in the above figure, when [TiDB] receives a SQL query from the client, it will process with the following steps:

1. [TiDB] receives a new SQL from the client.
2. [TiDB] prepares the processing plans for this request, meanwhile [TiDB] gets a timestamp from [PD] as the `start_ts` of this transaction.
3. [TiDB] tries to get the information schema (metadata of the table) from TiKV.
4. [TiDB] prepares Regions for each related key according to the information schema and the SQL query. Then [TiDB] gets information for the related Regions from [PD].
5. [TiDB] groups the related keys by Region.
6. [TiDB] dispatches the tasks to the related TiKV concurrently.
7. [TiDB] reassembles the data and returns the data to the client.

## How does [TiDB] execute SQL queries in a distributed way?

In short, [TiDB] splits the task by Regions and sends them to TiKV concurrently.

For the above example, we assume the rows with the primary key of table `t` are distributed in three Regions:

* Rows with the primary key in [0,100) are in Region 1.
* Rows with primary key in [100,1000) are in region 2.
* Rows with primary key in [1000,~) are in region 3.

Now we can do `count` and sum the result from the above three Regions.

{{< figure
    src="/img/deep-dive/coprocessor-select.png"
    caption="Coprocessor diagram"
    number="2" >}}

### Exectors

Now we know [TiDB] splits a read task by Regions, but how does TiKV know what are its tasks to handle?
Here [TiDB] will send a Directed Acyclic Graph (DAG) to TiKV with each node as an executor.

{{< figure
    src="/img/deep-dive/executors.jpg"
    caption="Executors"
    number="3" >}}

Supported executors:

* TableScan: Scans the rows with the primary key from the KV store.
* IndexScan: It will scan the index data from the KV store.
* Selection: performs a filter (mostly for `where`). The input is `TableScan` or `IndexScan`.
* Aggregation: performs an aggregation (e.g. `count(*)`,`sum(xxx)`). The input is `TableScan`,`IndexScan`, or`Selection`.
* TopN: sorts the data and returns the top n matches, for example, `order by xxx limit 10`. The input is `TableScan`,`IndexScan`, or`Selection`.

{{< figure
    src="/img/deep-dive/executors-example.jpg"
    caption="Executors example"
    number="4" >}}

For the above example, we have the following executors on Region 1:

* Aggregation: `count(*)`.
* Selection: `a + b > 5`
* TableScan: `range:[0,100)`.

### Expression

We have executors as nodes in the DAG, but how do we describe columns, constants, and functions in an `Aggregation` or a `Selection`?
There are three types of expressions:

* Column: a column in the table.
* Constant: a constant, which could be a string, int, duration, and so on.
* Scalar function: describes a function.

{{< figure
    src="/img/deep-dive/expression.jpg"
    caption="Expression"
    number="5" >}}

For the above example `select count(*) from t where a + b > 5`, we have:

* Column: a, b.
* Scalar functions: `+`,`>`.
* Constant: `5`.

[TiDB]: https://github.com/pingcap/tidb
[PD]: https://github.com/pingcap/pd