mirror of https://github.com/tikv/migration.git
Quick Start
Spark SST Data Source lets you use Spark to decode SST files generated by a RawKV backup into key-value pairs.
Install tikv-client-java
git clone git@github.com:tikv/client-java.git
mvn --file client-java/pom.xml clean install -DskipTests
Build sst-data-source project
git clone git@github.com:tikv/migration.git
cd migration
mvn clean package -DskipTests -am -pl sst-data-source
Export SST
br backup raw \
--pd 127.0.0.1:2379 \
--storage "hdfs:///path/to/sst/" \
--start s \
--end t \
--format raw \
--cf default
Run SSTDataSourceExample
spark-submit \
--master local[*] \
--jars /path/to/tikv-client-java-3.3.0-SNAPSHOT.jar \
--class org.tikv.datasources.sst.example.SSTDataSourceExample \
sst-data-source/target/sst-data-source-0.0.1-SNAPSHOT.jar \
hdfs:///path/to/sst/
Call Spark SST Data Source
You can also write a self-contained application that decodes SST files.
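import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  // Create (or reuse) the session; spark-submit supplies the master URL.
  val spark = SparkSession.builder().getOrCreate()
  val sstFilePath = "hdfs:///path/to/sst/"
  // Load the SST files through the "sst" data source.
  val df = spark.read
    .format("sst")
    .load(sstFilePath)
  df.printSchema()
  println(df.count()) // print the number of decoded key-value pairs
  df.show(false)
}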
The output of df.printSchema() is as follows:
root
|-- key: binary (nullable = false)
|-- value: binary (nullable = true)
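Both columns are binary, so keys and values usually need an explicit text encoding before they can be logged or compared by eye. The following is a minimal sketch in plain Scala; `BinaryFormat` and `toHex` are hypothetical names used for illustration and are not part of sst-data-source.

```scala
object BinaryFormat {
  // Hypothetical helper (not part of sst-data-source):
  // render raw bytes as a lowercase hex string, two digits per byte.
  def toHex(bytes: Array[Byte]): String =
    bytes.map(b => "%02x".format(b & 0xff)).mkString

  def main(args: Array[String]): Unit = {
    // A made-up key consisting of 's', a NUL byte, and 'k'.
    val key = Array[Byte]('s'.toByte, 0, 'k'.toByte)
    println(toHex(key)) // prints 73006b
  }
}
```

When working on the DataFrame itself, Spark's built-in `hex` function from `org.apache.spark.sql.functions` gives a comparable (uppercase) encoding, for example `df.select(hex(col("key")))`.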
Parameters
| Key | Default Value | Description |
|---|---|---|
| path | - | The path to the SST Files, e.g. hdfs:/path/to/sst/ |
| enable-ttl | false | Whether the TiKV Cluster enables ttl |
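As a sketch of how these parameters might be supplied, the snippet below passes `enable-ttl` through the standard `DataFrameReader.option` call and `path` through `load`. It assumes the data source reads `enable-ttl` as a regular Spark read option. `SstReadWithTtl` and `readSst` are hypothetical names that do not exist in sst-data-source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SstReadWithTtl {
  // Hypothetical helper (not part of sst-data-source).
  // Assumption: "enable-ttl" is picked up as an ordinary read option.
  def readSst(spark: SparkSession, path: String, enableTtl: Boolean): DataFrame =
    spark.read
      .format("sst")
      .option("enable-ttl", enableTtl) // should match the TiKV cluster's TTL setting
      .load(path)                      // "path" parameter from the table above
}
```

A call such as `SstReadWithTtl.readSst(spark, "hdfs:///path/to/sst/", enableTtl = true)` would then be used for a backup taken from a cluster with TTL enabled.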
Spark Version
The default Spark version is 3.0.2. To build against a different Spark version, compile with the following command, substituting the version you need:
mvn clean package -DskipTests -Dspark.version.compile=3.1.1
Develop
To format the code, please run mvn mvn-scalafmt_2.12:format or mvn clean package -DskipTests.