## Download and preprocess Criteo TB dataset

[Apache Beam](https://beam.apache.org) enables distributed preprocessing of the
dataset and can be run on
[Google Cloud Dataflow](https://cloud.google.com/dataflow/). The preprocessing
scripts can be run locally via DirectRunner provided that the local host has
enough CPU/Memory/Storage.

Install required packages.

```bash
python3 setup.py install
```

Set up the following environment variables, replacing `bucket-name` with the name
of your Cloud Storage bucket and `my-gcp-project` with your GCP project name.

```bash
export STORAGE_BUCKET=gs://bucket-name
export PROJECT=my-gcp-project
export REGION=us-central1
```

Note: When running locally, the environment variables above are not needed, and
a local path can be used instead of `gs://bucket-name`. Also consider passing a
smaller `max_vocab_size` argument.

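For a local run, the setup might look like the following sketch. The
`CRITEO_LOCAL_ROOT` variable and the `criteo_data` directory name are
hypothetical; the layout mirrors the `criteo_raw/train` and `criteo_raw/test`
paths used in the steps below. The commented command assumes that the scripts
accept `--runner DirectRunner` in place of `DataflowRunner`, and the shard count
of 16 is an arbitrary illustrative value.

```bash
# Hypothetical local root used in place of gs://bucket-name.
CRITEO_LOCAL_ROOT="${CRITEO_LOCAL_ROOT:-${TMPDIR:-/tmp}/criteo_data}"
export STORAGE_BUCKET="${CRITEO_LOCAL_ROOT}"

# Create the same layout that the steps below expect in the bucket.
mkdir -p "${STORAGE_BUCKET}/criteo_raw/train" "${STORAGE_BUCKET}/criteo_raw/test"

# Example local invocation (assumption: DirectRunner is selected with --runner,
# and --project/--region are omitted because no Cloud resources are used):
#   python3 shard_rebalancer.py \
#     --input_path "${STORAGE_BUCKET}/criteo_raw/train/*" \
#     --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/train" \
#     --num_output_files 16 --filetype csv --runner DirectRunner
echo "Local data root: ${STORAGE_BUCKET}"
```
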
1. Download the raw
[Criteo TB dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/)
to a GCS bucket.

Organize the data in the following way:

* The files `day_0.gz`, `day_1.gz`, ..., `day_22.gz` in
`${STORAGE_BUCKET}/criteo_raw/train/`

* The file `day_23.gz` in `${STORAGE_BUCKET}/criteo_raw/test/`

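The layout above can be sketched as a dry run that prints one copy command per
file. This assumes the files `day_0.gz` through `day_23.gz` have already been
downloaded to the current directory and that `gsutil` is installed;
`criteo_dest` is a hypothetical helper, not part of the preprocessing scripts.

```bash
# Assumption: day_0.gz ... day_23.gz are in the current directory.
STORAGE_BUCKET="${STORAGE_BUCKET:-gs://bucket-name}"

# Print the destination for a given day number:
# days 0-22 are training data, day 23 is test data.
criteo_dest() {
  if [ "$1" -le 22 ]; then
    echo "${STORAGE_BUCKET}/criteo_raw/train/"
  else
    echo "${STORAGE_BUCKET}/criteo_raw/test/"
  fi
}

# Dry run: print the copy commands instead of executing them.
day=0
while [ "$day" -le 23 ]; do
  echo "gsutil cp day_${day}.gz $(criteo_dest "$day")"
  day=$((day + 1))
done
```

To perform the copy, replace the `echo "gsutil cp ..."` line with the
`gsutil cp` command itself once the printed destinations look right.
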
2. Shard the raw training/test data into multiple files.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/train" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/test" \
  --num_output_files 64 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

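After a local run it can be useful to confirm that the expected number of shards
was written. The following is a minimal sketch, assuming the output lives in a
local directory and that the shard file names start with the last component of
`--output_path` (the exact naming is an assumption); `count_shards` is a
hypothetical helper.

```bash
# count_shards DIR PREFIX
# Print the number of regular files in DIR whose names start with PREFIX.
count_shards() {
  n=0
  for f in "$1"/"$2"*; do
    # The glob stays unexpanded when nothing matches, so test each entry.
    if [ -f "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example: compare the result against the value passed to --num_output_files.
# count_shards "${STORAGE_BUCKET}/criteo_raw_sharded/train" train
```

For output in Cloud Storage, piping the result of `gsutil ls` on the same prefix
to `wc -l` gives a comparable count.
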
3. Generate vocabulary and preprocess the data.

Generate vocabulary:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/*/*" \
  --output_path "${STORAGE_BUCKET}/criteo/" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

The vocabulary for each feature is written to
`${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab` files.
The vocabulary size of a feature can be found with `wc -l <feature_vocab_file>`.

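To list the vocabulary sizes of all features at once, a loop over the vocabulary
files can be used. This sketch assumes the `feature_??_vocab` files are
available in a local directory (for example after copying them from the bucket);
`vocab_sizes` is a hypothetical helper.

```bash
# vocab_sizes DIR
# Print "<file name> <number of lines>" for every feature_??_vocab file in DIR.
vocab_sizes() {
  for f in "$1"/feature_??_vocab; do
    # Skip the unexpanded pattern when no file matches.
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
  done
}

# Example (the local path is an assumption):
# vocab_sizes ./criteo_vocab/tftransform_tmp
```
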
Preprocess training and test data:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo/train/train" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo/test/test" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

4. (Optional) Re-balance the dataset.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/train/train" \
  --num_output_files 8192 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/test/test" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

At this point, the training and test data are in the following locations:

* `${STORAGE_BUCKET}/criteo_balanced/train/`
* `${STORAGE_BUCKET}/criteo_balanced/test/`

All other directories in the bucket can be removed.
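
The cleanup can be sketched as a dry run that prints the delete commands for
review. This assumes the optional re-balancing step was run (otherwise
`criteo/` holds the final data and must be kept) and that `gsutil` is
installed; `cleanup_commands` is a hypothetical helper.

```bash
STORAGE_BUCKET="${STORAGE_BUCKET:-gs://bucket-name}"

# Intermediate outputs of the earlier steps; criteo_balanced is intentionally not listed.
intermediate_dirs="criteo_raw criteo_raw_sharded criteo_vocab criteo"

# Dry run: print the delete commands instead of executing them.
cleanup_commands() {
  for d in $intermediate_dirs; do
    echo "gsutil rm -r ${STORAGE_BUCKET}/${d}"
  done
}

cleanup_commands
```

Run the printed commands by hand only after confirming that the data in
`criteo_balanced/` is complete.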