Semantic Code Search

This demo implements end-to-end semantic code search on Kubeflow. It is based on the public GitHub dataset hosted on BigQuery.

Setup

Prerequisites

  • Python 2.7 (with pip)
  • Python virtualenv
  • Node
  • Docker
  • Ksonnet

NOTE: Apache Beam lacks Python 3 support, hence the Python 2.7 requirement.
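
A quick way to confirm the prerequisites are available (assuming they are already installed and on your PATH):

$ python2 --version
$ virtualenv --version
$ node --version
$ docker --version
$ ks version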

Google Cloud Setup

  • Install the gcloud CLI

  • Set up Application Default Credentials

$ gcloud auth application-default login

  • Enable Dataflow via the command line (or use the Google Cloud Console)

$ gcloud services enable dataflow.googleapis.com

  • Create a Google Cloud Project and a Google Cloud Storage bucket (a minimal sketch follows this list)

  • Authenticate with Google Container Registry to push Docker images

$ gcloud auth configure-docker

See Google Cloud Docs for more.
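
For the project and bucket step above, a minimal sketch (the project ID, bucket name, and region below are placeholder assumptions; billing must already be set up on the project):

$ gcloud projects create my-code-search-project
$ gcloud config set project my-code-search-project
$ gsutil mb -l us-central1 gs://my-code-search-bucket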

Create Kubernetes Secrets

This is needed so that pods deployed in the Kubernetes cluster can access Google Cloud resources.

$ PROJECT=my-project ./create_secrets.sh

NOTE: Use create_secrets.sh -d to remove any side-effects of the above step.
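
To verify that the secrets were created (assuming kubectl points at your Kubeflow cluster):

$ kubectl get secrets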

Python Environment Setup

This demo needs multiple Python versions, and virtualenv is an easy way to create isolated environments.

$ virtualenv -p $(which python2) env2.7 

This creates a env2.7 environment folder.

To use the environment,

$ source env2.7/bin/activate

See Virtualenv Docs for more.

NOTE: The env2.7 environment must remain activated for all subsequent steps.

Python Dependencies

To install dependencies, run the following commands

(env2.7) $ pip install https://github.com/kubeflow/batch-predict/tarball/master
(env2.7) $ pip install src/

This will install everything needed to run the demo code.
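
The console scripts used in later steps should now be available inside the virtualenv; a quick sanity check:

(env2.7) $ which code-search-preprocess code-search-predict nmslib-create nmslib-serve
(env2.7) $ which t2t-trainer t2t-exporter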

Node Dependencies

$ pushd ui && npm i && popd

Build and Push Docker Images

The docker directory contains a Dockerfile for each target application, each with its own build.sh. These images are needed to run the training jobs on the Kubeflow cluster.

To build the Docker image for training jobs

$ ./docker/t2t/build.sh

To build the Docker image for Code Search UI

$ ./docker/ui/build.sh

Optionally, to push these images to GCR, export the PROJECT=<my project name> environment variable and run the appropriate build script, as shown below.
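
For example, assuming a project named my-project and that the build scripts read PROJECT the same way create_secrets.sh does:

$ PROJECT=my-project ./docker/t2t/build.sh
$ PROJECT=my-project ./docker/ui/build.sh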

See GCR Pushing and Pulling Images for more.

Pipeline

1. Data Pre-processing

This step reads the public GitHub dataset, generates function and docstring token pairs, and saves the results back into a BigQuery table. It runs as a Dataflow job.

(env2.7) $ export GCS_DIR=gs://kubeflow-examples/t2t-code-search
(env2.7) $ code-search-preprocess -r DataflowRunner -o code_search:function_docstrings \
              -p kubeflow-dev -j process-github-archive --storage-bucket ${GCS_DIR} \
              --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
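
Once the Dataflow job finishes, the output table (named by the -o flag above) can be spot-checked with the bq CLI:

(env2.7) $ bq ls code_search
(env2.7) $ bq head -n 5 code_search.function_docstrings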

2. Model Training

We use tensor2tensor to train our model.

(env2.7) $ t2t-trainer --generate_data --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
                      --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
                      --train_steps=100 --eval_steps=10 \
                      --t2t_usr_dir=src/code_search/t2t

A Dockerfile based on TensorFlow is provided which has all the dependencies for this part of the pipeline. By default, it is based on the TensorFlow CPU 1.8.0 image for Python 3, but this can be overridden in the Docker image build. The build script from the earlier step builds and pushes the Docker image to Google Container Registry.

3. Model Export

We use t2t-exporter to export our trained model above into the TensorFlow SavedModel format.

(env2.7) $ t2t-exporter --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
                      --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
                      --t2t_usr_dir=src/code_search/t2t

4. Batch Prediction for Code Embeddings

We run another Dataflow pipeline that uses the exported model above to compute a high-dimensional embedding for each of our code examples. Specify the model version (a UNIX timestamp) from the output directory; it is the name of a folder under ${GCS_DIR}/output/export/Servo.
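
To list the available export versions:

(env2.7) $ gsutil ls ${GCS_DIR}/output/export/Servo/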

(env2.7) $ export MODEL_VERSION=<put_unix_timestamp_here>

Now, start the job,

(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \
              --data-dir "${GCS_DIR}/data" --saved-model-dir "${SAVED_MODEL_DIR}" \
              -p kubeflow-dev -j batch-predict-github-archive --storage-bucket ${GCS_DIR} \
              --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16

5. Create an NMSLib Index

Using the above embeddings, we will now create an NMSLib index which will serve as our search index for new incoming queries.

(env2.7) $ export INDEX_FILE=  # TODO(sanyamkapoor): Add the index file
(env2.7) $ nmslib-create --data-file=${EMBEDDINGS_FILE} --index-file=${INDEX_FILE}

6. Run a TensorFlow Serving container

This will start a TF Serving container using the model exported above and serve it on port 8501.

$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \
             --rest_api_port=8501 --model_name=t2t_code_search --model_base_path=${GCS_DIR}/output/export/Servo
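
Once the container is up, TF Serving's REST API exposes a model status endpoint that works as a quick health check:

$ curl http://localhost:8501/v1/models/t2t_code_search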

7. Serve the Search Engine

We will now serve the search engine via a simple REST interface.

(env2.7) $ nmslib-serve --serving-url=http://localhost:8501/v1/models/t2t_code_search:predict \
                        --problem=github_function_docstring --data-dir=${GCS_DIR}/data --index-file=${INDEX_FILE}

8. Serve the UI

This will serve as the graphical wrapper on top of the REST search engine started in the previous step.

$ pushd ui && npm run build && popd
$ serve -s ui/build
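
The serve command above is assumed here to be the npm serve package (any static file server pointed at ui/build works). If it is not already installed:

$ npm install -g serve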

Pipeline on Kubeflow

TODO

Acknowledgements

This project derives from hamelsmu/code_search.