mirror of https://github.com/kubeflow/examples.git

Add a Jupyter notebook to be used for Kubeflow codelabs (#217)

* Add a Jupyter notebook to be used for Kubeflow codelabs
* Add help command for create_function_embeddings module
* Update README to point to Jupyter Notebook
* Add prerequisites to readme
* Update README and getting started with notebook guide
* [wip]
* Update notebook with BigQuery previews
* Update notebook to automatically select the latest MODEL_VERSION

This commit is contained in:
parent a80c15b50e
commit a687c51036

@@ -1,204 +1,54 @@
# Semantic Code Search

This demo implements end-to-end Semantic Code Search on Kubeflow. It is based on the public
GitHub dataset hosted on BigQuery.

## Setup

### Prerequisites

* Python 2.7 (with `pip`)
* Python `virtualenv`
* Node
* Docker
* Ksonnet

**NOTE**: `Apache Beam` lacks `Python3` support, hence the Python 2.7 requirement.

### Google Cloud Setup

* Install the [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI

* Set up Application Default Credentials
```
$ gcloud auth application-default login
```

* Enable Dataflow via the command line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```

* Create a Google Cloud Project and a Google Cloud Storage bucket.

* Authenticate with Google Container Registry to push Docker images
```
$ gcloud auth configure-docker
```

See the [Google Cloud Docs](https://cloud.google.com/docs/) for more.

### Python Environment Setup

This demo needs multiple Python versions, and `virtualenv` is an easy way to
create isolated environments.

```
$ virtualenv -p $(which python2) env2.7
```

This creates an `env2.7` environment folder.

To use the environment,

```
$ source env2.7/bin/activate
```

See the [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more.

**NOTE**: The `env2.7` environment must be activated for all steps from now on.

### Python Dependencies

To install dependencies, run the following commands

```
(env2.7) $ pip install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider
(env2.7) $ pip install src/
```

This will install everything needed to run the demo code.

### Node Dependencies

```
$ pushd ui && npm i && popd
```

### Build and Push Docker Images

The `docker` directory contains a Dockerfile for each target application, each with its own `build.sh`. These images
are needed to run the training jobs in the Kubeflow cluster.

To build the Docker image for training jobs

```
$ ./docker/t2t/build.sh
```

To build the Docker image for the Code Search UI

```
$ ./docker/ui/build.sh
```

Optionally, to push these images to GCR, export the `PROJECT=<my project name>` environment variable
and use the appropriate build script.
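For example, a minimal sketch of building and pushing the training image, assuming the build script reads the `PROJECT` variable to tag the image for GCR (replace the placeholder with your own project name):

```
$ export PROJECT=<my project name>
$ ./docker/t2t/build.sh
```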
See [GCR Pushing and Pulling Images](https://cloud.google.com/container-registry/docs/pushing-and-pulling) for more.

# Pipeline

## 1. Data Pre-processing

This step takes in the public GitHub dataset and generates function and docstring token pairs.
Results are saved back into a BigQuery table. It is done via a `Dataflow` job.

```
(env2.7) $ export GCS_DIR=gs://kubeflow-examples/t2t-code-search
(env2.7) $ code-search-preprocess -r DataflowRunner -o code_search:function_docstrings \
            -p kubeflow-dev -j process-github-archive --storage-bucket ${GCS_DIR} \
            --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```

## 2. Model Training

We use `tensor2tensor` to train our model.

```
(env2.7) $ t2t-trainer --generate_data --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
            --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
            --train_steps=100 --eval_steps=10 \
            --t2t_usr_dir=src/code_search/t2t
```

A `Dockerfile` based on TensorFlow is provided which has all the dependencies for this part of the pipeline.
By default, it is based on TensorFlow CPU 1.8.0 for `Python3`, but this can be overridden in the Docker image build.
The `build.sh` script described above builds this Docker image and pushes it to Google Container Registry.

## 3. Model Export

We use `t2t-exporter` to export the trained model above into the TensorFlow `SavedModel` format.

```
(env2.7) $ t2t-exporter --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
            --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
            --t2t_usr_dir=src/code_search/t2t
```

## 4. Batch Prediction for Code Embeddings

We run another `Dataflow` pipeline that uses the exported model above to compute a high-dimensional embedding for each of
our code examples. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of
a folder at the path `${GCS_DIR}/output/export/Servo`.

```
(env2.7) $ export MODEL_VERSION=<put_unix_timestamp_here>
```
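If you are unsure which timestamps are available, you can list the export directory (assuming `gsutil` is installed and authenticated):

```
$ gsutil ls ${GCS_DIR}/output/export/Servo
```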
Now, start the job,

```
(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \
            --data-dir "${GCS_DIR}/data" --saved-model-dir "${SAVED_MODEL_DIR}" \
            -p kubeflow-dev -j batch-predict-github-archive --storage-bucket ${GCS_DIR} \
            --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```

## 5. Create an NMSLib Index

Using the embeddings above, we will now create an NMSLib index which will serve as our search index for
new incoming queries.

```
(env2.7) $ export INDEX_FILE= # TODO(sanyamkapoor): Add the index file
(env2.7) $ nmslib-create --data-file=${EMBEDDINGS_FILE} --index-file=${INDEX_FILE}
```

## 6. Run a TensorFlow Serving container

This will start a TF Serving container using the model export above and expose it on port 8501.

```
$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \
    --rest_api_port=8501 --model_name=t2t_code_search --model_base_path=${GCS_DIR}/output/export/Servo
```
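Optionally, as a quick sanity check that the server is up, you can query the TensorFlow Serving REST status endpoint for the model name passed above:

```
$ curl http://localhost:8501/v1/models/t2t_code_search
```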
## 7. Serve the Search Engine

We will now serve the search engine via a simple REST interface.

```
(env2.7) $ nmslib-serve --serving-url=http://localhost:8501/v1/models/t2t_code_search:predict \
            --problem=github_function_docstring --data-dir=${GCS_DIR}/data --index-file=${INDEX_FILE}
```

## 8. Serve the UI

This serves as the graphical wrapper on top of the REST search engine started in the previous step.

```
$ pushd ui && npm run build && popd
$ serve -s ui/build
```

# Pipeline on Kubeflow

TODO
# Code Search on Kubeflow

This demo implements end-to-end Code Search on Kubeflow.

# Prerequisites

**NOTE**: If using the JupyterHub Spawner on a Kubeflow cluster, use the Docker image
`gcr.io/kubeflow-images-public/kubeflow-codelab-notebook`, which has all the prerequisites baked in.

* `Kubeflow Latest`
  This notebook assumes a Kubeflow cluster is already deployed. See
  [Getting Started with Kubeflow](https://www.kubeflow.org/docs/started/getting-started/).

* `Python 2.7` (bundled with `pip`)
  For this demo, we will use Python 2.7. This restriction is due to [Apache Beam](https://beam.apache.org/),
  which does not support Python 3 yet (see [BEAM-1251](https://issues.apache.org/jira/browse/BEAM-1251)).

* `Google Cloud SDK`
  This example uses tools from the [Google Cloud SDK](https://cloud.google.com/sdk/). The SDK
  must be authenticated and authorized; see the example after this list and the
  [Authentication Overview](https://cloud.google.com/docs/authentication/).

* `Ksonnet 0.12`
  We use [Ksonnet](https://ksonnet.io/) to write Kubernetes jobs in a declarative manner to be run
  on top of Kubeflow.
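One way to authenticate the SDK from a terminal, as referenced in the prerequisites above (a minimal sketch; replace the placeholder with your own project ID):

```
$ gcloud auth login
$ gcloud auth application-default login
$ gcloud config set project <my project id>
```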
# Getting Started

To get started, follow the instructions below.

**NOTE**: We will assume that the Kubeflow cluster is available at `kubeflow.example.com`. Make sure
you replace this with the true FQDN of your Kubeflow cluster in any subsequent instructions.

* Spawn a new JupyterLab instance inside the Kubeflow cluster by pointing your browser to
  **https://kubeflow.example.com/hub** and clicking "**Start My Server**".

* In the **Image** text field, enter `gcr.io/kubeflow-images-public/kubeflow-codelab-notebook:v20180808-v0.2-22-gcfdcb12`.
  This image contains all the prerequisites needed for the demo.

* Once spawned, you should be redirected to the notebooks UI. We intend to go to the JupyterLab home
  page, which is available at **https://kubeflow.example.com/user/<ACCOUNT_NAME>/lab**.
  **TIP**: Simply point the browser to the **/lab** path instead of the **/tree** path in the URL.

* Spawn a new Terminal and run
  ```
  $ git clone --branch=master --depth=1 https://github.com/kubeflow/examples
  ```
  This will create an `examples` folder. It is safe to close the terminal now.

* Refresh the File Explorer (typically to the left) and navigate to `examples/code_search`. Open
  the Jupyter notebook `code-search.ipynb` and follow along.

# Acknowledgements

@@ -0,0 +1,582 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# Code Search on Kubeflow\n",
    "\n",
    "This notebook implements an end-to-end Semantic Code Search on top of [Kubeflow](https://www.kubeflow.org/) - given an input query string, get a list of code snippets semantically similar to the query string.\n",
    "\n",
    "**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install dependencies\n",
    "\n",
    "Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while and only needs to be run once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FIXME(sanyamkapoor): The Kubeflow Batch Prediction dependency is installed from a fork for reasons in\n",
    "# kubeflow/batch-predict#9 and corresponding issue kubeflow/batch-predict#10\n",
    "! pip2 install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider\n",
    "\n",
    "! pip2 install -r src/requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Only for BigQuery cells\n",
    "! pip2 install pandas-gbq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas.io import gbq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure Variables\n",
    "\n",
    "This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration Variables. Modify as desired.\n",
    "\n",
    "PROJECT = 'kubeflow-dev'\n",
    "CLUSTER_NAME = 'kubeflow-latest'\n",
    "CLUSTER_REGION = 'us-east1-d'\n",
    "CLUSTER_NAMESPACE = 'kubeflow-latest'\n",
    "\n",
    "TARGET_DATASET = 'code_search'\n",
    "WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/20180813'\n",
    "WORKER_MACHINE_TYPE = 'n1-highcpu-32'\n",
    "NUM_WORKERS = 16\n",
    "\n",
    "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
    "%env PROJECT $PROJECT\n",
    "%env CLUSTER_NAME $CLUSTER_NAME\n",
    "%env CLUSTER_REGION $CLUSTER_REGION\n",
    "%env CLUSTER_NAMESPACE $CLUSTER_NAMESPACE\n",
    "\n",
    "%env TARGET_DATASET $TARGET_DATASET\n",
    "%env WORKING_DIR $WORKING_DIR\n",
    "%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE\n",
    "%env NUM_WORKERS $NUM_WORKERS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Authorization\n",
    "\n",
    "In a Kubeflow cluster, we already have the key credentials available with each pod and will re-use them to authenticate. This will allow us to submit `TFJob`s and execute `Dataflow` pipelines. We also set the new context for the Code Search Ksonnet application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Activate Service Account provided by Kubeflow.\n",
    "gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}\n",
    "\n",
    "# Get KUBECONFIG for the desired cluster.\n",
    "gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}\n",
    "\n",
    "# Set the namespace of the context.\n",
    "kubectl config set contexts.$(kubectl config current-context).namespace ${CLUSTER_NAMESPACE}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Ksonnet Application\n",
    "\n",
    "This will use the context we've set above and provide it as a new environment to the Ksonnet application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks env add code-search --context=$(kubectl config current-context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify Version Information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "echo \"Pip Version Info: \" && pip2 --version && echo\n",
    "echo \"Google Cloud SDK Info: \" && gcloud --version && echo\n",
    "echo \"Ksonnet Version Info: \" && ks version && echo\n",
    "echo \"Kubectl Version Info: \" && kubectl version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View Github Files\n",
    "\n",
    "This is the query that is run as the first step of the Pre-Processing pipeline and is sent through a set of transformations. This is illustrative of the rows being processed in the pipeline we trigger next."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT\n",
    "    MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,\n",
    "    c.content\n",
    "  FROM\n",
    "    `bigquery-public-data.github_repos.files` AS f\n",
    "  JOIN\n",
    "    `bigquery-public-data.github_repos.contents` AS c\n",
    "  ON\n",
    "    f.id = c.id\n",
    "  JOIN (\n",
    "    --this part of the query makes sure repo is watched at least twice since 2017\n",
    "    SELECT\n",
    "      repo\n",
    "    FROM (\n",
    "      SELECT\n",
    "        repo.name AS repo\n",
    "      FROM\n",
    "        `githubarchive.year.2017`\n",
    "      WHERE\n",
    "        type=\"WatchEvent\"\n",
    "      UNION ALL\n",
    "      SELECT\n",
    "        repo.name AS repo\n",
    "      FROM\n",
    "        `githubarchive.month.2018*`\n",
    "      WHERE\n",
    "        type=\"WatchEvent\" )\n",
    "    GROUP BY\n",
    "      1\n",
    "    HAVING\n",
    "      COUNT(*) >= 2 ) AS r\n",
    "  ON\n",
    "    f.repo_name = r.repo\n",
    "  WHERE\n",
    "    f.path LIKE '%.py' AND --with python extension\n",
    "    c.size < 15000 AND --get rid of ridiculously long files\n",
    "    REGEXP_CONTAINS(c.content, r'def ') --contains function definition\n",
    "  GROUP BY\n",
    "    c.content\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\"\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-Processing Github Files\n",
    "\n",
    "In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.preprocess_github_dataset -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Dataflow Job for Pre-Processing\n",
    "\n",
    "See help above for a short description of each argument. The values are being taken from environment variables defined earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "JOB_NAME=\"preprocess-github-dataset-$(date +'%Y%m%d-%H%M%S')\"\n",
    "\n",
    "python2 -m code_search.dataflow.cli.preprocess_github_dataset \\\n",
    "  --runner DataflowRunner \\\n",
    "  --project \"${PROJECT}\" \\\n",
    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
    "  --job_name \"${JOB_NAME}\" \\\n",
    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
    "  --num_workers \"${NUM_WORKERS}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When completed successfully, this should create a dataset in `BigQuery` named `target_dataset`. Additionally, it also dumps CSV files into `data_dir` which contain training samples (pairs of function and docstrings) for our Tensorflow Model. A representative set of results can be viewed using the following query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT *\n",
    "  FROM\n",
    "    {}.token_pairs\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\".format(TARGET_DATASET)\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Dataset for Training\n",
    "\n",
    "In this step we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-datagen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute Tensorflow Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export Tensorflow Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-exporter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Function Embeddings\n",
    "\n",
    "In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.create_function_embeddings -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configuration\n",
    "\n",
    "First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. This should be the name of a folder with a UNIX seconds timestamp, like `1533685294`. Below, we do that automatically by selecting the folder that represents the latest timestamp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash --out EXPORT_DIR_LS\n",
    "\n",
    "gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE \"([0-9]+)/$\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\\n') if ts])\n",
    "\n",
    "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
    "%env MODEL_VERSION $MODEL_VERSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Dataflow Job for Function Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.create_function_embeddings \\\n",
    "  --runner DataflowRunner \\\n",
    "  --project \"${PROJECT}\" \\\n",
    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
    "  --problem github_function_docstring \\\n",
    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
    "  --saved_model_dir \"${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}\" \\\n",
    "  --job_name compute-function-embeddings \\\n",
    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
    "  --num_workers \"${NUM_WORKERS}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each existing data sample available from the previous Dataflow Job. Additionally, it also dumps a CSV file containing metadata for each function and its embedding. A representative query result is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT *\n",
    "  FROM\n",
    "    {}.function_embeddings\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\".format(TARGET_DATASET)\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Search Index\n",
    "\n",
    "We now create the Search Index from the computed embeddings so that during a query we can do a k-Nearest Neighbor search to give out semantically similar results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c search-index-creator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples, used for a human-readable reverse lookup at query time, is also created in the `WORKING_DIR`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy an Inference Server\n",
    "\n",
    "We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported Tensorflow model above using [Tensorflow Serving](https://www.tensorflow.org/serving/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-serving"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy Search UI\n",
    "\n",
    "We finally deploy the Search UI, which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c search-index-server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The service should now be available at the FQDN of the Kubeflow cluster at the path `/code-search/`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}