examples/code_search/code-search.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Code Search on Kubeflow\n",
"\n",
"This notebook implements an end-to-end Semantic Code Search on top of [Kubeflow](https://www.kubeflow.org/) - given an input query string, get a list of code snippets semantically similar to the query string.\n",
"\n",
"**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook,."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies\n",
"\n",
"Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while the first time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify Version Information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"echo \"Pip Version Info: \" && python2 --version && python2 -m pip --version && echo\n",
"echo \"Google Cloud SDK Info: \" && gcloud --version && echo\n",
"echo \"Ksonnet Version Info: \" && ks version && echo\n",
"echo \"Kubectl Version Info: \" && kubectl version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install Pip Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"! python2 -m pip install -U pip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Code Search dependencies\n",
"! python2 -m pip install --user https://github.com/kubeflow/batch-predict/tarball/master\n",
"! python2 -m pip install --user -r src/requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# BigQuery Cell Dependencies\n",
"! python2 -m pip install --user pandas-gbq"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.\n",
"from pandas.io import gbq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Variables\n",
"\n",
"* This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps.\n",
"* Set the following variables\n",
"\n",
" * **PROJECT**: Set this to the GCP project you want to use\n",
" * If gcloud has a project set this will be used by default\n",
" * To use a different project or if gcloud doesn't have a project set you will need to configure one explicitly\n",
" * **WORKING_DIR**: Override this if you don't want to use the default configured below \n",
" * **KS_ENV**: Set this to the name of the ksonnet environment you want to create"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import subprocess\n",
"# Configuration Variables. Modify as desired.\n",
"\n",
"PROJECT = subprocess.check_output([\"gcloud\", \"config\", \"get-value\", \"project\"]).strip()\n",
"\n",
"# Dataflow Related Variables.\n",
"TARGET_DATASET = 'code_search'\n",
"WORKING_DIR = 'gs://{0}_code_search/workingDir'.format(PROJECT)\n",
"KS_ENV=getpass.getuser()\n",
"\n",
"# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
"%env PROJECT $PROJECT\n",
"%env TARGET_DATASET $TARGET_DATASET\n",
"%env WORKING_DIR $WORKING_DIR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup Authorization\n",
"\n",
"In a Kubeflow cluster on GKE, we already have the Google Application Credentials mounted onto each Pod. We can simply point `gcloud` to activate that service account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Activate Service Account provided by Kubeflow.\n",
"gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, to interact with the underlying cluster, we configure `kubectl`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"kubectl config set-cluster kubeflow --server=https://kubernetes.default --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n",
"kubectl config set-credentials jupyter --token \"$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\"\n",
"kubectl config set-context kubeflow --cluster kubeflow --user jupyter\n",
"kubectl config use-context kubeflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Collectively, these allow us to interact with Google Cloud Services as well as the Kubernetes Cluster directly to submit `TFJob`s and execute `Dataflow` pipelines."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup Ksonnet Application\n",
"\n",
"We now point the Ksonnet application to the underlying Kubernetes cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"# Update Ksonnet to point to the Kubernetes Cluster\n",
"ks env add code-search\n",
"\n",
"# Update Ksonnet to use the namespace where kubeflow is deployed. By default it's 'kubeflow'\n",
"ks env set code-search --namespace=kubeflow\n",
"\n",
"# Update the Working Directory of the application\n",
"sed -i'' \"s,gs://example/prefix,${WORKING_DIR},\" components/params.libsonnet\n",
"\n",
"# FIXME(sanyamkapoor): This command completely replaces previous configurations.\n",
"# Hence, using string replacement in file.\n",
"# ks param set t2t-code-search workingDir ${WORKING_DIR}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Github Files\n",
"\n",
"This is the query that is run as the first step of the Pre-Processing pipeline and is sent through a set of transformations. This is illustrative of the rows being processed in the pipeline we trigger next.\n",
"\n",
"**WARNING**: The table is large and the query can take a few minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT\n",
" MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,\n",
" c.content\n",
" FROM\n",
" `bigquery-public-data.github_repos.files` AS f\n",
" JOIN\n",
" `bigquery-public-data.github_repos.contents` AS c\n",
" ON\n",
" f.id = c.id\n",
" JOIN (\n",
" --this part of the query makes sure repo is watched at least twice since 2017\n",
" SELECT\n",
" repo\n",
" FROM (\n",
" SELECT\n",
" repo.name AS repo\n",
" FROM\n",
" `githubarchive.year.2017`\n",
" WHERE\n",
" type=\"WatchEvent\"\n",
" UNION ALL\n",
" SELECT\n",
" repo.name AS repo\n",
" FROM\n",
" `githubarchive.month.2018*`\n",
" WHERE\n",
" type=\"WatchEvent\" )\n",
" GROUP BY\n",
" 1\n",
" HAVING\n",
" COUNT(*) >= 2 ) AS r\n",
" ON\n",
" f.repo_name = r.repo\n",
" WHERE\n",
" f.path LIKE '%.py' AND --with python extension\n",
" c.size < 15000 AND --get rid of ridiculously long files\n",
" REGEXP_CONTAINS(c.content, r'def ') --contains function definition\n",
" GROUP BY\n",
" c.content\n",
" LIMIT\n",
" 10\n",
"\"\"\"\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define an experiment\n",
"\n",
"* This solution consists of multiple jobs and servers that need to share parameters\n",
"* To facilitate this we use [experiments.libsonnet](https://github.com/kubeflow/examples/blob/master/code_search/kubeflow/components/experiments.libsonnet) to define sets of parameters\n",
"* Each set of parameters has a name corresponding to a key in the dictionary defined in **experiments.libsonnet**\n",
"* You configure an experiment by defining a set of experiments in **experiments.libsonnet**\n",
"* You then set the global parameter **experiment** to the name of the experiment defining the parameters you want\n",
" to use.\n",
"\n",
"To get started define your experiment\n",
"\n",
"* Create a new entry containing a set of values to be used for your experiment\n",
"* Pick a suitable name for your experiment.\n",
"* Set the following values\n",
"\n",
" * **outputDir**: The GCS directory where the output should be written\n",
" * **train_steps**: Numer oftraining steps\n",
" * **eval_steps**: Number of steps to be used for eval\n",
" * **hparams_set**: The set of hyperparameters to use; see some suggestions [here](https://github.com/tensorflow/tensor2tensor#language-modeling).\n",
" \n",
" * **transformer_tiny** can be used to train a very small model suitable for ensuring the code works.\n",
" * **project**: Set this to the GCP project you can use\n",
" * **modelDir**: \n",
" * After training your model set this to a GCS directory containing the export model\n",
" * e.g gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/\n",
" * **problem**: set this to \"kf_github_function_docstring\",\n",
" * **model**: set this \"kf_similarity_transformer\",\n",
" * **lookupFile**: set this to the GCS location of the CSV produced by the job to create the nmslib index of the embeddings for all GitHub data\n",
" * **indexFile*:* set this to the GCS location of the nmslib index for all the data in GitHub\n",
"\n",
"Configure your ksonnet environment to use your experiment\n",
"\n",
"* Open **kubeflow/environments/${ENVIRONMENT}/globals.libsonnet**\n",
"* Define the following global parameters\n",
"\n",
" * **experiment**: Name of your experiment; should correspond to a key defined in **experiments.libsonnet**\n",
" * **project**: Set this to the GCP project you can use\n",
" * **dataDir**: The data directory to be used by T2T \n",
" * **workingDir**: Working directory\n",
" \n",
"* Here's an [example](https://github.com/kubeflow/examples/blob/master/code_search/kubeflow/environments/cs_demo/globals.libsonnet) of what the contents should look like\n",
"\n",
" ```\n",
" workingDir: \"gs://code-search-demo/20181104\",\n",
" dataDir: \"gs://code-search-demo/20181104/data\",\n",
" project: \"code-search-demo\",\n",
" experiment: \"demo-trainer-11-07-dist-sync-gpu\",\n",
" ```"
]
},
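{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use an existing entry as a starting point, you can print the experiments file directly from this notebook. This is a minimal sketch assuming the repository was checked out next to this notebook as in the setup steps above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Show the currently defined experiments to use as a template for your own entry.\n",
"head -n 40 kubeflow/components/experiments.libsonnet"
]
},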
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-Processing Github Files\n",
"\n",
"In this step, we use [Google Cloud Dataflow](https://cloud.google.com/dataflow/) to preprocess the data.\n",
"\n",
"* We use a K8s Job to run a python program `code_search.dataflow.cli.preprocess_github_dataset` that submits the Dataflow job\n",
"* Once the job has been created it can be monitored using the Dataflow console\n",
"* The parameter **target_dataset** specifies a BigQuery dataset to write the data to"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the BigQuery dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"bq mk ${PROJECT}:${TARGET_DATASET}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit the Dataflow Job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"ks param set --env=code-search submit-preprocess-job targetDataset ${TARGET_DATASET}\n",
"ks apply code-search -c submit-preprocess-job"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When completed successfully, this should create a dataset in `BigQuery` named `target_dataset`. Additionally, it also dumps CSV files into `data_dir` which contain training samples (pairs of function and docstrings) for our Tensorflow Model. A representative set of results can be viewed using the following query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT * \n",
" FROM \n",
" {}.token_pairs\n",
" LIMIT\n",
" 10\n",
"\"\"\".format(TARGET_DATASET)\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pipeline also writes a set of CSV files which contain function and docstring pairs delimited by a comma. Here, we list a subset of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/*.csv | head -n ${LIMIT}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Dataset for Training\n",
"\n",
"We will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format.\n",
"\n",
"**TIP**: Use `ks show` to view the Resource Spec submitted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-datagen"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-datagen"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once this job finishes, the data directory should have a vocabulary file and a list of `TFRecords` prefixed by the problem name which in our case is `github_function_docstring_extended`. Here, we list a subset of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/vocab*\n",
"gsutil ls ${WORKING_DIR}/data/*train* | head -n ${LIMIT}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Execute Tensorflow Training\n",
"\n",
"Once, the `TFRecords` are generated, we will use `t2t-trainer` to execute the training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-trainer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-trainer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will generate TensorFlow model checkpoints which is illustrated below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/*ckpt*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export Tensorflow Model\n",
"\n",
"We now use `t2t-exporter` to export the `TFModel`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-exporter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-exporter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once completed, this will generate a TensorFlow `SavedModel` which we will further use for both online (via `TF Serving`) and offline inference (via `Kubeflow Batch Prediction`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/export/Servo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute Function Embeddings\n",
"\n",
"In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below."
]
},
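{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below prints the module's command-line arguments. It assumes the module exposes a standard argparse `--help`; adjust if your version of the code differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! python2 -m code_search.dataflow.cli.create_function_embeddings --help"
]
},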
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configuration\n",
"\n",
"First, select a Exported Model version from the `${WORKING_DIR}/output/export/Servo` as seen above. This should be name of a folder with UNIX Seconds Timestamp like `1533685294`. Below, we automatically do that by selecting the folder which represents the latest timestamp."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash --out EXPORT_DIR_LS\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE \"([0-9]+)/$\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# WARNING: This routine will fail if no export has been completed successfully.\n",
"MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\\n') if ts])\n",
"\n",
"# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
"%env MODEL_VERSION $MODEL_VERSION"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Modify experiments.libsonnet and set **modelDir** to the directory computed above"
]
},
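{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell is a small convenience helper (not part of the original pipeline): it prints the export directory computed above so you can paste it into the **modelDir** field of your experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convenience helper: print the value to copy into the modelDir field of your\n",
"# experiment in experiments.libsonnet. This assumes you want to serve the model\n",
"# exported above under ${WORKING_DIR}/output/export/Servo.\n",
"MODEL_DIR = '{0}/output/export/Servo/{1}'.format(WORKING_DIR, MODEL_VERSION)\n",
"print('modelDir: \"{0}\",'.format(MODEL_DIR))"
]
},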
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the Dataflow Job for Function Embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"ks apply code-search -c submit-code-embeddings-job "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each existing data sample available from the previous Dataflow Job. Additionally, it also dumps a CSV file containing metadata for each of the function and its embeddings. A representative query result is shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT * \n",
" FROM \n",
" {}.function_embeddings\n",
" LIMIT\n",
" 10\n",
"\"\"\".format(TARGET_DATASET)\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline also generates a set of CSV files which will be useful to generate the search index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/*index*.csv | head -n ${LIMIT}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Search Index\n",
"\n",
"We now create the Search Index from the computed embeddings. This facilitates k-Nearest Neighbor search to for semantically similar results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c search-index-creator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c search-index-creator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the CSV files generated from the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples for a human-readable reverse lookup during the query, is also created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/code_search_index*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deploy the Web App\n",
"\n",
"* The included web app provides a simple way for users to submit queries\n",
"* The web app includes two pieces\n",
"\n",
" * A Flask app that serves a simple UI for sending queries\n",
" \n",
" * The flask app also uses nmslib to provide fast lookups\n",
" \n",
" * A TF Serving instance to compute the embeddings for search queries\n",
"* The ksonnet components for the web app are in a separate ksonnet application ks-web-app\n",
"\n",
" * A second web app is used so that we can optionally use [ArgoCD](https://github.com/argoproj/argo-cd) to keep the serving components up to date."
]
},
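{
"cell_type": "markdown",
"metadata": {},
"source": [
"The commands in the following sections run against the `ks-web-app` ksonnet application. They assume it also has an environment named `code-search` pointing at this cluster; if you have not created one yet, the sketch below mirrors the environment setup done for the `kubeflow` application earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd ks-web-app\n",
"\n",
"# Point the web-app ksonnet application at the same cluster and namespace as above.\n",
"ks env add code-search\n",
"ks env set code-search --namespace=kubeflow"
]
},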
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy an Inference Server\n",
"\n",
"We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported Tensorflow model above using [Tensorflow Serving](https://www.tensorflow.org/serving/).\n",
"\n",
"* You need to set the parameter **modelBasePath** to the GCS directory where the model was exported\n",
"* This will be a directory produced by the export model step\n",
" * e.g gs://code-search-demo/models/20181107-dist-sync-gpu/export/\n",
"* Here are sample contents\n",
" ```\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/:\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/\n",
"\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/:\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/saved_model.pbtxt\n",
"\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/:\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.data-00000-of-00001\n",
" gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.index\n",
"\n",
" ```\n",
"\n",
"* TFServing expects the modelBasePath to consist of numeric subdirectories corresponding to different versions of the model.\n",
"* Each subdirectory will contain the saved model in protocol buffer along with the weights. "
]
},
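{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell references `${MODEL_BASE_PATH}`. The helper below derives it from the model exported earlier in this notebook; this is an assumption, so replace it with a different export directory if you are serving another model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convenience helper (an assumption, not part of the original pipeline): serve the\n",
"# model exported above. TF Serving expects this path to contain numeric version\n",
"# subdirectories such as 1541712907/.\n",
"MODEL_BASE_PATH = '{0}/output/export/Servo'.format(WORKING_DIR)\n",
"\n",
"# DO NOT MODIFY. Environment variable to be used in a bash shell.\n",
"%env MODEL_BASE_PATH $MODEL_BASE_PATH"
]
},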
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd ks-web-app\n",
"\n",
"ks param set --env=code-search modelBasePath ${MODEL_BASE_PATH}\n",
"ks show code-search -c query-embed-server"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd ks-web-app\n",
"\n",
"ks apply code-search -c query-embed-server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Search UI\n",
"\n",
"We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.\n",
"\n",
"* We need to configure the index server to use the lookup file and CSV produced by the create search index step above\n",
"* The values will be the values of the parameters **lookupFile** and **indexFile** that you set in experiments.libsonnet"
]
},
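{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell references `${LOOKUP_FILE}` and `${INDEX_FILE}`. The helper below points them at the artifacts produced by the search-index-creator job; the exact file names are assumptions, so verify them against the `gsutil ls` listing above or use the **lookupFile** and **indexFile** values from your experiment in experiments.libsonnet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convenience helper: point the index server at the search-index-creator output.\n",
"# NOTE: these file names are assumptions; verify them with the gsutil listing above\n",
"# or use the lookupFile/indexFile values from your experiment in experiments.libsonnet.\n",
"LOOKUP_FILE = '{0}/code_search_index.csv'.format(WORKING_DIR)\n",
"INDEX_FILE = '{0}/code_search_index.nmslib'.format(WORKING_DIR)\n",
"\n",
"# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
"%env LOOKUP_FILE $LOOKUP_FILE\n",
"%env INDEX_FILE $INDEX_FILE"
]
},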
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd ks-web-app\n",
"ks param set --env=code-search search-index-server lookupFile ${LOOKUP_FILE}\n",
"ks param set --env=code-search search-index-server indexFile ${INDEX_FILE}\n",
"ks show code-search -c search-index-server"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c search-index-server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The service should now be available at FQDN of the Kubeflow cluster at path `/code-search/`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}