From a687c51036d2a9aa4d45500e690a1bffcbde15e3 Mon Sep 17 00:00:00 2001 From: Sanyam Kapoor Date: Mon, 13 Aug 2018 21:43:26 -0700 Subject: [PATCH] Add a Jupyter notebook to be used for Kubeflow codelabs (#217) * Add a Jupyter notebook to be used for Kubeflow codelabs * Add help command for create_function_embeddings module * Update README to point to Jupyter Notebook * Add prerequisites to readme * Update README and getting started with notebook guide * [wip] * Update noebook with BigQuery previews * Update notebook to automatically select the latest MODEL_VERSION --- code_search/README.md | 252 +++------------ code_search/code-search.ipynb | 582 ++++++++++++++++++++++++++++++++++ 2 files changed, 633 insertions(+), 201 deletions(-) create mode 100644 code_search/code-search.ipynb diff --git a/code_search/README.md b/code_search/README.md index 038af3d7..5fb4f078 100644 --- a/code_search/README.md +++ b/code_search/README.md @@ -1,204 +1,54 @@ -# Semantic Code Search - -This demo implements End-to-End Semantic Code Search on Kubeflow. It is based on the public -Github Dataset hosted on BigQuery. - -## Setup - -### Prerequisites - -* Python 2.7 (with `pip`) -* Python `virtualenv` -* Node -* Docker -* Ksonnet - -**NOTE**: `Apache Beam` lacks `Python3` support and hence the version requirement. - -### Google Cloud Setup - -* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI - -* Setup Application Default Credentials -``` -$ gcloud auth application-default login -``` - -* Enable Dataflow via Command Line (or use the Google Cloud Console) -``` -$ gcloud services enable dataflow.googleapis.com -``` - -* Create a Google Cloud Project and Google Storage Bucket. - -* Authenticate with Google Container Registry to push Docker images -``` -$ gcloud auth configure-docker -``` - -See [Google Cloud Docs](https://cloud.google.com/docs/) for more. - -### Python Environment Setup - -This demo needs multiple Python versions and `virtualenv` is an easy way to -create isolated environments. - -``` -$ virtualenv -p $(which python2) env2.7 -``` - -This creates a `env2.7` environment folder. - -To use the environment, - -``` -$ source env2.7/bin/activate -``` - -See [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more. - -**NOTE**: The `env2.7` environment must be activated for all steps now onwards. - -### Python Dependencies - -To install dependencies, run the following commands - -``` -(env2.7) $ pip install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider -(env2.7) $ pip install src/ -``` - -This will install everything needed to run the demo code. - -### Node Dependencies - -``` -$ pushd ui && npm i && popd -``` - -### Build and Push Docker Images - -The `docker` directory contains Dockerfiles for each target application with its own `build.sh`. This is needed -to run the training jobs in Kubeflow cluster. - -To build the Docker image for training jobs - -``` -$ ./docker/t2t/build.sh -``` - -To build the Docker image for Code Search UI - -``` -$ ./docker/ui/build.sh -``` - -Optionally, to push these images to GCR, one must export the `PROJECT=` environment variable -and use the appropriate build script. - -See [GCR Pushing and Pulling Images](https://cloud.google.com/container-registry/docs/pushing-and-pulling) for more. - -# Pipeline - -## 1. Data Pre-processing - -This step takes in the public Github dataset and generates function and docstring token pairs. -Results are saved back into a BigQuery table. It is done via a `Dataflow` job. 
- -``` -(env2.7) $ export GCS_DIR=gs://kubeflow-examples/t2t-code-search -(env2.7) $ code-search-preprocess -r DataflowRunner -o code_search:function_docstrings \ - -p kubeflow-dev -j process-github-archive --storage-bucket ${GCS_DIR} \ - --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16 -``` - -## 2. Model Training - -We use `tensor2tensor` to train our model. - -``` -(env2.7) $ t2t-trainer --generate_data --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \ - --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \ - --train_steps=100 --eval_steps=10 \ - --t2t_usr_dir=src/code_search/t2t -``` - -A `Dockerfile` based on Tensorflow is provided along which has all the dependencies for this part of the pipeline. -By default, it is based off Tensorflow CPU 1.8.0 for `Python3` but can be overridden in the Docker image build. -This script builds and pushes the docker image to Google Container Registry. - -## 3. Model Export - -We use `t2t-exporter` to export our trained model above into the TensorFlow `SavedModel` format. - -``` -(env2.7) $ t2t-exporter --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \ - --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \ - --t2t_usr_dir=src/code_search/t2t -``` - -## 4. Batch Prediction for Code Embeddings - -We run another `Dataflow` pipeline to use the exported model above and get a high-dimensional embedding of each of -our code example. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of -a folder at path `${GCS_DIR}/output/export/Servo` - -``` -(env2.7) $ export MODEL_VERSION= -``` - -Now, start the job, - -``` -(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION} -(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \ - --data-dir "${GCS_DIR}/data" --saved-model-dir "${SAVED_MODEL_DIR}" - -p kubeflow-dev -j batch-predict-github-archive --storage-bucket ${GCS_DIR} \ - --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16 -``` - -## 5. Create an NMSLib Index - -Using the above embeddings, we will now create an NMSLib index which will serve as our search index for -new incoming queries. - - -``` -(env2.7) $ export INDEX_FILE= # TODO(sanyamkapoor): Add the index file -(env2.7) $ nmslib-create --data-file=${EMBEDDINGS_FILE} --index-file=${INDEX_FILE} -``` - - -## 6. Run a TensorFlow Serving container - -This will start a TF Serving container using the model export above and export it at port 8501. - -``` -$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \ - --rest_api_port=8501 --model_name=t2t_code_search --model_base_path=${GCS_DIR}/output/export/Servo -``` - -## 7. Serve the Search Engine - -We will now serve the search engine via a simple REST interface - -``` -(env2.7) $ nmslib-serve --serving-url=http://localhost:8501/v1/models/t2t_code_search:predict \ - --problem=github_function_docstring --data-dir=${GCS_DIR}/data --index-file=${INDEX_FILE} -``` - -## 8. Serve the UI - -This will serve as the graphical wrapper on top of the REST search engine started in the previous step. - -``` -$ pushd ui && npm run build && popd -$ serve -s ui/build -``` - -# Pipeline on Kubeflow - -TODO +# Code Search on Kubeflow + +This demo implements End-to-End Code Search on Kubeflow. 
+
+# Prerequisites
+
+**NOTE**: If using the JupyterHub Spawner on a Kubeflow cluster, use the Docker image
+`gcr.io/kubeflow-images-public/kubeflow-codelab-notebook` which has all the prerequisites baked in.
+
+* `Kubeflow Latest`
+  This notebook assumes a Kubeflow cluster is already deployed. See
+  [Getting Started with Kubeflow](https://www.kubeflow.org/docs/started/getting-started/).
+
+* `Python 2.7` (bundled with `pip`)
+  For this demo, we will use Python 2.7. This restriction is due to [Apache Beam](https://beam.apache.org/),
+  which does not support Python 3 yet (see [BEAM-1251](https://issues.apache.org/jira/browse/BEAM-1251)).
+
+* `Google Cloud SDK`
+  This example will use tools from the [Google Cloud SDK](https://cloud.google.com/sdk/). The SDK
+  must be authenticated and authorized. See
+  [Authentication Overview](https://cloud.google.com/docs/authentication/).
+
+* `Ksonnet 0.12`
+  We use [Ksonnet](https://ksonnet.io/) to write Kubernetes jobs in a declarative manner to be run
+  on top of Kubeflow.
+
+# Getting Started
+
+To get started, follow the instructions below.
+
+**NOTE**: We will assume that the Kubeflow cluster is available at `kubeflow.example.com`. Make sure
+you replace this with the true FQDN of your Kubeflow cluster in any subsequent instructions.
+
+* Spawn a new JupyterLab instance inside the Kubeflow cluster by pointing your browser to
+  **https://kubeflow.example.com/hub** and clicking "**Start My Server**".
+
+* In the **Image** text field, enter `gcr.io/kubeflow-images-public/kubeflow-codelab-notebook:v20180808-v0.2-22-gcfdcb12`.
+  This image contains all the prerequisites needed for the demo.
+
+* Once spawned, you should be redirected to the notebooks UI. We intend to go to the JupyterLab home
+  page, which is available at **https://kubeflow.example.com/user/<username>/lab**.
+  **TIP**: Simply point the browser to **/lab** instead of the **/tree** path in the URL.
+
+* Spawn a new Terminal and run
+  ```
+  $ git clone --branch=master --depth=1 https://github.com/kubeflow/examples
+  ```
+  This will create an `examples` folder. It is safe to close the terminal now.
+
+* Refresh the File Explorer (typically to the left) and navigate to `examples/code_search`. Open
+  the Jupyter notebook `code-search.ipynb` and follow along.
 
 # Acknowledgements
 
diff --git a/code_search/code-search.ipynb b/code_search/code-search.ipynb
new file mode 100644
index 00000000..374829c5
--- /dev/null
+++ b/code_search/code-search.ipynb
@@ -0,0 +1,582 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "-"
+    }
+   },
+   "source": [
+    "# Code Search on Kubeflow\n",
+    "\n",
+    "This notebook implements an end-to-end Semantic Code Search system on top of [Kubeflow](https://www.kubeflow.org/) - given an input query string, it returns a list of code snippets semantically similar to the query string.\n",
+    "\n",
+    "**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install dependencies\n",
+    "\n",
+    "Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while and only needs to be run once."
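+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before installing anything, here is a quick optional sanity check: the cell below only verifies that `python2` and `pip2` are available, since every step in this notebook assumes Python 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "# Optional sanity check: all steps in this notebook assume Python 2.\n",
+    "python2 --version\n",
+    "pip2 --version"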
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# FIXME(sanyamkapoor): The Kubeflow Batch Prediction dependency is installed from a fork for reasons in\n", + "# kubeflow/batch-predict#9 and corresponding issue kubeflow/batch-predict#10\n", + "! pip2 install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider\n", + "\n", + "! pip2 install -r src/requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Only for BigQuery cells\n", + "! pip2 install pandas-gbq" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pandas.io import gbq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure Variables\n", + "\n", + "This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Configuration Variables. Modify as desired.\n", + "\n", + "PROJECT = 'kubeflow-dev'\n", + "CLUSTER_NAME = 'kubeflow-latest'\n", + "CLUSTER_REGION = 'us-east1-d'\n", + "CLUSTER_NAMESPACE = 'kubeflow-latest'\n", + "\n", + "TARGET_DATASET = 'code_search'\n", + "WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/20180813'\n", + "WORKER_MACHINE_TYPE = 'n1-highcpu-32'\n", + "NUM_WORKERS = 16\n", + "\n", + "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n", + "%env PROJECT $PROJECT\n", + "%env CLUSTER_NAME $CLUSTER_NAME\n", + "%env CLUSTER_REGION $CLUSTER_REGION\n", + "%env CLUSTER_NAMESPACE $CLUSTER_NAMESPACE\n", + "\n", + "%env TARGET_DATASET $TARGET_DATASET\n", + "%env WORKING_DIR $WORKING_DIR\n", + "%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE\n", + "%env NUM_WORKERS $NUM_WORKERS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup Authorization\n", + "\n", + "In a Kubeflow cluster, we already have the key credentials available with each pod and will re-use them to authenticate. This will allow us to submit `TFJob`s and execute `Dataflow` pipelines. We also set the new context for the Code Search Ksonnet application." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "# Activate Service Account provided by Kubeflow.\n", + "gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}\n", + "\n", + "# Get KUBECONFIG for the desired cluster.\n", + "gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}\n", + "\n", + "# Set the namespace of the context.\n", + "kubectl config set contexts.$(kubectl config current-context).namespace ${CLUSTER_NAMESPACE}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup Ksonnet Application\n", + "\n", + "This will use the context we've set above and provide it as a new environment to the Ksonnet application." 
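+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, print the `kubectl` context that will be wired into the new Ksonnet environment below; this is just a quick check that it points at the intended cluster and namespace."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "# Show the context that `ks env add` below will bind to the new environment.\n",
+    "kubectl config current-context"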
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "cd kubeflow\n", + "\n", + "ks env add code-search --context=$(kubectl config current-context)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Verify Version Information" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "echo \"Pip Version Info: \" && pip2 --version && echo\n", + "echo \"Google Cloud SDK Info: \" && gcloud --version && echo\n", + "echo \"Ksonnet Version Info: \" && ks version && echo\n", + "echo \"Kubectl Version Info: \" && kubectl version" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## View Github Files\n", + "\n", + "This is the query that is run as the first step of the Pre-Processing pipeline and is sent through a set of transformations. This is illustrative of the rows being processed in the pipeline we trigger next." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT\n", + " MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,\n", + " c.content\n", + " FROM\n", + " `bigquery-public-data.github_repos.files` AS f\n", + " JOIN\n", + " `bigquery-public-data.github_repos.contents` AS c\n", + " ON\n", + " f.id = c.id\n", + " JOIN (\n", + " --this part of the query makes sure repo is watched at least twice since 2017\n", + " SELECT\n", + " repo\n", + " FROM (\n", + " SELECT\n", + " repo.name AS repo\n", + " FROM\n", + " `githubarchive.year.2017`\n", + " WHERE\n", + " type=\"WatchEvent\"\n", + " UNION ALL\n", + " SELECT\n", + " repo.name AS repo\n", + " FROM\n", + " `githubarchive.month.2018*`\n", + " WHERE\n", + " type=\"WatchEvent\" )\n", + " GROUP BY\n", + " 1\n", + " HAVING\n", + " COUNT(*) >= 2 ) AS r\n", + " ON\n", + " f.repo_name = r.repo\n", + " WHERE\n", + " f.path LIKE '%.py' AND --with python extension\n", + " c.size < 15000 AND --get rid of ridiculously long files\n", + " REGEXP_CONTAINS(c.content, r'def ') --contains function definition\n", + " GROUP BY\n", + " c.content\n", + " LIMIT\n", + " 10\n", + "\"\"\"\n", + "\n", + "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pre-Processing Github Files\n", + "\n", + "In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "cd src\n", + "\n", + "python2 -m code_search.dataflow.cli.preprocess_github_dataset -h" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run the Dataflow Job for Pre-Processing\n", + "\n", + "See help above for a short description of each argument. The values are being taken from environment variables defined earlier." 
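+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, echo the configuration values (set via `%env` earlier) that the job below will use; this is just a quick confirmation before launching a long-running Dataflow pipeline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "# These environment variables were set in the configuration cell above.\n",
+    "echo PROJECT=${PROJECT}\n",
+    "echo TARGET_DATASET=${TARGET_DATASET}\n",
+    "echo WORKING_DIR=${WORKING_DIR}\n",
+    "echo WORKER_MACHINE_TYPE=${WORKER_MACHINE_TYPE}\n",
+    "echo NUM_WORKERS=${NUM_WORKERS}"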
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd src\n",
+    "\n",
+    "JOB_NAME=\"preprocess-github-dataset-$(date +'%Y%m%d-%H%M%S')\"\n",
+    "\n",
+    "python2 -m code_search.dataflow.cli.preprocess_github_dataset \\\n",
+    "  --runner DataflowRunner \\\n",
+    "  --project \"${PROJECT}\" \\\n",
+    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
+    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
+    "  --job_name \"${JOB_NAME}\" \\\n",
+    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
+    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
+    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
+    "  --num_workers \"${NUM_WORKERS}\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When completed successfully, this should create a `BigQuery` dataset with the name given by `target_dataset`. Additionally, it dumps CSV files into `data_dir` which contain training samples (pairs of functions and docstrings) for our TensorFlow model. A representative set of results can be viewed using the following query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"\"\"\n",
+    "  SELECT * \n",
+    "  FROM \n",
+    "    {}.token_pairs\n",
+    "  LIMIT\n",
+    "    10\n",
+    "\"\"\".format(TARGET_DATASET)\n",
+    "\n",
+    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prepare Dataset for Training\n",
+    "\n",
+    "In this step, we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd kubeflow\n",
+    "\n",
+    "ks apply code-search -c t2t-code-search-datagen"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Execute TensorFlow Training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd kubeflow\n",
+    "\n",
+    "ks apply code-search -c t2t-code-search-trainer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Export TensorFlow Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd kubeflow\n",
+    "\n",
+    "ks apply code-search -c t2t-code-search-exporter"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Compute Function Embeddings\n",
+    "\n",
+    "In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd src\n",
+    "\n",
+    "python2 -m code_search.dataflow.cli.create_function_embeddings -h"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Configuration\n",
+    "\n",
+    "First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. Each version is the name of a folder, a UNIX timestamp in seconds like `1533685294`. Below, we automatically select the folder with the latest timestamp."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash --out EXPORT_DIR_LS\n",
+    "\n",
+    "gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE \"([0-9]+)/$\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\\n') if ts])\n",
+    "\n",
+    "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
+    "%env MODEL_VERSION $MODEL_VERSION"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run the Dataflow Job for Function Embeddings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd src\n",
+    "\n",
+    "python2 -m code_search.dataflow.cli.create_function_embeddings \\\n",
+    "  --runner DataflowRunner \\\n",
+    "  --project \"${PROJECT}\" \\\n",
+    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
+    "  --problem github_function_docstring \\\n",
+    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
+    "  --saved_model_dir \"${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}\" \\\n",
+    "  --job_name compute-function-embeddings \\\n",
+    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
+    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
+    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
+    "  --num_workers \"${NUM_WORKERS}\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each data sample produced by the previous Dataflow job. Additionally, it dumps a CSV file containing metadata for each function and its embeddings. A representative query result is shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"\"\"\n",
+    "  SELECT * \n",
+    "  FROM \n",
+    "    {}.function_embeddings\n",
+    "  LIMIT\n",
+    "    10\n",
+    "\"\"\".format(TARGET_DATASET)\n",
+    "\n",
+    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create Search Index\n",
+    "\n",
+    "We now create the Search Index from the computed embeddings so that, at query time, we can do a k-Nearest Neighbor search and return semantically similar results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "cd kubeflow\n",
+    "\n",
+    "ks apply code-search -c search-index-creator"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples, used for a human-readable reverse lookup at query time, is also created in the `WORKING_DIR`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Deploy an Inference Server\n",
+    "\n",
+    "We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model above using [TensorFlow Serving](https://www.tensorflow.org/serving/)."
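+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before deploying, it may be worth confirming that the exported `SavedModel` selected earlier exists at the path the serving component is expected to load; the listing below is only a sanity check and assumes `MODEL_VERSION` was set in the previous section."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "# Sanity check: list the exported SavedModel that the serving component is expected to load.\n",
+    "gsutil ls ${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}"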
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "cd kubeflow\n", + "\n", + "ks apply code-search -c t2t-code-search-serving" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deploy Search UI\n", + "\n", + "We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "cd kubeflow\n", + "\n", + "ks apply code-search -c search-index-server" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The service should now be available at FQDN of the Kubeflow cluster at path `/code-search/`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.15" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}