mirror of https://github.com/kubeflow/examples.git

Add a Jupyter notebook to be used for Kubeflow codelabs (#217)

* Add a Jupyter notebook to be used for Kubeflow codelabs
* Add help command for create_function_embeddings module
* Update README to point to Jupyter Notebook
* Add prerequisites to readme
* Update README and getting started with notebook guide
* [wip]
* Update notebook with BigQuery previews
* Update notebook to automatically select the latest MODEL_VERSION

This commit is contained in:
parent a80c15b50e
commit a687c51036

@@ -1,204 +1,54 @@
# Semantic Code Search

This demo implements end-to-end Semantic Code Search on Kubeflow. It is based on the public
GitHub dataset hosted on BigQuery.

## Setup

### Prerequisites

* Python 2.7 (with `pip`)
* Python `virtualenv`
* Node
* Docker
* Ksonnet

**NOTE**: `Apache Beam` lacks `Python3` support, hence the Python 2.7 requirement.

### Google Cloud Setup

* Install the [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI

* Set up Application Default Credentials
```
$ gcloud auth application-default login
```

* Enable Dataflow via the command line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```

* Create a Google Cloud Project and a Google Cloud Storage bucket.

* Authenticate with Google Container Registry to push Docker images
```
$ gcloud auth configure-docker
```

See the [Google Cloud Docs](https://cloud.google.com/docs/) for more.

### Python Environment Setup

This demo needs multiple Python versions, and `virtualenv` is an easy way to
create isolated environments.

```
$ virtualenv -p $(which python2) env2.7
```

This creates an `env2.7` environment folder.

To use the environment,

```
$ source env2.7/bin/activate
```

See the [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more.

**NOTE**: The `env2.7` environment must be activated for all steps from now on.

### Python Dependencies

To install dependencies, run the following commands

```
(env2.7) $ pip install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider
(env2.7) $ pip install src/
```

This will install everything needed to run the demo code.

### Node Dependencies

```
$ pushd ui && npm i && popd
```

### Build and Push Docker Images

The `docker` directory contains a Dockerfile for each target application, each with its own `build.sh`. These images
are needed to run the training jobs in the Kubeflow cluster.

To build the Docker image for training jobs

```
$ ./docker/t2t/build.sh
```

To build the Docker image for the Code Search UI

```
$ ./docker/ui/build.sh
```

Optionally, to push these images to GCR, export the `PROJECT=<my project name>` environment variable
and use the appropriate build script.
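For example, a minimal sketch of building and pushing the training image, assuming the build script reads the `PROJECT` variable to tag the image for GCR (replace the placeholder with your own project name):

```
$ export PROJECT=<my project name>
$ ./docker/t2t/build.sh
```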
See [GCR Pushing and Pulling Images](https://cloud.google.com/container-registry/docs/pushing-and-pulling) for more.

# Pipeline

## 1. Data Pre-processing

This step takes in the public GitHub dataset and generates function and docstring token pairs.
Results are saved back into a BigQuery table. It is done via a `Dataflow` job.

```
(env2.7) $ export GCS_DIR=gs://kubeflow-examples/t2t-code-search
(env2.7) $ code-search-preprocess -r DataflowRunner -o code_search:function_docstrings \
            -p kubeflow-dev -j process-github-archive --storage-bucket ${GCS_DIR} \
            --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```

## 2. Model Training

We use `tensor2tensor` to train our model.

```
(env2.7) $ t2t-trainer --generate_data --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
            --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
            --train_steps=100 --eval_steps=10 \
            --t2t_usr_dir=src/code_search/t2t
```

A `Dockerfile` based on TensorFlow is provided which has all the dependencies for this part of the pipeline.
By default, it is based on TensorFlow CPU 1.8.0 for `Python3`, but this can be overridden in the Docker image build.
The `build.sh` script described above builds this Docker image and pushes it to Google Container Registry.

## 3. Model Export

We use `t2t-exporter` to export the trained model above into the TensorFlow `SavedModel` format.

```
(env2.7) $ t2t-exporter --problem=github_function_docstring --model=similarity_transformer --hparams_set=transformer_tiny \
            --data_dir=${GCS_DIR}/data --output_dir=${GCS_DIR}/output \
            --t2t_usr_dir=src/code_search/t2t
```

## 4. Batch Prediction for Code Embeddings

We run another `Dataflow` pipeline that uses the exported model above to compute a high-dimensional embedding for each of
our code examples. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of
a folder at the path `${GCS_DIR}/output/export/Servo`.

```
(env2.7) $ export MODEL_VERSION=<put_unix_timestamp_here>
```
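If you are unsure which timestamps are available, you can list the export directory (assuming `gsutil` is installed and authenticated):

```
$ gsutil ls ${GCS_DIR}/output/export/Servo
```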
Now, start the job,

```
(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \
            --data-dir "${GCS_DIR}/data" --saved-model-dir "${SAVED_MODEL_DIR}" \
            -p kubeflow-dev -j batch-predict-github-archive --storage-bucket ${GCS_DIR} \
            --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```

## 5. Create an NMSLib Index

Using the embeddings above, we will now create an NMSLib index which will serve as our search index for
new incoming queries.

```
(env2.7) $ export INDEX_FILE= # TODO(sanyamkapoor): Add the index file
(env2.7) $ nmslib-create --data-file=${EMBEDDINGS_FILE} --index-file=${INDEX_FILE}
```

## 6. Run a TensorFlow Serving container

This will start a TF Serving container using the model export above and expose it on port 8501.

```
$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \
    --rest_api_port=8501 --model_name=t2t_code_search --model_base_path=${GCS_DIR}/output/export/Servo
```
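Optionally, as a quick sanity check that the server is up, you can query the TensorFlow Serving REST status endpoint for the model name passed above:

```
$ curl http://localhost:8501/v1/models/t2t_code_search
```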
## 7. Serve the Search Engine

We will now serve the search engine via a simple REST interface.

```
(env2.7) $ nmslib-serve --serving-url=http://localhost:8501/v1/models/t2t_code_search:predict \
            --problem=github_function_docstring --data-dir=${GCS_DIR}/data --index-file=${INDEX_FILE}
```

## 8. Serve the UI

This serves as the graphical wrapper on top of the REST search engine started in the previous step.

```
$ pushd ui && npm run build && popd
$ serve -s ui/build
```

# Pipeline on Kubeflow

TODO
# Code Search on Kubeflow

This demo implements end-to-end Code Search on Kubeflow.

# Prerequisites

**NOTE**: If using the JupyterHub Spawner on a Kubeflow cluster, use the Docker image
`gcr.io/kubeflow-images-public/kubeflow-codelab-notebook`, which has all the prerequisites baked in.

* `Kubeflow Latest`
  This notebook assumes a Kubeflow cluster is already deployed. See
  [Getting Started with Kubeflow](https://www.kubeflow.org/docs/started/getting-started/).

* `Python 2.7` (bundled with `pip`)
  For this demo, we will use Python 2.7. This restriction is due to [Apache Beam](https://beam.apache.org/),
  which does not support Python 3 yet (see [BEAM-1251](https://issues.apache.org/jira/browse/BEAM-1251)).

* `Google Cloud SDK`
  This example uses tools from the [Google Cloud SDK](https://cloud.google.com/sdk/). The SDK
  must be authenticated and authorized; see the example after this list and the
  [Authentication Overview](https://cloud.google.com/docs/authentication/).

* `Ksonnet 0.12`
  We use [Ksonnet](https://ksonnet.io/) to write Kubernetes jobs in a declarative manner to be run
  on top of Kubeflow.
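One way to authenticate the SDK from a terminal, as referenced in the prerequisites above (a minimal sketch; replace the placeholder with your own project ID):

```
$ gcloud auth login
$ gcloud auth application-default login
$ gcloud config set project <my project id>
```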
# Getting Started

To get started, follow the instructions below.

**NOTE**: We will assume that the Kubeflow cluster is available at `kubeflow.example.com`. Make sure
you replace this with the true FQDN of your Kubeflow cluster in any subsequent instructions.

* Spawn a new JupyterLab instance inside the Kubeflow cluster by pointing your browser to
  **https://kubeflow.example.com/hub** and clicking "**Start My Server**".

* In the **Image** text field, enter `gcr.io/kubeflow-images-public/kubeflow-codelab-notebook:v20180808-v0.2-22-gcfdcb12`.
  This image contains all the prerequisites needed for the demo.

* Once spawned, you should be redirected to the notebooks UI. We intend to go to the JupyterLab home
  page, which is available at **https://kubeflow.example.com/user/<ACCOUNT_NAME>/lab**.
  **TIP**: Simply point the browser to the **/lab** path instead of the **/tree** path in the URL.

* Spawn a new Terminal and run
  ```
  $ git clone --branch=master --depth=1 https://github.com/kubeflow/examples
  ```
  This will create an `examples` folder. It is safe to close the terminal now.

* Refresh the File Explorer (typically to the left) and navigate to `examples/code_search`. Open
  the Jupyter notebook `code-search.ipynb` and follow along.

# Acknowledgements

@@ -0,0 +1,582 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# Code Search on Kubeflow\n",
    "\n",
    "This notebook implements an end-to-end Semantic Code Search on top of [Kubeflow](https://www.kubeflow.org/) - given an input query string, get a list of code snippets semantically similar to the query string.\n",
    "\n",
    "**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install dependencies\n",
    "\n",
    "Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while and only needs to be run once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FIXME(sanyamkapoor): The Kubeflow Batch Prediction dependency is installed from a fork for reasons in\n",
    "# kubeflow/batch-predict#9 and corresponding issue kubeflow/batch-predict#10\n",
    "! pip2 install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider\n",
    "\n",
    "! pip2 install -r src/requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Only for BigQuery cells\n",
    "! pip2 install pandas-gbq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas.io import gbq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure Variables\n",
    "\n",
    "This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration Variables. Modify as desired.\n",
    "\n",
    "PROJECT = 'kubeflow-dev'\n",
    "CLUSTER_NAME = 'kubeflow-latest'\n",
    "CLUSTER_REGION = 'us-east1-d'\n",
    "CLUSTER_NAMESPACE = 'kubeflow-latest'\n",
    "\n",
    "TARGET_DATASET = 'code_search'\n",
    "WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/20180813'\n",
    "WORKER_MACHINE_TYPE = 'n1-highcpu-32'\n",
    "NUM_WORKERS = 16\n",
    "\n",
    "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
    "%env PROJECT $PROJECT\n",
    "%env CLUSTER_NAME $CLUSTER_NAME\n",
    "%env CLUSTER_REGION $CLUSTER_REGION\n",
    "%env CLUSTER_NAMESPACE $CLUSTER_NAMESPACE\n",
    "\n",
    "%env TARGET_DATASET $TARGET_DATASET\n",
    "%env WORKING_DIR $WORKING_DIR\n",
    "%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE\n",
    "%env NUM_WORKERS $NUM_WORKERS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Authorization\n",
    "\n",
    "In a Kubeflow cluster, we already have the key credentials available with each pod and will re-use them to authenticate. This will allow us to submit `TFJob`s and execute `Dataflow` pipelines. We also set the new context for the Code Search Ksonnet application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Activate Service Account provided by Kubeflow.\n",
    "gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}\n",
    "\n",
    "# Get KUBECONFIG for the desired cluster.\n",
    "gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}\n",
    "\n",
    "# Set the namespace of the context.\n",
    "kubectl config set contexts.$(kubectl config current-context).namespace ${CLUSTER_NAMESPACE}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Ksonnet Application\n",
    "\n",
    "This will use the context we've set above and provide it as a new environment to the Ksonnet application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks env add code-search --context=$(kubectl config current-context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify Version Information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "echo \"Pip Version Info: \" && pip2 --version && echo\n",
    "echo \"Google Cloud SDK Info: \" && gcloud --version && echo\n",
    "echo \"Ksonnet Version Info: \" && ks version && echo\n",
    "echo \"Kubectl Version Info: \" && kubectl version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View Github Files\n",
    "\n",
    "This is the query that is run as the first step of the Pre-Processing pipeline and is sent through a set of transformations. This is illustrative of the rows being processed in the pipeline we trigger next."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT\n",
    "    MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,\n",
    "    c.content\n",
    "  FROM\n",
    "    `bigquery-public-data.github_repos.files` AS f\n",
    "  JOIN\n",
    "    `bigquery-public-data.github_repos.contents` AS c\n",
    "  ON\n",
    "    f.id = c.id\n",
    "  JOIN (\n",
    "    --this part of the query makes sure repo is watched at least twice since 2017\n",
    "    SELECT\n",
    "      repo\n",
    "    FROM (\n",
    "      SELECT\n",
    "        repo.name AS repo\n",
    "      FROM\n",
    "        `githubarchive.year.2017`\n",
    "      WHERE\n",
    "        type=\"WatchEvent\"\n",
    "      UNION ALL\n",
    "      SELECT\n",
    "        repo.name AS repo\n",
    "      FROM\n",
    "        `githubarchive.month.2018*`\n",
    "      WHERE\n",
    "        type=\"WatchEvent\" )\n",
    "    GROUP BY\n",
    "      1\n",
    "    HAVING\n",
    "      COUNT(*) >= 2 ) AS r\n",
    "  ON\n",
    "    f.repo_name = r.repo\n",
    "  WHERE\n",
    "    f.path LIKE '%.py' AND --with python extension\n",
    "    c.size < 15000 AND --get rid of ridiculously long files\n",
    "    REGEXP_CONTAINS(c.content, r'def ') --contains function definition\n",
    "  GROUP BY\n",
    "    c.content\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\"\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-Processing Github Files\n",
    "\n",
    "In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.preprocess_github_dataset -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Dataflow Job for Pre-Processing\n",
    "\n",
    "See help above for a short description of each argument. The values are being taken from environment variables defined earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "JOB_NAME=\"preprocess-github-dataset-$(date +'%Y%m%d-%H%M%S')\"\n",
    "\n",
    "python2 -m code_search.dataflow.cli.preprocess_github_dataset \\\n",
    "  --runner DataflowRunner \\\n",
    "  --project \"${PROJECT}\" \\\n",
    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
    "  --job_name \"${JOB_NAME}\" \\\n",
    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
    "  --num_workers \"${NUM_WORKERS}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When completed successfully, this should create a dataset in `BigQuery` named `target_dataset`. Additionally, it also dumps CSV files into `data_dir` which contain training samples (pairs of function and docstrings) for our Tensorflow Model. A representative set of results can be viewed using the following query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT *\n",
    "  FROM\n",
    "    {}.token_pairs\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\".format(TARGET_DATASET)\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Dataset for Training\n",
    "\n",
    "In this step we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-datagen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute Tensorflow Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export Tensorflow Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-exporter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Function Embeddings\n",
    "\n",
    "In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.create_function_embeddings -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configuration\n",
    "\n",
    "First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. This should be the name of a folder with a UNIX seconds timestamp, like `1533685294`. Below, we do that automatically by selecting the folder that represents the latest timestamp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash --out EXPORT_DIR_LS\n",
    "\n",
    "gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE \"([0-9]+)/$\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\\n') if ts])\n",
    "\n",
    "# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
    "%env MODEL_VERSION $MODEL_VERSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Dataflow Job for Function Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd src\n",
    "\n",
    "python2 -m code_search.dataflow.cli.create_function_embeddings \\\n",
    "  --runner DataflowRunner \\\n",
    "  --project \"${PROJECT}\" \\\n",
    "  --target_dataset \"${TARGET_DATASET}\" \\\n",
    "  --problem github_function_docstring \\\n",
    "  --data_dir \"${WORKING_DIR}/data\" \\\n",
    "  --saved_model_dir \"${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}\" \\\n",
    "  --job_name compute-function-embeddings \\\n",
    "  --temp_location \"${WORKING_DIR}/data/dataflow/temp\" \\\n",
    "  --staging_location \"${WORKING_DIR}/data/dataflow/staging\" \\\n",
    "  --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
    "  --num_workers \"${NUM_WORKERS}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each existing data sample available from the previous Dataflow Job. Additionally, it also dumps a CSV file containing metadata for each function and its embedding. A representative query result is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "  SELECT *\n",
    "  FROM\n",
    "    {}.function_embeddings\n",
    "  LIMIT\n",
    "    10\n",
    "\"\"\".format(TARGET_DATASET)\n",
    "\n",
    "gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Search Index\n",
    "\n",
    "We now create the Search Index from the computed embeddings so that during a query we can do a k-Nearest Neighbor search to give out semantically similar results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c search-index-creator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples, used for a human-readable reverse lookup at query time, is also created in the `WORKING_DIR`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy an Inference Server\n",
    "\n",
    "We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported Tensorflow model above using [Tensorflow Serving](https://www.tensorflow.org/serving/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c t2t-code-search-serving"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy Search UI\n",
    "\n",
    "We finally deploy the Search UI, which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd kubeflow\n",
    "\n",
    "ks apply code-search -c search-index-server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The service should now be available at the FQDN of the Kubeflow cluster at the path `/code-search/`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}