examples/code_search/code-search.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Code Search on Kubeflow\n",
"\n",
"This notebook implements an end-to-end Semantic Code Search on top of [Kubeflow](https://www.kubeflow.org/) - given an input query string, get a list of code snippets semantically similar to the query string.\n",
"\n",
"**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook,."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies\n",
"\n",
"Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while the first time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify Version Information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"echo \"Pip Version Info: \" && python2 --version && python2 -m pip --version && echo\n",
"echo \"Google Cloud SDK Info: \" && gcloud --version && echo\n",
"echo \"Ksonnet Version Info: \" && ks version && echo\n",
"echo \"Kubectl Version Info: \" && kubectl version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install Pip Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"! python2 -m pip install -U pip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Code Search dependencies\n",
"! python2 -m pip install --user https://github.com/kubeflow/batch-predict/tarball/master\n",
"! python2 -m pip install --user -r src/requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# BigQuery Cell Dependencies\n",
"! python2 -m pip install --user pandas-gbq"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.\n",
"from pandas.io import gbq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Variables\n",
"\n",
"This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Configuration Variables. Modify as desired.\n",
"\n",
"PROJECT = 'kubeflow-dev'\n",
"\n",
"# Dataflow Related Variables.\n",
"TARGET_DATASET = 'code_search'\n",
"WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/notebook-demo'\n",
"WORKER_MACHINE_TYPE = 'n1-highcpu-32'\n",
"NUM_WORKERS = 16\n",
"\n",
"# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
"%env PROJECT $PROJECT\n",
"%env TARGET_DATASET $TARGET_DATASET\n",
"%env WORKING_DIR $WORKING_DIR\n",
"%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE\n",
"%env NUM_WORKERS $NUM_WORKERS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup Authorization\n",
"\n",
"In a Kubeflow cluster on GKE, we already have the Google Application Credentials mounted onto each Pod. We can simply point `gcloud` to activate that service account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Activate Service Account provided by Kubeflow.\n",
"gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, to interact with the underlying cluster, we configure `kubectl`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"kubectl config set-cluster kubeflow --server=https://kubernetes.default --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n",
"kubectl config set-credentials jupyter --token \"$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\"\n",
"kubectl config set-context kubeflow --cluster kubeflow --user jupyter\n",
"kubectl config use-context kubeflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Collectively, these allow us to interact with Google Cloud Services as well as the Kubernetes Cluster directly to submit `TFJob`s and execute `Dataflow` pipelines."
]
},
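{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can confirm that both credentials are active. The exact output depends on the project and the service account configured for your cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Optional sanity check: print the active gcloud configuration and the current kubectl context.\n",
"gcloud config list\n",
"kubectl config current-context"
]
},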
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup Ksonnet Application\n",
"\n",
"We now point the Ksonnet application to the underlying Kubernetes cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"# Update Ksonnet to point to the Kubernetes Cluster\n",
"ks env add code-search --context $(kubectl config current-context)\n",
"\n",
"# Update the Working Directory of the application\n",
"sed -i'' \"s,gs://example/prefix,${WORKING_DIR},\" components/params.libsonnet\n",
"\n",
"# FIXME(sanyamkapoor): This command completely replaces previous configurations.\n",
"# Hence, using string replacement in file.\n",
"# ks param set t2t-code-search workingDir ${WORKING_DIR}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Github Files\n",
"\n",
"This is the query that is run as the first step of the Pre-Processing pipeline and is sent through a set of transformations. This is illustrative of the rows being processed in the pipeline we trigger next.\n",
"\n",
"**WARNING**: The table is large and the query can take a few minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT\n",
" MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,\n",
" c.content\n",
" FROM\n",
" `bigquery-public-data.github_repos.files` AS f\n",
" JOIN\n",
" `bigquery-public-data.github_repos.contents` AS c\n",
" ON\n",
" f.id = c.id\n",
" JOIN (\n",
" --this part of the query makes sure repo is watched at least twice since 2017\n",
" SELECT\n",
" repo\n",
" FROM (\n",
" SELECT\n",
" repo.name AS repo\n",
" FROM\n",
" `githubarchive.year.2017`\n",
" WHERE\n",
" type=\"WatchEvent\"\n",
" UNION ALL\n",
" SELECT\n",
" repo.name AS repo\n",
" FROM\n",
" `githubarchive.month.2018*`\n",
" WHERE\n",
" type=\"WatchEvent\" )\n",
" GROUP BY\n",
" 1\n",
" HAVING\n",
" COUNT(*) >= 2 ) AS r\n",
" ON\n",
" f.repo_name = r.repo\n",
" WHERE\n",
" f.path LIKE '%.py' AND --with python extension\n",
" c.size < 15000 AND --get rid of ridiculously long files\n",
" REGEXP_CONTAINS(c.content, r'def ') --contains function definition\n",
" GROUP BY\n",
" c.content\n",
" LIMIT\n",
" 10\n",
"\"\"\"\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-Processing Github Files\n",
"\n",
"In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd src\n",
"\n",
"python2 -m code_search.dataflow.cli.preprocess_github_dataset -h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the Dataflow Job for Pre-Processing\n",
"\n",
"See help above for a short description of each argument. The values are being taken from environment variables defined earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd src\n",
"\n",
"JOB_NAME=\"preprocess-github-dataset-$(date +'%Y%m%d-%H%M%S')\"\n",
"\n",
"python2 -m code_search.dataflow.cli.preprocess_github_dataset \\\n",
" --runner DataflowRunner \\\n",
" --project \"${PROJECT}\" \\\n",
" --target_dataset \"${TARGET_DATASET}\" \\\n",
" --data_dir \"${WORKING_DIR}/data\" \\\n",
" --job_name \"${JOB_NAME}\" \\\n",
" --temp_location \"${WORKING_DIR}/dataflow/temp\" \\\n",
" --staging_location \"${WORKING_DIR}/dataflow/staging\" \\\n",
" --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
" --num_workers \"${NUM_WORKERS}\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When completed successfully, this should create a dataset in `BigQuery` named `target_dataset`. Additionally, it also dumps CSV files into `data_dir` which contain training samples (pairs of function and docstrings) for our Tensorflow Model. A representative set of results can be viewed using the following query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT * \n",
" FROM \n",
" {}.token_pairs\n",
" LIMIT\n",
" 10\n",
"\"\"\".format(TARGET_DATASET)\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pipeline also writes a set of CSV files which contain function and docstring pairs delimited by a comma. Here, we list a subset of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/*.csv | head -n ${LIMIT}"
]
},
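{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the training samples, we can optionally peek at the first few rows of one of these CSV files. Each row is a comma-delimited function/docstring pair."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Optional: print the first few rows of one training CSV.\n",
"FIRST_CSV=$(gsutil ls ${WORKING_DIR}/data/*.csv | head -n 1)\n",
"gsutil cat ${FIRST_CSV} | head -n 5"
]
},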
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Dataset for Training\n",
"\n",
"We will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format.\n",
"\n",
"**TIP**: Use `ks show` to view the Resource Spec submitted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-datagen"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-datagen"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once this job finishes, the data directory should have a vocabulary file and a list of `TFRecords` prefixed by the problem name which in our case is `github_function_docstring_extended`. Here, we list a subset of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/vocab*\n",
"gsutil ls ${WORKING_DIR}/data/*train* | head -n ${LIMIT}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Execute Tensorflow Training\n",
"\n",
"Once, the `TFRecords` are generated, we will use `t2t-trainer` to execute the training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-trainer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-trainer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will generate TensorFlow model checkpoints which is illustrated below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/*ckpt*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export Tensorflow Model\n",
"\n",
"We now use `t2t-exporter` to export the `TFModel`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-exporter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-exporter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once completed, this will generate a TensorFlow `SavedModel` which we will further use for both online (via `TF Serving`) and offline inference (via `Kubeflow Batch Prediction`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/export/Servo"
]
},
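{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can inspect the signatures of the latest export with the `saved_model_cli` tool that ships with TensorFlow. This assumes `saved_model_cli` is available on the `PATH` of this notebook image and can read directly from GCS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Optional: inspect the signatures of the most recent export.\n",
"# Assumes saved_model_cli (shipped with TensorFlow) is available in this environment.\n",
"LATEST_EXPORT=$(gsutil ls ${WORKING_DIR}/output/export/Servo | tail -n 1)\n",
"saved_model_cli show --dir ${LATEST_EXPORT} --all"
]
},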
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute Function Embeddings\n",
"\n",
"In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd src\n",
"\n",
"python2 -m code_search.dataflow.cli.create_function_embeddings -h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configuration\n",
"\n",
"First, select a Exported Model version from the `${WORKING_DIR}/output/export/Servo` as seen above. This should be name of a folder with UNIX Seconds Timestamp like `1533685294`. Below, we automatically do that by selecting the folder which represents the latest timestamp."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash --out EXPORT_DIR_LS\n",
"\n",
"gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE \"([0-9]+)/$\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# WARNING: This routine will fail if no export has been completed successfully.\n",
"MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\\n') if ts])\n",
"\n",
"# DO NOT MODIFY. These are environment variables to be used in a bash shell.\n",
"%env MODEL_VERSION $MODEL_VERSION"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the Dataflow Job for Function Embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd src\n",
"\n",
"JOB_NAME=\"compute-function-embeddings-$(date +'%Y%m%d-%H%M%S')\"\n",
"PROBLEM=github_function_docstring_extended\n",
"\n",
"python2 -m code_search.dataflow.cli.create_function_embeddings \\\n",
" --runner DataflowRunner \\\n",
" --project \"${PROJECT}\" \\\n",
" --target_dataset \"${TARGET_DATASET}\" \\\n",
" --problem \"${PROBLEM}\" \\\n",
" --data_dir \"${WORKING_DIR}/data\" \\\n",
" --saved_model_dir \"${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}\" \\\n",
" --job_name \"${JOB_NAME}\" \\\n",
" --temp_location \"${WORKING_DIR}/dataflow/temp\" \\\n",
" --staging_location \"${WORKING_DIR}/dataflow/staging\" \\\n",
" --worker_machine_type \"${WORKER_MACHINE_TYPE}\" \\\n",
" --num_workers \"${NUM_WORKERS}\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each existing data sample available from the previous Dataflow Job. Additionally, it also dumps a CSV file containing metadata for each of the function and its embeddings. A representative query result is shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
" SELECT * \n",
" FROM \n",
" {}.function_embeddings\n",
" LIMIT\n",
" 10\n",
"\"\"\".format(TARGET_DATASET)\n",
"\n",
"gbq.read_gbq(query, dialect='standard', project_id=PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline also generates a set of CSV files which will be useful to generate the search index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"LIMIT=10\n",
"\n",
"gsutil ls ${WORKING_DIR}/data/*index*.csv | head -n ${LIMIT}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Search Index\n",
"\n",
"We now create the Search Index from the computed embeddings. This facilitates k-Nearest Neighbor search to for semantically similar results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c search-index-creator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c search-index-creator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the CSV files generated from the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples for a human-readable reverse lookup during the query, is also created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"gsutil ls ${WORKING_DIR}/code_search_index*"
]
},
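{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, below is a minimal, illustrative sketch of how an NMSLib index is built and queried in Python. The embedding arrays here are placeholders; the actual index is produced by the `search-index-creator` job above from the function embedding CSV files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: the real index is built by the search-index-creator job.\n",
"import numpy as np\n",
"import nmslib\n",
"\n",
"# Placeholder function embeddings (the real ones come from the Dataflow job output).\n",
"embeddings = np.random.rand(1000, 128).astype(np.float32)\n",
"\n",
"index = nmslib.init(method='hnsw', space='cosinesimil')\n",
"index.addDataPointBatch(embeddings)\n",
"index.createIndex({'post': 2}, print_progress=False)\n",
"\n",
"# Placeholder query embedding (the real one comes from the served model).\n",
"query_embedding = np.random.rand(128).astype(np.float32)\n",
"ids, distances = index.knnQuery(query_embedding, k=5)\n",
"print(ids)\n",
"print(distances)"
]
},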
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy an Inference Server\n",
"\n",
"We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported Tensorflow model above using [Tensorflow Serving](https://www.tensorflow.org/serving/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c t2t-code-search-serving"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c t2t-code-search-serving"
]
},
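{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check that the serving pods have come up before moving on. The name filter below is an assumption based on the component name; adjust it if your deployment uses different names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# Check that the TF Serving deployment is running.\n",
"# The name filter is an assumption based on the component name.\n",
"kubectl get pods,services | grep -i \"code-search\" || true"
]
},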
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Search UI\n",
"\n",
"We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks show code-search -c search-index-server"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd kubeflow\n",
"\n",
"ks apply code-search -c search-index-server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The service should now be available at FQDN of the Kubeflow cluster at path `/code-search/`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}