New tensor2tensor problem datagen for function summarization (#127)

* New tensor2tensor problem for function summarization

* Consolidate README with improved docs

* Remove old readme

* Add T2T Trainer using Transformer Networks

* Fix missing requirement for t2t-trainer
Sanyam Kapoor 2018-06-06 00:38:58 -07:00 committed by k8s-ci-robot
parent 17dd02b803
commit 6220907044
5 changed files with 120 additions and 52 deletions

code_search/README.md (new file, +95 lines)

@@ -0,0 +1,95 @@
# Semantic Code Search
This demo implements End-to-End Semantic Code Search on Kubeflow. It is based on the public
GitHub dataset hosted on BigQuery.
## Prerequisites
* Python 2.7 (with `pip`)
* Python 3.6+ (with `pip3`)
* Python `virtualenv`
**NOTE**: `Apache Beam` lacks `Python3` support, hence the need for multiple Python versions.
## Google Cloud Setup
* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI
* Set up Application Default Credentials
```
$ gcloud auth application-default login
```
* Enable Dataflow via Command Line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```
* Create a Google Cloud Project and a Google Cloud Storage bucket.
See the [Google Cloud Docs](https://cloud.google.com/docs/) for more.
## Python Environment Setup
This demo needs multiple Python versions, and `virtualenv` is an easy way to
create isolated environments.
```
$ virtualenv -p $(which python2) venv2 && virtualenv -p $(which python3) venv3
```
This creates two environments, `venv2` and `venv3` for `Python2` and `Python3` respectively.
To use either of the environments, activate the one you need:
```
$ source venv2/bin/activate    # for the Python2 environment
$ source venv3/bin/activate    # or, for the Python3 environment
```
See [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more.
# Pipeline
## 1. Data Pre-processing
This step takes in the public Github dataset and generates function and docstring token pairs.
Results are saved back into a BigQuery table.
* Install dependencies
```
(venv2) $ pip install -r preprocess/requirements.txt
```
* Execute the `Dataflow` job
```
(venv2) $ python preprocess/scripts/process_github_archive.py -i files/select_github_archive.sql \
-o code_search:function_docstrings -p kubeflow-dev -j process-github-archive \
--storage-bucket gs://kubeflow-dev --machine-type n1-highcpu-32 --num-workers 16 \
--max-num-workers 16
```
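The actual Beam transforms live in the `preprocess` package; purely as an illustration of what
the job computes per Python source file (the function below is hypothetical, not part of the
pipeline), the pair extraction amounts to:
```
import ast

def extract_pairs(source):
  """Yields (function_name, docstring) pairs from a Python source string.

  Illustrative only; the real job also tokenizes both sides and keeps
  the function body as the model input.
  """
  for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
      docstring = ast.get_docstring(node)
      if docstring:
        yield node.name, docstring
```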
## 2. Function Summarizer
This step uses `tensor2tensor` to train a model that summarizes functions into docstrings,
using the data generated in the previous step.
* Install dependencies
```
(venv3) $ pip install -r summarizer/requirements.txt
```
* Generate `TFRecords` for training (a sanity check for the problem name follows this list)
```
(venv3) $ t2t-datagen --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
--data_dir=~/data --tmp_dir=/tmp
```
* Train a transduction model using `Transformer Networks` and the base hyperparameter set (inspected in the sketch after this list)
```
(venv3) $ t2t-trainer --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
--data_dir=~/data --model=transformer --hparams_set=transformer_base --output_dir=~/train
```
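For reference, the `--problem` flag must resolve to the class registered in
`gh_function_summarizer` (added in this commit, shown below). A minimal sanity check, assuming
the `summarizer` directory is on `PYTHONPATH` so the user package imports cleanly:
```
from tensor2tensor.utils import registry

# Importing the user package runs the @registry.register_problem decorator.
import gh_function_summarizer  # noqa: F401

# tensor2tensor derives the snake_case problem name from the class name.
problem = registry.problem("github_function_summarizer")
print(type(problem).__name__)  # GithubFunctionSummarizer
```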
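`transformer_base` is one of the stock hyperparameter sets shipped with `tensor2tensor`; a
short sketch to inspect a few of its defaults (overrides can also be passed to `t2t-trainer`
via `--hparams`):
```
from tensor2tensor.models import transformer

# Build the default hyperparameter set used by --hparams_set=transformer_base.
hparams = transformer.transformer_base()
print(hparams.hidden_size, hparams.num_hidden_layers, hparams.num_heads)
```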
# Acknowledgements
This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

code_search/preprocess/README.md (deleted file, -52 lines)

@@ -1,52 +0,0 @@
# Semantic Code Search
Pre-processing Pipeline package for End-to-End Semantic Code Search on Kubeflow
## Prerequisites
* Python 2.7 (with `pip`)
* Python `virtualenv`
**NOTE**: This package uses Google Cloud Dataflow, which only supports Python 2.7.
## Setup
* Setup Python Virtual Environment
```
$ virtualenv venv
$ source venv/bin/activate
```
* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI
* Setup Application Default Credentials
```
$ gcloud auth application-default login
```
* Enable Dataflow via Command Line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```
* Build and install package
```
$ python setup.py build install
```
# Execution
Submit a `Dataflow` job using the following command
```
$ python scripts/process_github_archive.py -i files/select_github_archive.sql -o code_search:function_docstrings \
-p kubeflow-dev -j process-github-archive --storage-bucket gs://kubeflow-dev \
--machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```
**NOTE**: Make sure the Google Cloud Project and Google Cloud Storage bucket have been created.
# Acknowledgements
This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

code_search/summarizer/gh_function_summarizer/__init__.py (new file, +1 line)

@@ -0,0 +1 @@
from . import function_summarizer

code_search/summarizer/gh_function_summarizer/function_summarizer.py (new file, +21 lines)

@@ -0,0 +1,21 @@
import os

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class GithubFunctionSummarizer(text_problems.Text2TextProblem):
  """Defines the problem of converting Python function code to docstrings."""

  @property
  def is_generate_per_split(self):
    return False

  def generate_samples(self, data_dir, _tmp_dir, dataset_split):  # pylint: disable=no-self-use
    """Returns a generator of {"inputs": [text], "targets": [text]} dicts."""
    # TODO(sanyamkapoor): Merge with validation set file "valid.{function|docstring}"
    functions_file_path = os.path.join(data_dir, '{}.function'.format(dataset_split))
    docstrings_file_path = os.path.join(data_dir, '{}.docstring'.format(dataset_split))
    return text_problems.text2text_txt_iterator(functions_file_path, docstrings_file_path)
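
Given `generate_samples` above, `t2t-datagen`'s `--data_dir` is expected to already contain
line-aligned token files such as `train.function` and `train.docstring` (names inferred from
the format strings in the code). A minimal sketch of what `text2text_txt_iterator` yields from
such a pair:
```
from tensor2tensor.data_generators import text_problems

# Line i of train.function pairs with line i of train.docstring.
for sample in text_problems.text2text_txt_iterator('train.function', 'train.docstring'):
  # sample is a dict, e.g. {"inputs": "def add a b ...", "targets": "add two numbers"}
  print(sample['inputs'], '->', sample['targets'])
```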

code_search/summarizer/requirements.txt (new file, +3 lines)

@@ -0,0 +1,3 @@
tensorflow~=1.8.0
tensor2tensor~=1.6.0
oauth2client~=4.1.0