New tensor2tensor problem datagen for function summarization (#127)

* New tensor2tensor problem for function summarization

* Consolidate README with improved docs

* Remove old readme

* Add T2T Trainer using Transformer Networks

* Fix missing requirement for t2t-trainer
Sanyam Kapoor 2018-06-06 00:38:58 -07:00 committed by k8s-ci-robot
parent 17dd02b803
commit 6220907044
5 changed files with 120 additions and 52 deletions

code_search/README.md (new file, +95 lines)

@@ -0,0 +1,95 @@
# Semantic Code Search
This demo implements End-to-End Semantic Code Search on Kubeflow. It is based on the public
GitHub dataset hosted on BigQuery.
## Prerequisites
* Python 2.7 (with `pip`)
* Python 3.6+ (with `pip3`)
* Python `virtualenv`
**NOTE**: `Apache Beam` lacks `Python3` support, hence the need for multiple Python versions.
## Google Cloud Setup
* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI
* Set up Application Default Credentials
```
$ gcloud auth application-default login
```
* Enable Dataflow via Command Line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```
* Create a Google Cloud Project and a Google Cloud Storage bucket.
See the [Google Cloud Docs](https://cloud.google.com/docs/) for more.
## Python Environment Setup
This demo needs multiple Python versions, and `virtualenv` is an easy way to
create isolated environments.
```
$ virtualenv -p $(which python2) venv2 && virtualenv -p $(which python3) venv3
```
This creates two environments, `venv2` and `venv3` for `Python2` and `Python3` respectively.
To use either of the environments, activate the one you need:
```
$ source venv2/bin/activate    # for the Python2 environment
$ source venv3/bin/activate    # or, for the Python3 environment
```
See [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more.
# Pipeline
## 1. Data Pre-processing
This step takes in the public Github dataset and generates function and docstring token pairs.
Results are saved back into a BigQuery table.
* Install dependencies
```
(venv2) $ pip install -r preprocess/requirements.txt
```
* Execute the `Dataflow` job
```
(venv2) $ python preprocess/scripts/process_github_archive.py -i files/select_github_archive.sql \
-o code_search:function_docstrings -p kubeflow-dev -j process-github-archive \
--storage-bucket gs://kubeflow-dev --machine-type n1-highcpu-32 --num-workers 16 \
--max-num-workers 16
```
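The actual Beam transforms live in the `preprocess` package; purely as an illustration of what
the job computes per Python source file (the function below is hypothetical, not part of the
pipeline), the pair extraction amounts to:
```
import ast

def extract_pairs(source):
  """Yields (function_name, docstring) pairs from a Python source string.

  Illustrative only; the real job also tokenizes both sides and keeps
  the function body as the model input.
  """
  for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
      docstring = ast.get_docstring(node)
      if docstring:
        yield node.name, docstring
```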
## 2. Function Summarizer
This step uses `tensor2tensor` to train a model that summarizes functions into docstrings,
using the data generated in the previous step.
* Install dependencies
```
(venv3) $ pip install -r summarizer/requirements.txt
```
* Generate `TFRecords` for training (a sanity check for the problem name follows this list)
```
(venv3) $ t2t-datagen --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
--data_dir=~/data --tmp_dir=/tmp
```
* Train a transduction model using `Transformer Networks` and the base hyperparameter set (inspected in the sketch after this list)
```
(venv3) $ t2t-trainer --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
--data_dir=~/data --model=transformer --hparams_set=transformer_base --output_dir=~/train
```
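For reference, the `--problem` flag must resolve to the class registered in
`gh_function_summarizer` (added in this commit, shown below). A minimal sanity check, assuming
the `summarizer` directory is on `PYTHONPATH` so the user package imports cleanly:
```
from tensor2tensor.utils import registry

# Importing the user package runs the @registry.register_problem decorator.
import gh_function_summarizer  # noqa: F401

# tensor2tensor derives the snake_case problem name from the class name.
problem = registry.problem("github_function_summarizer")
print(type(problem).__name__)  # GithubFunctionSummarizer
```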
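`transformer_base` is one of the stock hyperparameter sets shipped with `tensor2tensor`; a
short sketch to inspect a few of its defaults (overrides can also be passed to `t2t-trainer`
via `--hparams`):
```
from tensor2tensor.models import transformer

# Build the default hyperparameter set used by --hparams_set=transformer_base.
hparams = transformer.transformer_base()
print(hparams.hidden_size, hparams.num_hidden_layers, hparams.num_heads)
```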
# Acknowledgements
This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

code_search/preprocess/README.md (deleted file, -52 lines)

@@ -1,52 +0,0 @@
# Semantic Code Search
Pre-processing Pipeline package for End-to-End Semantic Code Search on Kubeflow
## Prerequisites
* Python 2.7 (with `pip`)
* Python `virtualenv`
**NOTE**: This package uses Google Cloud Dataflow, which only supports Python 2.7.
## Setup
* Setup Python Virtual Environment
```
$ virtualenv venv
$ source venv/bin/activate
```
* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI
* Setup Application Default Credentials
```
$ gcloud auth application-default login
```
* Enable Dataflow via Command Line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```
* Build and install package
```
$ python setup.py build install
```
# Execution
Submit a `Dataflow` job using the following command
```
$ python scripts/process_github_archive.py -i files/select_github_archive.sql -o code_search:function_docstrings \
-p kubeflow-dev -j process-github-archive --storage-bucket gs://kubeflow-dev \
--machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```
**NOTE**: Make sure the Google Cloud Project and Google Cloud Storage bucket have been created.
# Acknowledgements
This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

code_search/summarizer/gh_function_summarizer/__init__.py (new file, +1 line)

@@ -0,0 +1 @@
from . import function_summarizer

code_search/summarizer/gh_function_summarizer/function_summarizer.py (new file, +21 lines)

@@ -0,0 +1,21 @@
import os

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class GithubFunctionSummarizer(text_problems.Text2TextProblem):
  """Defines the problem of converting Python function code to docstrings."""

  @property
  def is_generate_per_split(self):
    return False

  def generate_samples(self, data_dir, _tmp_dir, dataset_split):  # pylint: disable=no-self-use
    """Returns a generator of {"inputs": [text], "targets": [text]} dicts."""
    # TODO(sanyamkapoor): Merge with validation set file "valid.{function|docstring}"
    functions_file_path = os.path.join(data_dir, '{}.function'.format(dataset_split))
    docstrings_file_path = os.path.join(data_dir, '{}.docstring'.format(dataset_split))
    return text_problems.text2text_txt_iterator(functions_file_path, docstrings_file_path)
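
Given `generate_samples` above, `t2t-datagen`'s `--data_dir` is expected to already contain
line-aligned token files such as `train.function` and `train.docstring` (names inferred from
the format strings in the code). A minimal sketch of what `text2text_txt_iterator` yields from
such a pair:
```
from tensor2tensor.data_generators import text_problems

# Line i of train.function pairs with line i of train.docstring.
for sample in text_problems.text2text_txt_iterator('train.function', 'train.docstring'):
  # sample is a dict, e.g. {"inputs": "def add a b ...", "targets": "add two numbers"}
  print(sample['inputs'], '->', sample['targets'])
```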

code_search/summarizer/requirements.txt (new file, +3 lines)

@@ -0,0 +1,3 @@
tensorflow~=1.8.0
tensor2tensor~=1.6.0
oauth2client~=4.1.0