mirror of https://github.com/kubeflow/examples.git
New tensor2tensor problem datagen for function summarization (#127)
* New tensor2tensor problem for function summarization
* Consolidate README with improved docs
* Remove old readme
* Add T2T Trainer using Transformer Networks
* Fix missing requirement for t2t-trainer
This commit is contained in:
parent 17dd02b803
commit 6220907044
@@ -0,0 +1,95 @@
# Semantic Code Search

This demo implements End-to-End Semantic Code Search on Kubeflow. It is based on the public
GitHub dataset hosted on BigQuery.

## Prerequisites

* Python 2.7 (with `pip`)
* Python 3.6+ (with `pip3`)
* Python `virtualenv`

**NOTE**: `Apache Beam` lacks `Python 3` support, hence the need for both Python versions.

## Google Cloud Setup

* Install the [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI

* Set up Application Default Credentials
```
$ gcloud auth application-default login
```

* Enable Dataflow via the command line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```

* Create a Google Cloud Project and a Google Storage Bucket (a command-line sketch follows this section).

See the [Google Cloud Docs](https://cloud.google.com/docs/) for more details.
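For example, the project and bucket can be created from the command line. This is a minimal sketch; the project ID and bucket name below are placeholders, not values used elsewhere in this demo:

```
$ gcloud projects create my-code-search-project
$ gsutil mb -p my-code-search-project gs://my-code-search-bucket
```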

## Python Environment Setup

This demo needs multiple Python versions, and `virtualenv` is an easy way to
create isolated environments.

```
$ virtualenv -p $(which python2) venv2 && virtualenv -p $(which python3) venv3
```

This creates two environments, `venv2` and `venv3`, for `Python 2` and `Python 3` respectively.

To activate an environment, pick one of

```
$ source venv2/bin/activate    # for the Python 2 steps
$ source venv3/bin/activate    # for the Python 3 steps
```

See the [Virtualenv Docs](https://virtualenv.pypa.io/en/stable/) for more details.

# Pipeline

## 1. Data Pre-processing

This step takes in the public GitHub dataset and generates function and docstring token pairs.
Results are saved back into a BigQuery table.
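For intuition, each pair ultimately reaches the summarizer as an inputs/targets pair of token strings. A hypothetical example (illustrative only, not taken from the dataset):

```python
# Mirrors the {"inputs": ..., "targets": ...} dicts yielded by the
# tensor2tensor problem added later in this commit.
sample = {
    "inputs": "def add ( a , b ) : return a + b",  # function tokens
    "targets": "add two numbers",                  # docstring tokens
}
```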

* Install dependencies
```
(venv2) $ pip install -r preprocess/requirements.txt
```

* Execute the `Dataflow` job
```
$ python preprocess/scripts/process_github_archive.py -i files/select_github_archive.sql \
         -o code_search:function_docstrings -p kubeflow-dev -j process-github-archive \
         --storage-bucket gs://kubeflow-dev --machine-type n1-highcpu-32 --num-workers 16 \
         --max-num-workers 16
```
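The summarizer in the next step reads plain-text files named `{split}.function` and `{split}.docstring` from its data directory (see the problem code later in this commit). One way to produce them from the BigQuery table is sketched here; the helper and the column names `function_tokens` and `docstring_tokens` are assumptions and must be adjusted to the actual table schema:

```python
# Hypothetical export helper (not part of this commit): writes one example
# per line into {split}.function / {split}.docstring for t2t-datagen to read.
from google.cloud import bigquery

def export_pairs(data_dir, split='train', project='kubeflow-dev'):
  client = bigquery.Client(project=project)
  rows = client.query(
      'SELECT function_tokens, docstring_tokens '
      'FROM `kubeflow-dev.code_search.function_docstrings`').result()
  with open('{}/{}.function'.format(data_dir, split), 'w') as func_file, \
       open('{}/{}.docstring'.format(data_dir, split), 'w') as doc_file:
    for row in rows:
      func_file.write(row.function_tokens.replace('\n', ' ') + '\n')
      doc_file.write(row.docstring_tokens.replace('\n', ' ') + '\n')
```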

## 2. Function Summarizer

This part trains a model that summarizes functions into docstrings, using the data generated in the
previous step. It uses `tensor2tensor`.

* Install dependencies
```
(venv3) $ pip install -r summarizer/requirements.txt
```

* Generate `TFRecords` for training
```
(venv3) $ t2t-datagen --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
                      --data_dir=~/data --tmp_dir=/tmp
```

* Train the transduction model using `Transformer Networks` and the base hyper-parameter set
```
(venv3) $ t2t-trainer --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
                      --data_dir=~/data --model=transformer --hparams_set=transformer_base --output_dir=~/train
```
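Once training produces checkpoints in `--output_dir`, candidate summaries can be sampled interactively. This command is not part of the commit; it is a sketch assuming the standard `t2t-decoder` CLI with the same flags used above:

```
(venv3) $ t2t-decoder --t2t_usr_dir=summarizer/gh_function_summarizer --problem=github_function_summarizer \
                      --data_dir=~/data --model=transformer --hparams_set=transformer_base \
                      --output_dir=~/train --decode_interactive
```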

# Acknowledgements

This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

@@ -1,52 +0,0 @@
# Semantic Code Search

Pre-processing Pipeline package for End-to-End Semantic Code Search on Kubeflow

## Prerequisites

* Python 2.7 (with `pip`)
* Python `virtualenv`

**NOTE**: This package uses Google Cloud Dataflow, which only supports Python 2.7.

## Setup

* Setup Python Virtual Environment
```
$ virtualenv venv
$ source venv/bin/activate
```

* Install [`gcloud`](https://cloud.google.com/sdk/gcloud/) CLI

* Setup Application Default Credentials
```
$ gcloud auth application-default login
```

* Enable Dataflow via Command Line (or use the Google Cloud Console)
```
$ gcloud services enable dataflow.googleapis.com
```

* Build and install package
```
$ python setup.py build install
```

# Execution

Submit a `Dataflow` job using the following command

```
$ python scripts/process_github_archive.py -i files/select_github_archive.sql -o code_search:function_docstrings \
         -p kubeflow-dev -j process-github-archive --storage-bucket gs://kubeflow-dev \
         --machine-type n1-highcpu-32 --num-workers 16 --max-num-workers 16
```

**NOTE**: Make sure the Project and Google Storage Bucket are created.

# Acknowledgements

This project derives from [hamelsmu/code_search](https://github.com/hamelsmu/code_search).

@@ -0,0 +1 @@
from . import function_summarizer
@@ -0,0 +1,21 @@
import os

from tensor2tensor.utils import registry
from tensor2tensor.data_generators import text_problems


# Registered under the snake_cased class name, "github_function_summarizer",
# which is the value passed to the --problem flag of t2t-datagen and t2t-trainer.
@registry.register_problem
class GithubFunctionSummarizer(text_problems.Text2TextProblem):
  """Defines the problem of converting Python function code to its docstring."""

  @property
  def is_generate_per_split(self):
    # A single generated dataset is automatically partitioned into splits.
    return False

  def generate_samples(self, data_dir, _tmp_dir, dataset_split):  # pylint: disable=no-self-use
    """Returns a generator of {"inputs": [text], "targets": [text]} dicts."""

    # TODO(sanyamkapoor): Merge with validation set file "valid.{function|docstring}"
    functions_file_path = os.path.join(data_dir, '{}.function'.format(dataset_split))
    docstrings_file_path = os.path.join(data_dir, '{}.docstring'.format(dataset_split))

    return text_problems.text2text_txt_iterator(functions_file_path, docstrings_file_path)
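For reference, `text2text_txt_iterator` pairs the two files line by line. A minimal usage sketch (not part of this commit; assumes `train.function` and `train.docstring` exist with one tokenized example per line):

```python
from tensor2tensor.data_generators import text_problems

for sample in text_problems.text2text_txt_iterator('train.function', 'train.docstring'):
  print(sample['inputs'], '->', sample['targets'])
```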
@@ -0,0 +1,3 @@
tensorflow~=1.8.0
tensor2tensor~=1.6.0
oauth2client~=4.1.0