# [WIP] GitHub issue summarization workflow

## Prerequisites

### Get the input data and upload it to GCS

Download the input data from this location. The steps below assume the downloaded file is at `./github-issues.zip`.

Decompress the input data:

```
unzip ./github-issues.zip
```

For debugging purposes, consider reducing the size of the input data so the workflow executes much faster:

```
head -n 10000 ./github-issues.csv > ./github-issues-medium.csv
```

Compress the data using gzip (the workflow expects this format):

```
gzip ./github-issues-medium.csv
```
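The truncate-and-compress steps above can be rehearsed on a small synthetic file before running them on the real dump (all file names and row counts below are illustrative):

```shell
# Generate a synthetic stand-in for github-issues.csv (20000 rows).
seq 1 20000 | sed 's/^/issue_title,issue_body_/' > ./github-issues.csv

# Keep a 10000-row prefix, as in the debugging step above.
head -n 10000 ./github-issues.csv > ./github-issues-medium.csv

# Compress into the csv.gz format the workflow expects.
gzip -f ./github-issues-medium.csv

# Sanity checks: the archive is valid and still holds 10000 rows.
gzip -t ./github-issues-medium.csv.gz
zcat ./github-issues-medium.csv.gz | wc -l
```

`gzip -t` exits non-zero on a corrupt archive, so it is a cheap check to run before paying for the upload.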

Upload the data to GCS:

```
gsutil cp ./github-issues-medium.csv.gz gs://<MY_BUCKET>
```

## Building the container

Build the container and tag it so that it can be pushed to the GCP Container Registry:

```
docker build -f Dockerfile -t gcr.io/<GCP_PROJECT>/github_issue_summarization:v1 .
```

Push the container to the GCP Container Registry:

```
gcloud docker -- push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```

## Running the workflow

Run the workflow:

```
argo submit github_issues_summarization.yaml \
  -p bucket=<BUCKET_NAME> \
  -p bucket-key=<PATH_TO_INPUT_DATA_IN_BUCKET> \
  -p container-image=gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```

Where:

- `<BUCKET_NAME>` is the name of the GCS bucket where the input data is stored (e.g. `my_bucket_1234`).
- `<PATH_TO_INPUT_DATA_IN_BUCKET>` is the path to the input data in `csv.gz` format (e.g. `data/github_issues.csv.gz`).
- `<GCP_PROJECT>` is the name of the GCP project to which the container was pushed.
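When iterating on these parameters, it can help to assemble the submit command in a small wrapper script. The sketch below is a hypothetical helper, not part of this repository, and the default bucket, key, and project values are placeholders; it echoes the fully expanded command so it can be reviewed, and dropping the `echo` would actually submit it:

```shell
#!/usr/bin/env sh
# Hypothetical wrapper around `argo submit`; all default values are placeholders.
BUCKET="${BUCKET:-my_bucket_1234}"
BUCKET_KEY="${BUCKET_KEY:-data/github_issues.csv.gz}"
GCP_PROJECT="${GCP_PROJECT:-my-gcp-project}"

# Backslash-newlines inside the double quotes are collapsed, so this
# builds a single-line command string.
SUBMIT_CMD="argo submit github_issues_summarization.yaml \
  -p bucket=${BUCKET} \
  -p bucket-key=${BUCKET_KEY} \
  -p container-image=gcr.io/${GCP_PROJECT}/github_issue_summarization:v1"

# Print the expanded command for review instead of executing it directly.
echo "$SUBMIT_CMD"
```

Overriding any of `BUCKET`, `BUCKET_KEY`, or `GCP_PROJECT` in the environment changes the generated command without editing the script.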

The data generated by the workflow is stored in the default artifact repository configured for your Argo installation.

The logs can be read using the `argo get` and `argo logs` commands.