# [WIP] GitHub issue summarization workflow
## Prerequisites
- Create a GKE cluster and configure `kubectl`.
- Install Argo.
- Configure the default artifact repository (a setup sketch follows this list).
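The exact setup depends on your environment. The following is a minimal sketch, assuming a GKE cluster named `github-summarization` in zone `us-central1-a` and Argo installed in the `argo` namespace; the cluster name, zone, namespace, and manifest URL are placeholders, not values prescribed by this workflow:

```
# Create a GKE cluster and point kubectl at it (name and zone are placeholders).
gcloud container clusters create github-summarization --zone us-central1-a
gcloud container clusters get-credentials github-summarization --zone us-central1-a

# Install the Argo workflow controller; use the install manifest that matches
# your Argo version (URL is a placeholder).
kubectl create namespace argo
kubectl apply -n argo -f <ARGO_INSTALL_MANIFEST_URL>

# Configure the default artifact repository (e.g. a GCS bucket) by editing the
# workflow controller ConfigMap in the namespace where Argo is installed.
kubectl edit configmap workflow-controller-configmap -n argo
```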
## Get the input data and upload it to GCS
Get the input data from this location. In the following, we assume that the file path is `./github-issues.zip`.
Decompress the input data:

```
unzip ./github-issues.zip
```
For debugging purposes, consider reducing the size of the input data; the workflow will execute much faster:

```
cat ./github-issues.csv | head -n 10000 > ./github-issues-medium.csv
```
Compress the data using gzip (the workflow expects gzip-compressed input):

```
gzip ./github-issues-medium.csv
```
Upload the data to GCS:

```
gsutil cp ./github-issues-medium.csv.gz gs://<MY_BUCKET>
```
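To confirm that the upload succeeded, you can list the object (same bucket placeholder as above):

```
gsutil ls gs://<MY_BUCKET>/github-issues-medium.csv.gz
```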
## Building the container
Build the container and tag it so that it can be pushed to a GCP container registry:

```
docker build -f Dockerfile -t gcr.io/<GCP_PROJECT>/github_issue_summarization:v1 .
```
Push the container to the GCP container registry:

```
gcloud docker -- push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```
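`gcloud docker` is deprecated in newer gcloud releases. An alternative sketch, assuming a recent Docker client, is to configure Docker credentials for `gcr.io` once and push directly:

```
gcloud auth configure-docker
docker push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```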
## Running the workflow
Run the workflow:

```
argo submit github_issues_summarization.yaml \
  -p bucket=<BUCKET_NAME> \
  -p bucket-key=<BUCKET_KEY> \
  -p container-image=gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```
Where:
- `<BUCKET_NAME>` is the name of a GCS bucket where the input data is stored (e.g. `my_bucket_1234`).
- `<BUCKET_KEY>` is the path to the input data in `csv.gz` format (e.g. `data/github_issues.csv.gz`).
- `<GCP_PROJECT>` is the name of the GCP project where the container was pushed.
The data generated by the workflow will be stored in the default artifact repository configured in the prerequisites section.
The logs can be read using the `argo get` and `argo logs` commands.
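For example (the workflow name is a placeholder; `argo submit` prints the generated name of your workflow):

```
argo list                  # list submitted workflows
argo get <WORKFLOW_NAME>   # show the status of each step
argo logs <WORKFLOW_NAME>  # print logs (older Argo versions take a pod name instead)
```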