Jeremy Lewi 1043bc0c26 A bunch of changes to support distributed training using tf.estimator (#265)
* Unify the code for training with Keras and TF.Estimator

Create a single train.py and trainer.py which use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work yet.

See #196
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
We've added unit tests that can be used to verify that training with TF.Estimator works. The tests
can also be used to reproduce the current errors with TF.Estimator.
Add a Makefile to build the Docker image

Add an NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet app.

Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py that waits until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (#186)

The code wasn't actually calling model.fit.
Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.

* Address comments.
2018-11-07 16:23:59 -08:00

README.md

Demo

This folder contains the resources needed by the Kubeflow DevRel team to set up a public demo of the GitHub Issue Summarization example.

Public gh-demo.kubeflow.org

We currently run a public instance of the UI at gh-demo.kubeflow.org.

The current setup is as follows:

PROJECT=kubecon-gh-demo-1
CLUSTER=gh-demo-1003
ZONE=us-east1-d
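A quick sketch of pointing local tooling at this cluster; the gcloud call is shown as a comment since it needs access to the project, and the variable names come from the block above:

```shell
# Demo cluster coordinates (from the setup above)
PROJECT=kubecon-gh-demo-1
CLUSTER=gh-demo-1003
ZONE=us-east1-d

# To fetch kubectl credentials (requires access to the project), run:
# gcloud container clusters get-credentials "${CLUSTER}" --zone="${ZONE}" --project="${PROJECT}"

echo "Using cluster ${CLUSTER} in ${PROJECT}/${ZONE}"
```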

Directory contents

  • gh-app - This contains the ksonnet app for deploying the public instance of the model and UI.
  • gh-demo-1003 - This is the app created by kfctl.

Setting up the demo

Here are the instructions for setting up the demo.

  1. Follow the GKE instructions for deploying Kubeflow

    • If you are using PROJECT kubecon-gh-demo-1 you can reuse the existing OAuth client
      • Use the Cloud console to look up the client ID and secret and set the corresponding environment variables

      • You will also need to add an authorized redirect URI for the new Kubeflow deployment
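As a sketch, the deployment scripts typically read the OAuth client from environment variables named CLIENT_ID and CLIENT_SECRET (an assumption based on the standard GKE/IAP setup; the values below are placeholders to replace with the real ones from the Cloud console):

```shell
# Placeholder values - copy the real client ID and secret from the Cloud console.
export CLIENT_ID="0000000000-example.apps.googleusercontent.com"
export CLIENT_SECRET="replace-me"
```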

  2. Follow the instructions to set up an NFS share

    • This is needed for distributed training with the TF.Estimator example
  3. Create a static IP for serving gh-demo.kubeflow.org

    gcloud --project=${PROJECT} deployment-manager deployments create --config=gh-demo-dm-config.yaml gh-public-ui
    
  4. Update the Cloud DNS record gh-demo.kubeflow.org in project kubeflow-dns to use the new static IP.

  5. Create a namespace for serving the UI and model

    kubectl create namespace gh-public
    
  6. Deploy the Seldon controller in the namespace that will serve the public model

    cd gh-demo-1003/ks_app
    ks env add gh-public --namespace=gh-public
    ks generate seldon seldon
    ks apply gh-public -c seldon
    
  7. Create a secret with a GitHub token

    • Follow GitHub's instructions to create a token

    • Then run the following command to create the secret

      kubectl -n gh-public create secret generic github-token --from-literal=github-token=${GITHUB_TOKEN}
      
  8. Deploy the public UI and model

    cd gh-app
    ks env add gh-public --namespace=gh-public
    ks apply gh-public
    

Training and deploying the model

We use the ksonnet app in github/kubeflow/examples/github_issue_summarization/ks-kubeflow

The current environment is:

export ENV=gh-demo-1003

Set a bucket for the job output

DAY=$(date +%Y%m%d)
ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_bucket kubecon-gh-demo
ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_path gh-demo/${DAY}/output
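The two params above combine into a dated GCS output location; a minimal sketch of the resulting URI:

```shell
# Reconstruct the output URI implied by output_model_gcs_bucket and output_model_gcs_path
BUCKET=kubecon-gh-demo
DAY=$(date +%Y%m%d)
OUTPUT_URI="gs://${BUCKET}/gh-demo/${DAY}/output"
echo "${OUTPUT_URI}"
```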

Run the job

ks apply ${ENV} -c tfjob-v1alpha2

Using TF Estimator with Keras

  1. Copy the data to the GCFS mount by launching a notebook and then running the following commands

    !mkdir -p /mnt/kubeflow-gcfs/gh-demo/data
    !gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
    !gsutil cp gs://kubeflow-examples/github-issue-summarization-data/github-issues.zip /mnt/kubeflow-gcfs/gh-demo/data
    !unzip /mnt/kubeflow-gcfs/gh-demo/data/github-issues.zip
    !cp github_issues.csv /mnt/kubeflow-gcfs/gh-demo/data/
    
    • TODO(jlewi): Can we modify the existing job that downloads data to a PVC to do this?
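Before launching the estimator job, a small guard like the following (a hypothetical helper, not part of the repo) can confirm the CSV actually landed on the shared mount:

```shell
# Succeeds only if the given file exists and is non-empty.
data_ready() {
  test -s "$1"
}

if data_ready /mnt/kubeflow-gcfs/gh-demo/data/github_issues.csv; then
  echo "data ready"
else
  echo "data missing - rerun the copy step above"
fi
```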
  2. Run the estimator job

    ks apply ${ENV} -c tfjob-estimator
    
  3. Run TensorBoard

    ks apply ${ENV} -c tensorboard-pvc-tb