examples/github_issue_summarization
Jeremy Lewi 1043bc0c26 A bunch of changes to support distributed training using tf.estimator (#265)
* Unify the code for training with Keras and TF.Estimator

Create a single train.py and trainer.py that use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work yet.
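For context, the usual TF 1.x pattern for driving a Keras model through the Estimator API is sketched below. This is a minimal, hypothetical example (toy model and input_fn, not the code in this PR); the real seq2seq architecture lives in trainer.py:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the real seq2seq model defined in trainer.py.
inputs = tf.keras.layers.Input(shape=(100,), name='tokens')
hidden = tf.keras.layers.Dense(64, activation='relu')(inputs)
model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(hidden))
model.compile(optimizer='adam', loss='mse')

def input_fn():
    # Toy in-memory data; feature keys must match the Keras input layer names.
    features = np.random.rand(32, 100).astype(np.float32)
    labels = np.random.rand(32, 1).astype(np.float32)
    dataset = tf.data.Dataset.from_tensor_slices(({'tokens': features}, labels))
    return dataset.batch(8).repeat()

# Wrap the compiled Keras model in an Estimator. In a TFJob,
# train_and_evaluate reads TF_CONFIG from the environment to run
# distributed training.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=10)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=2)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```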

See #196
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
We've added unit tests that can be used to verify that training with TF.Estimator works; they
can also be used to reproduce the current errors with TF.Estimator (see the sketch below).
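A smoke test along the following lines (a self-contained, hypothetical sketch; the real test exercises trainer.py) is enough to surface such errors:

```python
import numpy as np
import tensorflow as tf


class EstimatorTrainTest(tf.test.TestCase):
    """Smoke test: training through the Estimator API should not raise."""

    def test_train(self):
        # Toy model standing in for the real seq2seq architecture.
        inputs = tf.keras.layers.Input(shape=(4,), name='x')
        model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(inputs))
        model.compile(optimizer='adam', loss='mse')
        estimator = tf.keras.estimator.model_to_estimator(
            keras_model=model, model_dir=self.get_temp_dir())

        def input_fn():
            features = np.random.rand(16, 4).astype(np.float32)
            labels = np.random.rand(16, 1).astype(np.float32)
            return tf.data.Dataset.from_tensor_slices(
                ({'x': features}, labels)).batch(4).repeat()

        # If the Keras/Estimator integration is broken, the failure
        # surfaces in this call.
        estimator.train(input_fn=input_fn, steps=2)


if __name__ == '__main__':
    tf.test.main()
```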
Add a Makefile to build the Docker image

Add an NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet app.

Changes to distributed/train.py as part of merging it with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them (sketched after this list).
* Remove the code at the start of train.py that waits until the input data becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available.
* That should be unnecessary since we can just run the preprocessing job as a separate job.
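A minimal sketch of that argument handling (the flag names here are illustrative, not necessarily the ones defined in train.py):

```python
import argparse


def parse_args():
    """Read input/output paths from the command line instead of hard coding them."""
    parser = argparse.ArgumentParser(
        description='Train the GitHub issue summarization model.')
    # Hypothetical flag names; see train.py for the real interface.
    parser.add_argument('--input_data', required=True,
                        help='Path to the preprocessed issues/titles data.')
    parser.add_argument('--output_model', required=True,
                        help='Directory to write the trained model to.')
    return parser.parse_args()
```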

Fix notebooks/train.py (#186)

The code wasn't actually calling model.fit.
Add a unit test to verify we can invoke fit and evaluate without throwing exceptions (sketched below).
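A self-contained sketch of such a test, with a toy model and data standing in for the real ones:

```python
import unittest

import numpy as np
import tensorflow as tf


class FitEvaluateTest(unittest.TestCase):
    """Verify fit and evaluate run without throwing exceptions (#186)."""

    def test_fit_and_evaluate(self):
        model = tf.keras.models.Sequential(
            [tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer='adam', loss='mse')
        x = np.random.rand(16, 4).astype(np.float32)
        y = np.random.rand(16, 1).astype(np.float32)
        # The bug fixed here was that fit was never actually invoked.
        model.fit(x, y, epochs=1, batch_size=4, verbose=0)
        model.evaluate(x, y, verbose=0)


if __name__ == '__main__':
    unittest.main()
```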

* Address comments.
2018-11-07 16:23:59 -08:00
demo A bunch of changes to support distributed training using tf.estimator (#265) 2018-11-07 16:23:59 -08:00
docker Fix gh-demo.kubeflow.org and make it easy to setup. (#261) 2018-10-15 08:36:11 -07:00
hp-tune Create a deployment to run the HP/Katib controller for the GitHub issue example. (#161) 2018-07-11 08:46:25 -07:00
ks-kubeflow A bunch of changes to support distributed training using tf.estimator (#265) 2018-11-07 16:23:59 -08:00
notebooks A bunch of changes to support distributed training using tf.estimator (#265) 2018-11-07 16:23:59 -08:00
scripts Update PVC to /home/jovyan (#119) 2018-07-13 14:39:26 -07:00
sql Remove third_party folder & MIT license file 2018-02-27 13:17:42 -05:00
workflow Add .pylintrc (#61) 2018-03-29 08:25:02 -07:00
01_setup_a_kubeflow_cluster.md docs updated (#240) 2018-09-24 15:07:27 -07:00
02_distributed_training.md A bunch of changes to support distributed training using tf.estimator (#265) 2018-11-07 16:23:59 -08:00
02_training_the_model.md Fixed broken link in github issue summarization example (#235) 2018-08-26 18:01:31 -07:00
02_training_the_model_tfjob.md Fix model file upload (#160) 2018-06-29 18:41:20 -07:00
03_serving_the_model.md Edit navigation and markdown for github example (#93) 2018-05-09 12:12:54 -07:00
04_querying_the_model.md Edit navigation and markdown for github example (#93) 2018-05-09 12:12:54 -07:00
05_teardown.md Edit navigation and markdown for github example (#93) 2018-05-09 12:12:54 -07:00
README.md Add estimator example for github issues (#203) 2018-08-24 18:10:27 -07:00
requirements.txt Remove third_party folder & MIT license file 2018-02-27 13:17:42 -05:00

README.md

End-to-End Kubeflow tutorial using a Sequence-to-Sequence model

This example demonstrates how you can use Kubeflow end-to-end to train and serve a Sequence-to-Sequence model on an existing Kubernetes cluster. This tutorial is based on @hamelsmu's article "How To Create Data Products That Are Magical Using Sequence-to-Sequence Models".

Goals

There are two primary goals for this tutorial:

  • Demonstrate an end-to-end Kubeflow example
  • Present an end-to-end Sequence-to-Sequence model

By the end of this tutorial, you should learn how to:

  • Set up a Kubeflow cluster on an existing Kubernetes deployment
  • Spawn a Jupyter Notebook on the cluster
  • Set up shared persistent storage across the cluster to store large datasets
  • Train a Sequence-to-Sequence model using TensorFlow and GPUs on the cluster
  • Serve the model using Seldon Core
  • Query the model from a simple front-end application (see the sketch after this list)
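As an illustration of the last step, a model served by Seldon Core can be queried over REST. This is a rough sketch assuming the Seldon v0.1 prediction endpoint, with a hypothetical host and deployment name:

```python
import requests

# Hypothetical endpoint; substitute your cluster's ingress host and the
# name of your Seldon deployment.
URL = 'http://CLUSTER_HOST/seldon/issue-summarization/api/v0.1/predictions'

# Seldon's REST API wraps inputs in a 'data' payload; here the model
# takes the raw issue body text and returns a suggested title.
payload = {'data': {'ndarray': [['fix the broken link in the README']]}}

response = requests.post(URL, json=payload)
print(response.json())
```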

Steps:

  1. Setup a Kubeflow cluster
  2. Training the model, using either a Jupyter Notebook or a TFJob
  3. Serving the model
  4. Querying the model
  5. Teardown