examples/mnist/data
Elson Rodriguez 1be7ccb142 Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42)
* Add awscli tools container.

* Add initial readme.

* Add argo skeleton.

* Run a an argo job.

* Artifact support and argo test

* Use built container (#3)

* Fix artifacts and secrets

* Add work in progress tfflow (#14)

* Add kvc deployment to workflow.

* Switch aws repo.

* wip.

* Add working tfflow job.

* Add sidecar that waits for MASTER completion

* Pass in job-name

* Add volumemanager info step

* Add input parameters to step

* Adds nodeaffinity and hostpath

* Add fixes for workflow (#17)

- Use correct images for worker and ps
- Use correct aws keys
- Change volumemanager to mnist
- Comment unused steps
- Fix volume mount to correct containers

* Fix hostpath for tfjob

* Download all mnist files

* added GCS stored artifacts comptability to Argo

* Add initial inference workflow. (#30)

* Initial serving step (#31)

* Adds fixes to initial serving step

* Ready for rough demo: Workflow in working state

* Move conflicting readme.

* Initial commit, everything boots without crashing.

* Working, with some python errors.

* Adding explicit flags

* Working with ins-outs

* Letting training job exit on success

* Adding documentation skeletion

* trying to properly save model

* Almost working

* Working

* Adding export script, refactored to allow model more reusability

* Starting documentation

* little further on docs

* More doc updates, fixing sleep logic

* adding urls for mnist data

* Removing download logic, it's to tied in with build-in tf examples.

* Added argo workflow instructions, minor cleanups.

* Adding mnist client.

* Fixing typos

* Adding instructions for installing components.

* Added ksonnet container

* Adding new entrypoint.

* Added helm install instructions for kvc

* doing things with variables

* Typos.

* Added better namespace support

* S3 refactor.

* Added missing region variables.

* Adding tensorboard support.

* Addding Container for Tensorboard.

* Added temporary flag, added install instructions for CLI.

* Removing invalid ksonnet environment.

* Updating readme

* Cleanup currently unused pieces

* Add missint cluster-role

* Minor cleanup.

* Adding more parameters.

* added changes to allow model to train on multiple workers and fixed some doc typos

* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.

* Adding seperate deployer workflow.

* Split serving working.

* Adding split workflow.

* More parameters.

* updates as to elson comments

* Revert "added changes to allow model to train on multiple workers and fixed s…"

* Initial working pure-s3 workflow.

* Removed wait sidecars.

* Remove unused flag.

* Added part two, minor doc fixes

* Inverted links...

* Adding diff.

* Fix url syntax

* Documentation updates.

* Added AWS Cli

* Parameterized export.

* Fixing image in s3 version.

* Fixed documentation issues.

* KVC snippet changes, need to find last working helm chart.

* Temporarily pinning kvc version.

* working master model and some doc typos fixes (#13)

* added changes to allow model to train on multiple workers and fixed some doc typos

* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.

* Adding seperate deployer workflow.

* Split serving working.

* Adding split workflow.

* More parameters.

* updates as to elson comments

* working master model and some doc typos

* fixes as to Elson

* Removign whitespace differences

* updating diff

* Changing parameters.

* Undoing whitespace.

* Changing termination policy on s3 version due to unknown issue.

* Updating mnist diff.

* Changing train steps.

* Syncing Demo changes.

* Update README.md

* Going S3-native for initial example. Getting rid of Master.

* Minor documentation tweaks, adding params, swapping aws cli for minio.

* Updating KVC version.

* Switching ksonnet repo, removing model name from client.

* Updating git url.

* Adding certificate hack to avoid RBAC errors.

* Pinning KVC to commit while working on PR.

* Updating version.

* Updates README with additional details (#14)

* Updates README with additional details

* Adding clarity to kubectl config commands

* Fixed comma placement

* Refactoring notes for github and kubernetes credentials.

* Forgot to add an overview of the argo template.

* Updating example based on feedback.

- Removed superflous images
- Clarified use of KVC
- Added unaltered model
- Variable cleanup

* Refactored grpc image into generic base image.

* minor cleanup of resubmitting section.

* Switching Argo deployment to ksonnet, conslidating install instructions.

* Removing old cruft, clarifying cluster requirements.

* [WIP] Switching out model (#15)

* Switching to new mnist example.

* Parameterized model, testing export.

* Got CNN model exporting.

* Attempting to do distributed training with Estimator, removed seperate export.

* Adding master back, otherwise Estimator complains about not having a chief.

* Switching to tf.estimator.train_and_evaluate.

* Minor path/var name refactor.

* Adding test data and new client.

* Fixed documentation to reflect new client.

* Getting rid of tf job shim.

* Removing KVC from example, renaming directory

* Modifying parent README

* Removed reference to export.

* Adding reference to export.

* Removing unused Dockerfile.

* Removing uneeded files, simplifying how to get status, refactor model serving workflow step.

* Renaming directory

* Minor doc improvements, removed extra clis.

* Making SSL configurable for clusters without secured s3 endpoints.

* Added a tf-user account for workflow. Fixed serving bug.

* Updating gke version.

* Re-ran through instructions, fixed errata.

* Fixing lint issues

* Pylint errors

* Pylint errors

* Adding parenthesis back.

* pylint Hacks

* Disabling argument filter, model bombs without empty arg.

* Removing unneeded lambdas
2018-04-06 14:34:09 -07:00
..
0.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
1.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
2.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
3.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
4.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
5.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
6.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
7.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
8.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00
9.png Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42) 2018-04-06 14:34:09 -07:00