* edit TF example readme
* prefix tutorial steps with a number for nicer display in repo
* fix typo
* edit steps 4 and 5
* edit docs
* add navigation and formatting edits to example
* Improvements to the tensor2tensor trainer for the GitHub summarization example.
* Simplify the launcher; we can just pass through most command line arguments
instead of mixing environment variables and command line arguments.
* This makes it easier to control the job just by setting the parameters in the template
rather than having to rebuild the images.
* Add a Makefile to build the image.
* Replace the tensor2tensor jsonnet with a newer version of the jsonnet used with T2T.
* Address reviewer comments.
* Install pip packages as user Jovyan
* Rely on implicit string conversion with concatenation in template file.
* Add a component to run TensorBoard.
* Autoformat jsonnet file.
* Set a default of "" for logDir; there's no good default location
because it will depend on where the data is stored.
* Make it easier to demo serving and run in Katacoda
* Allow the model path to be specified via environment variables so that
we could potentially load the model from PVC.
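The env-var override described above can be sketched as follows; the variable name and default path are illustrative, not the ones used by the actual serving component:

```python
import os

# Sketch: the image bakes a model into a default location so serving
# works without training, but a PVC-mounted copy can override it at
# runtime via an environment variable. MODEL_PATH and the default
# path below are assumptions for illustration.
def resolve_model_path():
    return os.environ.get("MODEL_PATH", "/opt/model/baked")
```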
* Continue to bake the model into the image so that we don't need to train
in order to serve.
* Parameterize download_data.sh so we could potentially fetch different sources.
* Update the Makefile so that we can build and set the image for the serving
component.
* Fix lint.
* Update the serving docs.
* Support training using a PVC for the data.
* This will make it easier to run the example on Katacoda and non-GCP platforms.
* Modify train.py so we can use a GCS location or local file paths.
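One common way to handle both GCS locations and local file paths is to branch on the URL scheme; this is a hedged sketch of the idea, not the actual train.py code:

```python
def is_gcs_path(path):
    # GCS locations are identified by the gs:// scheme
    return path.startswith("gs://")

def read_text(path):
    if is_gcs_path(path):
        # In TensorFlow code this branch is often unnecessary:
        # tf.gfile.GFile opens gs:// URLs and local paths uniformly.
        raise NotImplementedError("use tf.gfile or a GCS client here")
    with open(path) as f:
        return f.read()
```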
* Update the Dockerfile. The Jupyter Docker images had a number of
dependencies removed, and the latest images no longer include the
dependencies needed to run the examples.
* Create a tfjob-pvc component that trains reading/writing using a PVC
and not GCP.
* Address reviewer comments
* Ignore changes to the ksonnet parameters when determining whether to include
the dirty suffix and diff sha in the image tag. This way we can update the
ksonnet app with the newly built image without causing subsequent
images to be marked dirty.
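The filtering logic could look roughly like this: a sketch over `git status --porcelain` output, in which only the params.libsonnet suffix is assumed; the real build script may differ:

```python
def repo_is_dirty(porcelain_lines, ignored_suffixes=("params.libsonnet",)):
    # A change counts as dirty only if some modified file is NOT a
    # ksonnet parameter file; updating the image param alone should
    # not retrigger a "dirty" image tag.
    changed = [line.split(None, 1)[1] for line in porcelain_lines if line.strip()]
    return any(not f.endswith(ignored_suffixes) for f in changed)
```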
* Fix lint issues.
* Fix lint import issue.
* This is the first step to doing training and serving using a PV as opposed
to GCS.
* This will make the sample easier to run anywhere, and in particular on Katacoda.
* This currently works as follows:
  The user creates a PVC:
    ks apply ${ENV} -c data-pvc
  The user runs a K8s job to download the data to the PVC:
    ks apply ${ENV} -c data-downloader
In subsequent PRs we will update the train and serve steps to load the
model from the PVC as opposed to GCS.
Related to #91
* Add setup scripts & github token param
* Clarify instructions
Add pointers to resolution for common friction points of new cluster
setup: GitHub rate limiting and RBAC permissions
Setup persistent disk before Jupyterhub so that it is only setup once
Clarify instructions about copying trained model files locally
Add version number to frontend image build
Add github_token ks parameter for frontend
* Change port to 8080
Fix indentation of bullet points
* Fix var name & link spacing
* Update description of serving script
* Use a single ksonnet environment
Move ksonnet app out of notebooks subdirectory
Rename ksonnet app to ks-kubeflow
Update instructions & scripts
Remove instructions to delete ksonnet app directory
* Remove github access token
* Distributed training using tensor2tensor
* Use a transformer model to train the github issue summarization
problem
* Dockerfile for building training image
* ksonnet component for deploying tfjob
Fixes https://github.com/kubeflow/examples/issues/43
* Fix lint issues
* Rename issue_summarization.py to IssueSummarization.py
* The module name is supposed to be the same as the class name
* Fix the predict method signature
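A minimal shape of such a model class, assuming the Seldon Core python wrapper convention that the module and class share a name; the body is a placeholder, not the real seq2seq model:

```python
# IssueSummarization.py -- the Seldon wrapper imports the class by the
# module's name, so the file and class must both be IssueSummarization.
class IssueSummarization(object):
    def __init__(self):
        # the real class would load the trained seq2seq model here
        self.model = None

    def predict(self, X, feature_names):
        # X is a batch of issue bodies; return one summary per input
        return [["placeholder summary"] for _ in X]
```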
* Fix lint
* Updates to the demo docs (notebook and readme), which were outdated in multiple places
* Removed unused tools/ dir
* Update main readme to reference the example
* Inclusion of kubeflow vendor/ tf-job code
* Illustrates how logging and rendering to an attached volume can simplify the process of viewing logs with TensorHub and exploring
render outputs.
* Storage in user space allows it to be symlinked into the directory tree
watched by the TensorHub extension (which is running tensorboard --logdir=/home/jovyan)
* I expect this approach of controlling NFS volume mounts through
ksonnet to be replaced by doing so with Python, as demonstrated in
the enhance example, so I wouldn't lose sleep over the ksonnet
prototypes in this commit.
* Add awscli tools container.
* Add initial readme.
* Add argo skeleton.
* Run an argo job.
* Artifact support and argo test
* Use built container (#3)
* Fix artifacts and secrets
* Add work in progress tfflow (#14)
* Add kvc deployment to workflow.
* Switch aws repo.
* wip.
* Add working tfflow job.
* Add sidecar that waits for MASTER completion
* Pass in job-name
* Add volumemanager info step
* Add input parameters to step
* Adds nodeaffinity and hostpath
* Add fixes for workflow (#17)
- Use correct images for worker and ps
- Use correct aws keys
- Change volumemanager to mnist
- Comment unused steps
- Fix volume mount to correct containers
* Fix hostpath for tfjob
* Download all mnist files
* Added GCS-stored artifact compatibility to Argo
* Add initial inference workflow. (#30)
* Initial serving step (#31)
* Adds fixes to initial serving step
* Ready for rough demo: Workflow in working state
* Move conflicting readme.
* Initial commit, everything boots without crashing.
* Working, with some python errors.
* Adding explicit flags
* Working with ins-outs
* Letting training job exit on success
* Adding documentation skeleton
* trying to properly save model
* Almost working
* Working
* Adding export script, refactored to allow model more reusability
* Starting documentation
* little further on docs
* More doc updates, fixing sleep logic
* adding urls for mnist data
* Removing download logic; it's too tied in with built-in TF examples.
* Added argo workflow instructions, minor cleanups.
* Adding mnist client.
* Fixing typos
* Adding instructions for installing components.
* Added ksonnet container
* Adding new entrypoint.
* Added helm install instructions for kvc
* doing things with variables
* Typos.
* Added better namespace support
* S3 refactor.
* Added missing region variables.
* Adding tensorboard support.
* Adding container for TensorBoard.
* Added temporary flag, added install instructions for CLI.
* Removing invalid ksonnet environment.
* Updating readme
* Cleanup currently unused pieces
* Add missing cluster-role
* Minor cleanup.
* Adding more parameters.
* Added changes to allow the model to train on multiple workers and fixed some doc typos
* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.
* Adding separate deployer workflow.
* Split serving working.
* Adding split workflow.
* More parameters.
* Updates per Elson's comments
* Revert "added changes to allow model to train on multiple workers and fixed s…"
* Initial working pure-s3 workflow.
* Removed wait sidecars.
* Remove unused flag.
* Added part two, minor doc fixes
* Inverted links...
* Adding diff.
* Fix url syntax
* Documentation updates.
* Added AWS Cli
* Parameterized export.
* Fixing image in s3 version.
* Fixed documentation issues.
* KVC snippet changes, need to find last working helm chart.
* Temporarily pinning kvc version.
* Working master model and some doc typo fixes (#13)
* Added changes to allow the model to train on multiple workers and fixed some doc typos
* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.
* Adding separate deployer workflow.
* Split serving working.
* Adding split workflow.
* More parameters.
* Updates per Elson's comments
* Working master model and some doc typo fixes
* Fixes per Elson's comments
* Removing whitespace differences
* updating diff
* Changing parameters.
* Undoing whitespace.
* Changing termination policy on s3 version due to unknown issue.
* Updating mnist diff.
* Changing train steps.
* Syncing Demo changes.
* Update README.md
* Going S3-native for initial example. Getting rid of Master.
* Minor documentation tweaks, adding params, swapping aws cli for minio.
* Updating KVC version.
* Switching ksonnet repo, removing model name from client.
* Updating git url.
* Adding certificate hack to avoid RBAC errors.
* Pinning KVC to commit while working on PR.
* Updating version.
* Updates README with additional details (#14)
* Updates README with additional details
* Adding clarity to kubectl config commands
* Fixed comma placement
* Refactoring notes for github and kubernetes credentials.
* Forgot to add an overview of the argo template.
* Updating example based on feedback.
- Removed superfluous images
- Clarified use of KVC
- Added unaltered model
- Variable cleanup
* Refactored grpc image into generic base image.
* minor cleanup of resubmitting section.
* Switching Argo deployment to ksonnet, consolidating install instructions.
* Removing old cruft, clarifying cluster requirements.
* [WIP] Switching out model (#15)
* Switching to new mnist example.
* Parameterized model, testing export.
* Got CNN model exporting.
* Attempting to do distributed training with Estimator, removed separate export.
* Adding master back, otherwise Estimator complains about not having a chief.
* Switching to tf.estimator.train_and_evaluate.
* Minor path/var name refactor.
* Adding test data and new client.
* Fixed documentation to reflect new client.
* Getting rid of tf job shim.
* Removing KVC from example, renaming directory
* Modifying parent README
* Removed reference to export.
* Adding reference to export.
* Removing unused Dockerfile.
* Removing unneeded files, simplifying how to get status, refactoring the model serving workflow step.
* Renaming directory
* Minor doc improvements, removed extra clis.
* Making SSL configurable for clusters without secured s3 endpoints.
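The SSL toggle can be sketched as below. TensorFlow's S3 filesystem honors `S3_USE_HTTPS`-style environment variables, but the exact variable names and endpoint handling used by this workflow are assumptions:

```python
import os

# Sketch of making SSL configurable for clusters whose S3 endpoint is
# plain HTTP (e.g. an in-cluster minio). Variable names are assumptions.
def s3_endpoint_url():
    use_https = os.environ.get("S3_USE_HTTPS", "1") == "1"
    host = os.environ.get("S3_ENDPOINT", "s3.amazonaws.com")
    return ("https://" if use_https else "http://") + host
```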
* Added a tf-user account for workflow. Fixed serving bug.
* Updating gke version.
* Re-ran through instructions, fixed errata.
* Fixing lint issues
* Pylint errors
* Pylint errors
* Adding parenthesis back.
* pylint Hacks
* Disabling argument filter; the model bombs without an empty arg.
* Removing unneeded lambdas
* Github Issue Summarization - Train using TFJob
* Create a Dockerfile to build the image for tf-job
* Create a manifest to deploy the tf-job
* Create instructions on how to do all of this
Fixes https://github.com/kubeflow/examples/issues/43
* Address comments
* Add gcloud commands
* Add ks app
* Update Dockerfile base image
* Python train.py fixes
* Remove tfjob.yaml as it is replaced by ksonnet app
* Remove plot_model_history as it is not required for tfjob training
* Don't change WORKDIR
* Address reviewer comments
* Fix links
* Fix lint issues using yapf
* Sort imports
* Add .pylintrc
* Resolve lint complaints in agents/trainer/task.py
* Resolve lint complaints with flask app.py
* Resolve linting issues
Remove duplicate seq2seq_utils.py from workflow/workspace/src
* Use python 3.5.2 with pylint to match prow
Put pybullet import back into agents/trainer/task.py with a pylint ignore statement
Use main(_) to ensure it works with tf.app.run
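The tf.app.run convention referred to above looks like this (TensorFlow 1.x; the tf call is left in a comment so the sketch runs without TensorFlow installed):

```python
import sys

def main(_):
    # tf.app.run() invokes main with one positional argument (the
    # remainder of argv after flag parsing), so main must accept it
    # even when unused -- hence the conventional name "_".
    return 0

# With TensorFlow 1.x the entry point would be:
#   if __name__ == "__main__":
#       tf.app.run(main)
# tf.app.run parses flags and then calls main(remaining_argv).
main(sys.argv)
```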
* Add barebones frontend
Add instructions for querying the trained model via a simple frontend
deployed locally.
* Add instructions for running the ui in-cluster
TODO: Resolve ksonnet namespace collisions for deployed-service
prototype
* Remove reference to running trained model locally
Update the issue summarization end to end tutorial
to deploy the seldon core model to the k8s cluster
Update the sample request and response
Related to https://github.com/kubeflow/examples/issues/11
* Add file copy instructions after training
Fix broken link in cluster setup
Fix broken env variable in Training notebook
Change notebook name from Tutorial to Training
* Fix app selector value
* Fix folder link
* Add detail to cluster setup instructions
Add a link to the image for this example.
In Tutorial.ipynb, move mounted directory into a variable to help avoid collisions on shared clusters.
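Namespacing the mounted directory per user, as a hedged sketch of the idea; the base path and fallback user name are illustrative, not the values used in Tutorial.ipynb:

```python
import os

# On a shared cluster, deriving the working directory from the user
# name keeps notebooks from clobbering each other's files. The base
# path and "jovyan" fallback below are assumptions for illustration.
MOUNT_BASE = "/mnt/github-issues-data"
WORK_DIR = "{}/{}".format(MOUNT_BASE, os.environ.get("USER", "jovyan"))
```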
* Create an end-to-end kubeflow example using seq2seq model (4/n)
* Move from a custom tornado server to a seldon-core model
Related to #11
* Update to use gcr.io registry for serving image
- Previously instructed users to build demo container via doc/Dockerfile.
- Since rendering isn't working in the notebook, and the remaining dependencies are now available in the base tensorflow-notebook container, building a custom container isn't necessary.
- Users are now instructed to run the base tensorflow-notebook-cpu container and clone the example code with git.
- The git clone command refers to https://github.com/kubeflow/examples instead of the URL of this fork, so the docs will be incorrect in that regard until this is merged into master. Until then we can optionally add an instruction to switch branches.