* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo
* I think it makes sense to centralize all the code in a single place.
* Update the controller program (git-issue-summarize-demo.go) so that we can
specify the Docker image containing the training code.
* Create a ksonnet deployment for running the controller on the cluster.
* The HP tuning job isn't functional; here's an incomplete list of issues:
* The training jobs launched fail because they don't have GCP credentials
so they can't download the data.
* We don't actually extract and report metrics back to Katib.
Related to: kubeflow/katib#116
* Add component parameters
Add model_url & port arguments to flask app
Add service_type, image, and model_url parameters to ui component
Fix problem argument in tensor2tensor component
* Fix broken UI component
Fix broken UI component structure by adding all, service, & deployment parts
Add parameter defaults for tfjob to resolve failures deploying other components
* Add missing imports in flask app
Fix syntax error in argument parsing
Remove underscores from parameter names to work around ksonnet bug #554: https://github.com/ksonnet/ksonnet/issues/554
* Fix syntax errors in t2t instructions
Add CPU image build arg to docker build command for t2t-training
Fix link to ksonnet app dir
Correct param names for tensor2tensor component
Add missing params for tensor2tensor component
Fix apply command syntax
Swap out log view pod for t2t-master instead of tf-operator
Fix link to training with tfjob
* Fix model file upload
Update default params for tfjob-v1alpha2
Fix build directory path in Makefile
* Resolve lint issues
Lines too long
* Add specific image tag to tfjob-v1alpha2 default
* Fix defaults for training output files
Update image tag
Add UI image tag
* Revert service account secret details
Update associated readme
* Update the Docker image for T2T to use a newer version of T2T library
* Add parameters to set the GCP secret; we need GCP credentials to
read from GCS even if reading a public bucket. We default
to the parameters that are created automatically in the case of a GKE
deployment.
* Create a v1alpha2 template for the job that uses PVC.
* Update the GH summarization example to Kubeflow 0.2 and TFJob v1alpha2.
* Upgrade the ksonnet app to Kubeflow 0.2 rc.1
* Add the examples package.
* Add a .gitignore file and ignore all environments so that we won't pick
up people's testing environments.
* Add tfjob-v1alpha2 component; this trains the Keras model using
TFJob v1alpha2.
* Update the parameters so that we use the GCP secrets created as part
of the Kubeflow deployment.
* Remove jlewi environment.
* Verified that training ran successfully and outputted a model to GCS
* There was an error about some missing arguments to a logging statement
but this can be ignored although it would be good to fix.
* Started working on T2T v1alpha2. Seems to be messing up the app.
* Update the v1alpha2 template for the tensor2tensor job, but it looks like
there is an error:
2018-06-29 17:45:23,369] Found unknown flag: --problem=github_issue_summarization_problem
Traceback (most recent call last):
File "/home/jovyan/.conda/bin/t2t-trainer", line 32, in <module>
tf.app.run()
File "/home/jovyan/.conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/home/jovyan/.conda/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 334, in main
exp_fn = create_experiment_fn()
File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 158, in create_experiment_fn
problem_name=get_problem_name(),
File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 115, in get_problem_name
problems = FLAGS.problems.split("-")
AttributeError: 'NoneType' object has no attribute 'split'
* edit TF example readme
* prefix tutorial steps with a number for nicer display in repo
* fix typo
* edit steps 4 and 5
* edit docs
* add navigation and formatting edits to example
* Improvements to the tensor2tensor trainer for the GitHub summarization example.
* Simplify the launcher; we can just pass through most command line arguments
rather than using a mix of environment variables and command line arguments.
* This makes it easier to control the job just by setting the parameters in the template
rather than having to rebuild the images.
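A minimal sketch of the pass-through idea described above, not the launcher's actual code. It assumes a launcher that consumes only its own flags and forwards everything else to the trainer unchanged; `--trainer_binary` and `build_trainer_command` are hypothetical names introduced for illustration.

```python
import argparse


def build_trainer_command(argv):
    """Split launcher-only flags from the flags forwarded to the trainer."""
    # allow_abbrev=False keeps argparse from treating a trainer flag as an
    # abbreviation of a launcher flag.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    # Hypothetical launcher-only flag; not taken from the original launcher.
    parser.add_argument("--trainer_binary", default="t2t-trainer")
    launcher_args, passthrough = parser.parse_known_args(argv)
    # Anything the launcher does not recognize is forwarded as-is.
    return [launcher_args.trainer_binary] + passthrough


if __name__ == "__main__":
    print(build_trainer_command(["--data_dir=/data", "--train_steps=100"]))
```

With this shape, changing a training flag only requires editing the arguments in the job template; the image does not need to know about the flag.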
* Add a Makefile to build the image.
* Replace the tensor2tensor jsonnet with a newer version of the jsonnet used with T2T.
* Address reviewer comments.
* Install pip packages as user Jovyan
* Rely on implicit string conversion with concatenation in template file.
* Add a component to run TensorBoard.
* Autoformat jsonnet file.
* Set a default of "" for logDir; there's not a really good default location
because it will depend on where the data is stored.
* Make it easier to demo serving and run in Katacoda
* Allow the model path to be specified via environment variables so that
we could potentially load the model from PVC.
* Continue to bake the model into the image so that we don't need to train
in order to serve.
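A sketch of how the environment-variable override with a baked-in fallback could look, under stated assumptions: the variable name `MODEL_FILE`, the default path, and the function name are all hypothetical and not taken from the serving code.

```python
import os

# Hypothetical path of the model baked into the serving image.
DEFAULT_MODEL_PATH = "/model/seq2seq_model.h5"


def resolve_model_path(environ=None):
    """Return the model path from the environment, else the baked-in default."""
    environ = os.environ if environ is None else environ
    # MODEL_FILE is an assumed variable name; an unset or empty value falls
    # back to the model shipped inside the image.
    return environ.get("MODEL_FILE") or DEFAULT_MODEL_PATH


if __name__ == "__main__":
    print(resolve_model_path())
```

Keeping the baked-in default means the image can serve without a prior training run, while a PVC-mounted model can be selected by setting the variable in the deployment.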
* Parameterize download_data.sh so we could potentially fetch different sources.
* Update the Makefile so that we can build and set the image for the serving
component.
* Fix lint.
* Update the serving docs.
* Support training using a PVC for the data.
* This will make it easier to run the example on Katacoda and non-GCP platforms.
* Modify train.py so we can use a GCS location or local file paths.
* Update the Dockerfile. The Jupyter Docker images had a bunch of
dependencies removed, and the latest images don't have the dependencies
needed to run the examples.
* Create a tfjob-pvc component that trains reading/writing using PVC
and not GCP.
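One way train.py could tell a GCS location from a local file path is to inspect the URL scheme. This is a sketch only; `split_location` is a hypothetical helper and the actual change may be structured differently.

```python
from urllib.parse import urlparse


def split_location(location):
    """Classify a location as GCS or local.

    Returns ("gcs", bucket, object_path) for gs:// URLs and
    ("local", None, path) for everything else.
    """
    parsed = urlparse(location)
    if parsed.scheme == "gs":
        # For gs://bucket/some/object, netloc is the bucket name.
        return ("gcs", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", None, location)


if __name__ == "__main__":
    print(split_location("gs://my-bucket/data/issues.csv"))
    print(split_location("/data/issues.csv"))
```

The caller can then download from GCS only in the first case and read or write the PVC-mounted path directly in the second.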
* Address reviewer comments
* Ignore changes to the ksonnet parameters when determining whether to include
dirty and sha of the diff in the image. This way we can update the
ksonnet app with the newly built image without it leading to subsequent
images being marked dirty.
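The dirty check described above can be sketched as a filter over the list of changed files. This assumes the changed paths come from something like `git diff --name-only` and that the ksonnet parameters live in files named `params.libsonnet`; the function name and the example paths are hypothetical.

```python
def is_dirty(changed_paths, ignored_suffixes=("params.libsonnet",)):
    """Return True if any changed file should mark the image as dirty.

    Files whose path ends with one of ignored_suffixes (assumed here to be
    the ksonnet parameter files) are skipped.
    """
    return any(not path.endswith(ignored_suffixes) for path in changed_paths)


if __name__ == "__main__":
    # Only the ksonnet parameters changed: the image tag stays clean.
    print(is_dirty(["ks-app/components/params.libsonnet"]))
    # A source file changed as well: the image is marked dirty.
    print(is_dirty(["ks-app/components/params.libsonnet", "train.py"]))
```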
* Fix lint issues.
* Fix lint import issue.
* This is the first step to doing training and serving using a PV as opposed
to GCS.
* This will make the sample easier to run anywhere and in particular on Katacoda.
* This would currently work as follows:
User creates a PVC
ks apply ${ENV} -c data-pvc
User runs a K8s job to download the data to PVC
ks apply ${ENV} -c data-downloader
In subsequent PRs we will update the train and serve steps to load the
model from the PVC as opposed to GCS.
Related to #91
* Add setup scripts & github token param
* Clarify instructions
Add pointers to resolution for common friction points of new cluster
setup: GitHub rate limiting and RBAC permissions
Set up persistent disk before Jupyterhub so that it is only set up once
Clarify instructions about copying trained model files locally
Add version number to frontend image build
Add github_token ks parameter for frontend
* Change port to 8080
Fix indentation of bullet points
* Fix var name & link spacing
* Update description of serving script
* Use a single ksonnet environment
Move ksonnet app out of notebooks subdirectory
Rename ksonnet app to ks-kubeflow
Update instructions & scripts
Remove instructions to delete ksonnet app directory
* Remove github access token
* Distributed training using tensor2tensor
* Use a transformer model to train the github issue summarization
problem
* Dockerfile for building training image
* ksonnet component for deploying tfjob
Fixes https://github.com/kubeflow/examples/issues/43
* Fix lint issues
* Rename issue_summarization.py to IssueSummarization.py
* The module name is supposed to be the same as the class name
* Fix the predict method signature
* Fix lint
* Github Issue Summarization - Train using TFJob
* Create a Dockerfile to build the image for tf-job
* Create a manifest to deploy the tf-job
* Create instructions on how to do all of this
Fixes https://github.com/kubeflow/examples/issues/43
* Address comments
* Add gcloud commands
* Add ks app
* Update Dockerfile base image
* Python train.py fixes
* Remove tfjob.yaml as it is replaced by ksonnet app
* Remove plot_model_history as it is not required for tfjob training
* Don't change WORKDIR
* Address reviewer comments
* Fix links
* Fix lint issues using yapf
* Sort imports
* Add .pylintrc
* Resolve lint complaints in agents/trainer/task.py
* Resolve lint complaints with flask app.py
* Resolve linting issues
Remove duplicate seq2seq_utils.py from workflow/workspace/src
* Use python 3.5.2 with pylint to match prow
Put pybullet import back into agents/trainer/task.py with a pylint ignore statement
Use main(_) to ensure it works with tf.app.run
* Add barebones frontend
Add instructions for querying the trained model via a simple frontend
deployed locally.
* Add instructions for running the ui in-cluster
TODO: Resolve ksonnet namespace collisions for deployed-service
prototype
* Remove reference to running trained model locally
Update the issue summarization end to end tutorial
to deploy the seldon core model to the k8s cluster
Update the sample request and response
Related to https://github.com/kubeflow/examples/issues/11
* Add file copy instructions after training
Fix broken link in cluster setup
Fix broken env variable in Training notebook
Change notebook name from Tutorial to Training
* Fix app selector value
* Fix folder link
* Add detail to cluster setup instructions
Add a link to the image for this example.
In Tutorial.ipynb, move mounted directory into a variable to help avoid collisions on shared clusters.
* Create an end-to-end kubeflow example using seq2seq model (4/n)
* Move from a custom tornado server to a seldon-core model
Related to #11
* Update to use gcr.io registry for serving image