* Remove modules from .pylintrc
* Add lint inline exceptions
* Add lint inline exceptions for all checks, as the specific exception is not available in Pylint 1.8
* Fix string formatting in logging message and remove unnecessary Pylint exception
* Update app.yaml with correct environment details
* Add build and test presubmit jobs for Pytorch MNIST example
Keep postsubmit jobs as original release job to push images to examples registry
* Refactor all jobs to follow the mnist and GIS examples; release jobs will be dropped
* Implement test scripts and Ksonnet artifacts from mnist example to enable E2E tests
* Remove release components as they are no longer used
* Refactor YAML manifests as Ksonnet components
* Update documentation to submit training jobs from Ksonnet
* Update to point to correct component and refactor to PytorchJob
* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint
* Comment out calls to tf-util until
https://github.com/kubeflow/pytorch-operator/issues/108 is implemented
* Refactor to PytorchJob
* Rename workflow to avoid DNS issue with "_"
* Add TODO note to convert to GRPC
* Fix path to build Seldon image in Makefile
* Fix tabs in Makefile
* Fix rule in Makefile
* Add sleep in Makefile to wait for docker ps
* Change node worker image to have docker
* Remove seldon image step from Makefile
Add steps to wrap model with Seldon
Add boolean flag to build Seldon steps
* Add step id build- in jsonnet
* Skip pull step for Seldon
* Fix waitFor in Seldon build
* Fix lint errors
* Set useimagecache to false first time the pipeline is executed to avoid error
* Set contextDir as absolute path for Seldon step
* Remove unnecessary argument and Dockerfile in Seldon step
* Add absolute path for build in Seldon steps
* Include absolute path inside jsonnet hardcoded to GCB /workspace/
Remove setting rootDir from Makefile
* Update images with new naming from E2E tests
* Change test-worker image version
* Set useimagecache to true now that we have first images built
* Fix cachelist in Seldon build
* Leverage tf-operator test framework for test_runner
As per https://github.com/kubeflow/pytorch-operator/issues/108
* Consolidate testing imports
Rename testing package as per https://github.com/kubeflow/tf-operator/pull/945
Add correct path to import test framework from tf-operator
* Add test framework in PYTHONPATH in build_template
* Remove old release jobs to build images
* Update stepimage to same as GIS example
* Bump up supported Pytorch operator versions from v1alpha2/v1beta1 to v1beta1/v1beta2 to support Kubeflow 0.5
- Refactor training manifests from v1alpha2 to v1beta2
- Update documents
* Update KF cluster version to latest to run tests
* Update KF cluster zone
* Add pylint exception while importing test_runner class from tf-operator
* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest
* Refactor Python module:
- Replace MPI with Gloo as backend to avoid having to recompile Pytorch
- Replace DistributedDataParallel() class with official version when using GPUs
- Remove unnecessary method to disable logs in workers
- Refactor run()
* Simplify Dockerfile by using the official Pytorch 0.4 image with CUDA and remove mpirun call
* Add job_mnist_DDP_CPU for v1beta1
* Add job_mnist_DDP_GPU for v1beta1
* Update 02_distributed_training.md to use v1beta1
* Remove pytorch v1alpha2 config
* Add missing CPU training config
* Fix link to next section, training the model
* Add links to next and previous sections in training the model README
* Fix link to previous section, training the model
* Remove TODO list
* Default to model trained with CPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Check out the 1.0rc1 release, as MPI backend detection seems to be broken on the latest Pytorch master
* Track changes in pytorch_mnist/training/ddp/mnist folder to trigger test jobs
* Repoint to pull images from gcr.io/kubeflow-ci built during pre-submit
* Fix webui image name
* Fix logging
* Add GCFS to CPU train
* Default to model trained with GPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Fix Predict() method as Seldon expects 3 arguments
* Fix x reference
* Add Pytorch MNIST example
* Fix link to Pytorch MNIST example
* Fix indentation in README
* Fix lint errors
Add prediction proto files
* Add build_image.sh script to build image and push to gcr.io
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Fix lint errors
* Add autogenerated PB2 files to the Pylint ignore list
* Fix lint errors
* Add official Pytorch DDP examples to the Pylint ignore list
* Fix lint errors
* Update component to web-ui release
* Update mount point to kubeflow-gcfs as the example is GCP specific
* Complete 01_setup_a_kubeflow_cluster document
* Test release job while PR is WIP
* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"
* Fix extra_repos to pull worker image
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix extra_repo, only needs kubeflow/testing
* Set build_image.sh executable
* Update build_image.sh from CentralDashboard component
* Remove old reference to centraldashboard in echo message
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Add releases for the training and serving images
* Fix path to Seldon-wrapper build_image.sh
* Fix image name in ksonnet parameter
* Add 02 distributed training documentation
* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation
* Add 05 teardown documentation
* Add section to test the model is deployed correctly in 03 serving the model
* Add 04 querying the model documentation
* Fix ks-app to ks_app
* Set prow jobs back to postsubmit
* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public
* Change to kubeflow-ci project
* Increase timeout limit during image build to compile Pytorch
* Change build machine type to compile Pytorch for training image
* Add OWNERS file to Pytorch example
* Fix typo in documentation
* Remove checking docker daemon as we are using gcloud build instead
* Use logging module rather than print()
* Remove empty file, replace with .gitignore to keep tmp folder
* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest
* Refactor ks-app to ks_app
* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui
* Remove default environment from ksonnet application
* Update documentation to use ksonnet application
* Fix component name in documentation
* Consolidate Pytorch train module and build_image.sh script
* Consolidate Pytorch train module
* Revert back build_image.sh scripts
* Remove duplicates
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Fix docker build command
* Fix image name for cpu and gpu train