
6 Commits

Author SHA1 Message Date
David Sabater Dinter 7a6dc7b911 [pytorch_mnist] Automate image build (#490)
* Add build and test presubmit jobs for the Pytorch MNIST example
Keep postsubmit jobs as the original release job to push images to the examples registry

* Refactor all jobs to follow the mnist and GIS examples; release jobs will be dropped

* Implement test scripts and Ksonnet artifacts from mnist example to enable E2E tests

* Remove release components as they are no longer used

* Refactor YAML manifests as Ksonnet components

* Update documentation to submit training jobs from Ksonnet

* Updated to point to correct component and refactor to PytorchJob

* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint

* Commented out calls to tf-util until
https://github.com/kubeflow/pytorch-operator/issues/108 is implemented

* Refactor to PytorchJob

* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint

* Refactor to PytorchJob

* Rename workflow to avoid dns issue with "_"

* Add TODO note to convert to GRPC

* Rename workflow to avoid dns issue with "_"

* Rename workflow to avoid dns issue with "_"

* Fix path to build Seldon image in Makefile

* Fix tabs in Makefile

* Fix tabs in Makefile

* Fix rule in Makefile

* Add sleep in Makefile to wait for docker ps

* Change node worker image to have docker

* Remove seldon image step from Makefile
Add steps to wrap model with Seldon
Add boolean flag to build Seldon steps

* Add step id build- in jsonnet

* Skip pull step for Seldon

* Fix wait for in Seldon build

* Fix lint errors

* Set useimagecache to false the first time the pipeline is executed to avoid an error

* Set contextDir as absolute path for Seldon step

* Remove unnecessary argument and Dockerfile in Seldon step

* Add absolute path for build in Seldon steps

* Include absolute path inside jsonnet hardcoded to GCB /workspace/
Remove setting rootDir from Makefile

* Update images with new naming from E2E tests

* Change test-worker image version

* Update images with new naming from E2E tests

* Set useimagecache to true now that we have first images built

* Fix cachelist in Seldon build

* Fix cachelist in Seldon build

* Leverage tf-operator test framework for test_runner
As per https://github.com/kubeflow/pytorch-operator/issues/108

* Consolidate testing imports
Rename testing package as in https://github.com/kubeflow/tf-operator/pull/945
Added correct path to import test framework from tf-operator

* Add test framework to PYTHONPATH in build_template

* Remove old release jobs to build images

* Update stepimage to match the GIS example

* Bump up supported Pytorch operator versions from v1alpha2/v1beta1 to v1beta1/v1beta2 to support Kubeflow 0.5
- Refactor training manifests from v1alpha2 to v1beta2
- Update documents

* Update KF cluster version to latest to run tests

* Update KF cluster zone

* Add pylint exception while importing test_runner class from tf-operator

* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest

* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest
2019-06-14 16:20:09 -07:00
Craig Sterrett 838ad79898 Fixed typo in README and one bad link
Two small fixes I ran into when trying the example. One is a typo: it says it's displaying an 8 when it is a 7. The other was a bad link.
2019-02-15 11:14:23 -08:00
Hung-Ting Wen c83ed09a77 revert back removed v1alpha2 yaml manifests (#475)
* revert back removed v1alpha2 yaml manifests

* Add documentation

* Fix format
2019-01-14 17:08:29 -08:00
Hung-Ting Wen 4dda73afbf Update pytorch_mnist example to use v1beta1 (#445)
* Add job_mnist_DDP_CPU for v1beta1

* Add job_mnist_DDP_GPU for v1beta1

* Update 02_distributed_training.md to use v1beta1

* Remove pytorch v1alpha2 config

* Add missing CPU training config
2019-01-09 05:27:35 -08:00
David Sabater Dinter 38daafa0c3 [mnist_pytorch] Update documentation (#463)
* Fix link to next section, training the model

* Added links to next and previous sections in training the model README

* Fix link to previous section, training the model

* Remove TODO list
2019-01-08 15:32:51 -08:00
David Sabater Dinter a402db1ccc E2E Pytorch mnist example (#274)
* Add Pytorch MNIST example

* Fix link to Pytorch MNIST example

* Fix indentation in README

* Fix lint errors

* Fix lint errors
Add prediction proto files

* Add build_image.sh script to build image and push to gcr.io

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Fix lint errors

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Add PB2 autogenerated files to ignore with Pylint

* Fix lint errors

* Add official Pytorch DDP examples to ignore with Pylint

* Fix lint errors

* Update component to web-ui release

* Update mount point to kubeflow-gcfs as the example is GCP specific

* 01_setup_a_kubeflow_cluster document complete

* Test release job while PR is WIP

* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"

* Fix extra_repos to pull worker image

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix extra_repo, only needs kubeflow/testing

* Set build_image.sh executable

* Update build_image.sh from CentralDashboard component

* Remove old reference to centraldashboard in echo message

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Add releases for the training and serving images

* Add releases for the training and serving images

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix path to Seldon-wrapper build_image.sh

* Fix image name in ksonnet parameter

* Add 02 distributed training documentation

* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation

* Add 05 teardown documentation

* Add section to test that the model is deployed correctly in 03 serving the model

* Add 04 querying the model documentation

* Fix ks-app to ks_app

* Set prow jobs back to postsubmit

* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public

* Change to kubeflow-ci project

* Increase timeout limit during image build to compile Pytorch

* Increase timeout limit during image build to compile Pytorch

* Change build machine type to compile Pytorch for training image

* Change build machine type to compile Pytorch for training image

* Add OWNERS file to Pytorch example

* Fix typo in documentation

* Remove checking docker daemon as we are using gcloud build instead

* Use logging module rather than print()

* Remove empty file, replace with .gitignore to keep tmp folder

* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest

* Refactor ks-app to ks_app

* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui

* Remove default environment from ksonnet application

* Update documentation to use ksonnet application

* Fix component name in documentation

* Consolidate Pytorch train module and build_image.sh script

* Consolidate Pytorch train module

* Consolidate Pytorch train module

* Consolidate Pytorch train module and build_image.sh script

* Revert back build_image.sh scripts

* Remove duplicates

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud

* Fix docker build command

* Fix docker build command

* Fix image name for cpu and gpu train

* Consolidate Pytorch train module

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
2018-11-18 14:24:43 -08:00