examples/pytorch_mnist
David Sabater Dinter 7a6dc7b911 [pytorch_mnist] Automate image build (#490)
* Add build and test presubmit jobs for Pytorch mnist example
Keep postsubmit jobs as original release job to push images to examples registry

* Refactor all jobs like mnist and GIS, will drop using release jobs

* Implement test scripts and Ksonnet artifacts from mnist example to enable E2E tests

* Remove release components as they are no longer used

* Refactor YAML manifests as Ksonnet components

* Update documentation to submit training jobs from Ksonnet

* Updated to point to correct component and refactor to PytorchJob

* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint

* Commented out calls to tf-util until
https://github.com/kubeflow/pytorch-operator/issues/108 is implemented

* Refactor to PytorchJob

* Rename workflow to avoid dns issue with "_"

* Add TODO note to convert to GRPC

* Fix path to build Seldon image in Makefile

* Fix tabs in Makefile

* Fix rule in Makefile

* Add sleep in Makefile to wait for docker ps

* Change node worker image to have docker

* Remove seldon image step from Makefile
Add steps to wrap model with Seldon
Add boolean flag to build Seldon steps

* Add step id build- in jsonnet

* Skip pull step for Seldon

* Fix wait for in Seldon build

* Fix lint errors

* Set useimagecache to false first time the pipeline is executed to avoid error

* Set contextDir as absolute path for Seldon step

* Remove unnecessary argument and Dockerfile in Seldon step

* Add absolute path for build in Seldon steps

* Include absolute path inside jsonnet hardcoded to GCB /workspace/
Remove setting rootDir from Makefile

* Update images with new naming from E2E tests

* Change test-worker image version

* Set useimagecache to true now that we have first images built

* Fix cachelist in Seldon build

* Leverage tf-operator test framework for test_runner
As per https://github.com/kubeflow/pytorch-operator/issues/108

* Consolidate testing imports
Rename testing package as https://github.com/kubeflow/tf-operator/pull/945
Added correct path to import test framework from tf-operator

* Add test framework in PYTHONPATH in build_template

* Remove old release jobs to build images

* Update stepimage to same as GIS example

* Bump up supported Pytorch operator versions from v1alpha2/v1beta1 to v1beta1/v1beta2 to support Kubeflow 0.5
- Refactor training manifests from v1alpha2 to v1beta2
- Update documents

* Update KF cluster version to latest to run tests

* Update KF cluster zone

* Add pylint exception while importing test_runner class from tf-operator

* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest

2019-06-14 16:20:09 -07:00
ks_app [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
serving/seldon-wrapper Fixed some outdated comments to trigger pushing web-ui and model serve images to gcr.io/kubeflow-examples (#444) 2018-12-26 15:05:42 -08:00
testing [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
training/ddp/mnist [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
web-ui Fixed some outdated comments to trigger pushing web-ui and model serve images to gcr.io/kubeflow-examples (#444) 2018-12-26 15:05:42 -08:00
01_setup_a_kubeflow_cluster.md [mnist_pytorch] Update documentation (#463) 2019-01-08 15:32:51 -08:00
02_distributed_training.md [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
03_serving_the_model.md [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
04_querying_the_model.md E2E Pytorch mnist example (#274) 2018-11-18 14:24:43 -08:00
05_teardown.md E2E Pytorch mnist example (#274) 2018-11-18 14:24:43 -08:00
Dockerfile.ksonnet [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
Makefile [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
OWNERS E2E Pytorch mnist example (#274) 2018-11-18 14:24:43 -08:00
README.md Fixed typo in README and one bad link 2019-02-15 11:14:23 -08:00
image_build.jsonnet [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00
ksonnet-entrypoint.sh [pytorch_mnist] Automate image build (#490) 2019-06-14 16:20:09 -07:00

README.md

End-to-End Kubeflow tutorial using a PyTorch model in Google Cloud

This example demonstrates how to use Kubeflow end-to-end to train and serve a distributed PyTorch model on a Kubernetes cluster in GCP. This tutorial is based on the projects below:

Goals

There are two primary goals for this tutorial:

  • Demonstrate an End-to-End Kubeflow example
  • Present an End-to-End PyTorch model

By the end of this tutorial, you should learn how to:

  • Set up a Kubeflow cluster on a new Kubernetes deployment
  • Provision shared persistent storage across the cluster to store models
  • Train a distributed model using PyTorch and GPUs on the cluster
  • Serve the model using Seldon Core
  • Query the model from a simple front-end application

The model and the data

This tutorial trains a PyTorch model on the MNIST dataset, which is the "hello world" of machine learning.

The MNIST dataset contains a large number of images of hand-written digits in the range 0 to 9, as well as the labels identifying the digit in each image.

After training, the model classifies incoming images into 10 categories (0 to 9) based on what it has learned about handwritten images. In other words, you send an image to the model, and the model does its best to identify the digit shown in the image.

In the above screenshot, the image shows a hand-written 7. The table below the image shows a bar graph for each classification label from 0 to 9. Each bar represents the probability that the image matches the respective label. Looks like it's pretty confident this one is a 7!
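To make the "probability per label" idea concrete, here is a minimal sketch in plain Python. It assumes the model produces one raw score per digit and that a softmax turns those scores into probabilities. The scores and the function names `softmax` and `classify` are hypothetical and are not taken from this example's code.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return (predicted digit, probability per digit) for 10 scores."""
    if len(scores) != 10:
        raise ValueError("expected one score per digit 0-9")
    probs = softmax(scores)
    digit = max(range(10), key=lambda d: probs[d])
    return digit, probs


# Hypothetical scores for an image of a hand-written 7 (assumption, not real model output).
example_scores = [0.1, 0.3, 0.2, 0.5, 0.1, 0.0, 0.2, 6.0, 0.4, 0.9]
digit, probs = classify(example_scores)

print(f"predicted digit: {digit}")
for d, p in enumerate(probs):
    # Text version of the bar graph: one bar per label.
    print(f"{d}: {'#' * round(p * 40)} {p:.3f}")
```

The longest bar corresponds to the label the model considers most likely, which is what the web front-end visualizes.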

Steps:

  1. Set up a Kubeflow cluster
  2. Distributed Training using DDP and PyTorchJob
  3. Serving the model
  4. Querying the model
  5. Teardown
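
Step 2 relies on data-parallel training: each worker trains on a different slice of the data, and the workers' gradients are averaged so that every replica applies the same update. The sketch below illustrates that idea in plain Python only. It is not the training code from this example and does not call PyTorch; `shard_indices` and `average_gradients` are hypothetical helpers, and the wrap-around padding is an assumption made so that every worker gets the same number of samples.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices that worker `rank` processes in one epoch."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Pad by wrapping around so every worker gets the same number of samples.
    per_worker = -(-num_samples // world_size)  # ceiling division
    total = per_worker * world_size
    padded = [i % num_samples for i in range(total)]
    # Worker `rank` takes every world_size-th index, starting at its rank.
    return padded[rank::world_size]


def average_gradients(per_worker_grads):
    """Element-wise mean of the gradient vectors computed by each worker."""
    count = len(per_worker_grads)
    return [sum(values) / count for values in zip(*per_worker_grads)]


# Hypothetical setup: 10 samples split across 4 workers.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"worker {rank} -> samples {shard}")

# Hypothetical two-parameter gradients from two workers.
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))
```

In the real tutorial this bookkeeping is handled by the training framework; the sketch only shows why each worker sees different data while all replicas stay in sync.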