End-to-end Kubeflow tutorial using a PyTorch model on Google Cloud

This example demonstrates how to use Kubeflow end-to-end to train and serve a distributed PyTorch model on a Kubernetes cluster on GCP. This tutorial is based on the following projects:

Goals

There are two primary goals for this tutorial:

  • Demonstrate an end-to-end Kubeflow example
  • Present an end-to-end PyTorch model

By the end of this tutorial, you should learn how to:

  • Set up a Kubeflow cluster on a new Kubernetes deployment
  • Spin up shared persistent storage across the cluster to store models
  • Train a distributed model using PyTorch and GPUs on the cluster (a minimal training sketch follows this list)
  • Serve the model using Seldon Core
  • Query the model from a simple front-end application
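
The distributed training step wraps the model in PyTorch's DistributedDataParallel and uses the GLOO backend, which avoids rebuilding PyTorch with MPI support. The sketch below is a simplified illustration rather than the repository's actual training script: the Net architecture and hyperparameters are made up, and it assumes the PyTorchJob injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each worker's environment.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel
from torchvision import datasets, transforms


class Net(nn.Module):
    """Tiny illustrative MNIST classifier: 784 pixels -> 10 digit scores."""

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        return self.fc2(F.relu(self.fc1(x)))


def main():
    # GLOO backend: works on CPU and GPU without an MPI-enabled PyTorch build.
    # Rendezvous uses the env vars set by the PyTorchJob (assumed here).
    dist.init_process_group(backend="gloo")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Wrap the model so gradients are averaged across all workers.
    model = DistributedDataParallel(Net().to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset = datasets.MNIST("./data", train=True, download=True,
                             transform=transforms.ToTensor())
    # Each worker sees a distinct shard of the dataset.
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        output = F.log_softmax(model(images.to(device)), dim=1)
        loss = F.nll_loss(output, labels.to(device))
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    main()
```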

The model and the data

This tutorial trains a PyTorch model on the MNIST dataset, which is the "hello world" of machine learning.

The MNIST dataset contains a large number of images of hand-written digits in the range 0 to 9, as well as the labels identifying the digit in each image.

After training, the model classifies incoming images into 10 categories (0 to 9) based on what it has learned about handwritten digits. In other words, you send an image to the model, and the model does its best to identify the digit shown in the image.

In the screenshot above, the image shows a handwritten 8. The table below the image shows a bar graph for each classification label from 0 to 9. Each bar represents the probability that the image matches that label. It looks like the model is pretty confident this one is an 8!
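
Those bars are just the model's 10 output scores turned into probabilities. The snippet below is purely illustrative; `model` and `image` are placeholders for a trained classifier and a single preprocessed MNIST image, not objects defined in this repository.

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    logits = model(image)                  # shape [1, 10]: one raw score per digit
    probs = F.softmax(logits, dim=1)[0]    # probabilities that sum to 1
    prediction = int(torch.argmax(probs))  # the most likely digit, e.g. 8

for digit, p in enumerate(probs.tolist()):
    print(f"{digit}: {p:.3f}")             # the values behind each bar in the graph
```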

Steps:

  1. Set up a Kubeflow cluster
  2. Distributed training using DDP and PyTorchJob
  3. Serving the model
  4. Querying the model (see the request sketch after this list)
  5. Teardown
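
Once the model is served through Seldon Core (step 3), querying it (step 4) amounts to an HTTP request against the Seldon prediction endpoint. The sketch below is only a rough illustration: the host, deployment name, and URL path are placeholders, and the exact endpoint depends on how Seldon is exposed in your cluster, so follow 04_querying_the_model.md for the real values.

```python
import requests

# Placeholders: replace with your ingress/Ambassador address and Seldon deployment name.
host = "http://localhost:8080"
deployment = "mnist-classifier"

# A 28x28 image flattened to 784 pixel values (zeros here as a stand-in).
payload = {"data": {"ndarray": [[0.0] * 784]}}

resp = requests.post(
    f"{host}/seldon/{deployment}/api/v0.1/predictions",
    json=payload,
)
# The response carries the per-digit probabilities produced by the model.
print(resp.json())
```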