# A simple GPU-accelerated ResNet Kubeflow pipeline
## Overview
This example demonstrates simple end-to-end training and deployment of a Keras ResNet model on the CIFAR-10 dataset, using the following technologies:
- NVIDIA-Docker2 to make Docker containers GPU-aware.
- NVIDIA device plugin to allow Kubernetes to access GPU nodes.
- TensorFlow 19.03 containers from the NVIDIA GPU Cloud (NGC) container registry.
- TensorRT to optimize the inference graph to FP16 precision, leveraging dedicated Tensor Cores for inference (see the sketch after this list).
- TensorRT Inference Server for serving the trained model.
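
As a rough illustration of the FP16 optimization step, below is a minimal TF-TRT conversion sketch using the TF 1.x `tensorflow.contrib.tensorrt` API that ships in NGC containers of this era; the frozen-graph path and output node name are hypothetical placeholders, not names taken from this repo.

```python
# Minimal TF-TRT FP16 conversion sketch (TF 1.x contrib API, as in 19.03-era containers).
# The frozen-graph path and output node name are hypothetical placeholders.
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

with tf.gfile.GFile("resnet_frozen.pb", "rb") as f:  # hypothetical frozen graph
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Rewrite supported subgraphs into TensorRT FP16 engines that run on Tensor Cores.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["softmax_tensor"],        # hypothetical output node name
    max_batch_size=8,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16",
)

with tf.gfile.GFile("resnet_trt_fp16.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
```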
## System Requirements
- Ubuntu 16.04 or later
- NVIDIA GPU
## Quickstart
- Install NVIDIA Docker, Kubernetes, and Kubeflow on your local machine (on your first run):

  ```bash
  sudo ./install_kubeflow_and_dependencies.sh
  ```
- Build the Docker image of each pipeline component and compile the Kubeflow pipeline:
  - First, make sure the `IMAGE` variable in `build.sh` in each component dir under the `components` dir points to a public container registry
  - Then, make sure the `image` used in each `ContainerOp` in `pipeline/src/pipeline.py` matches `IMAGE` from the step above (a minimal pipeline sketch follows this step)
  - Then, make sure the `image` of the webapp Deployment in `components/webapp_launcher/src/webapp-service-template.yaml` matches `IMAGE` in `components/webapp/build.sh`
  - Then run:

    ```bash
    sudo ./build_pipeline.sh
    ```

  - Note the `pipeline.py.tar.gz` file that appears in your working directory
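
For orientation, here is a minimal sketch of what a `ContainerOp`-based pipeline definition and its compilation look like with the kfp SDK; the pipeline name, image URI, and arguments are hypothetical placeholders rather than the actual contents of `pipeline/src/pipeline.py`.

```python
# Minimal Kubeflow Pipelines sketch (kfp SDK); names, image, and args are hypothetical.
import kfp.dsl as dsl
import kfp.compiler as compiler


@dsl.pipeline(name="resnet-cifar10", description="Train and serve a ResNet on CIFAR-10")
def resnet_pipeline(epochs: int = 50):
    # The image here must match the IMAGE pushed by the component's build.sh.
    train = dsl.ContainerOp(
        name="train",
        image="docker.io/your-registry/train:latest",  # hypothetical registry/tag
        arguments=["--epochs", epochs],
    )
    # GPU steps request an NVIDIA GPU through the device plugin.
    train.set_gpu_limit(1)


if __name__ == "__main__":
    # Produces the pipeline.py.tar.gz package you upload in the Kubeflow UI.
    compiler.Compiler().compile(resnet_pipeline, "pipeline.py.tar.gz")
```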
- Determine the ambassador port:

  ```bash
  sudo kubectl get svc -n kubeflow ambassador
  ```
- Open the Kubeflow UI in your browser at `http://localhost:<ambassador-port>/`
- Click on the Pipeline Dashboard tab, upload the `pipeline.py.tar.gz` file you just compiled, and create a run; you can also submit the run from the kfp SDK, as sketched below
- Training takes about 20 minutes for 50 epochs; a web UI is deployed as part of the pipeline so users can interact with the served model
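
If you prefer to script the upload-and-run step instead of using the UI, the kfp SDK can submit the compiled package directly. A sketch, assuming the Pipelines API is proxied at `/pipeline` behind the ambassador port; host, port, and run name are placeholders, and the client API may vary slightly across kfp versions.

```python
# Submit the compiled package programmatically (host, port, and names are hypothetical).
import kfp

# The Pipelines API is typically proxied at /pipeline behind the ambassador port.
client = kfp.Client(host="http://localhost:31380/pipeline")  # hypothetical port

client.create_run_from_pipeline_package(
    "pipeline.py.tar.gz",
    arguments={},                 # pipeline parameters, if any
    run_name="resnet-cifar10-run",
)
```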
- Access the client web UI deployed by the pipeline
- Now you can test the trained model with random images and obtain class predictions and probability distributions
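
Under the hood, the web UI forwards these requests to the TensorRT Inference Server. As a rough illustration, here is a hedged sketch of a direct query using the v1 `tensorrtserver` Python client; the server address, model name, and tensor names are hypothetical, and the client API differs across server releases.

```python
# Hypothetical direct query to the TensorRT Inference Server (v1 Python client).
# Server address, model name, and tensor names are placeholders, not values from this repo.
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

ctx = InferContext(
    "localhost:8000",                  # hypothetical server address
    ProtocolType.from_str("http"),
    "resnet_cifar10",                  # hypothetical model name
    -1,                                # latest model version
)

# One CIFAR-10-shaped image with random pixels, matching the demo's random-image test.
image = np.random.rand(32, 32, 3).astype(np.float32)

# Ask for the top-10 classes to recover the full probability distribution.
result = ctx.run(
    {"input": [image]},                                  # hypothetical input tensor name
    {"probs": (InferContext.ResultFormat.CLASS, 10)},    # hypothetical output tensor name
    batch_size=1,
)
print(result["probs"][0])  # (class_index, probability, label) tuples
```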
## Cleanup
The following optional scripts clean up your cluster (useful for debugging):
- Delete deployments & services from previous runs:

  ```bash
  sudo ./clean_utils/delete_all_previous_resources.sh
  ```
- Uninstall Minikube and Kubeflow:

  ```bash
  sudo ./clean_utils/remove_minikube_and_kubeflow.sh
  ```