# Kubeflow demo - Simple pipeline
## Hyperparameter tuning and autoprovisioning GPU nodes
This folder contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.
This demo highlights the use of pipelines and hyperparameter tuning on a GKE
cluster with node autoprovisioning (NAP). A simple pipeline requests GPU resources, which triggers
node pool creation. This demo includes the following steps:
1. [Set up your environment](#1-set-up-your-environment)
1. [Run a simple pipeline](#2-run-a-simple-pipeline)
1. [Perform hyperparameter tuning](#3-perform-hyperparameter-tuning)
1. [Run a better pipeline](#4-run-a-better-pipeline)
## 1. Set up your environment
Follow the instructions in
[demo_setup/README.md](https://github.com/kubeflow/examples/blob/master/demos/simple_pipeline/demo_setup/README.md)
to set up your environment and install Kubeflow with pipelines on an
autoprovisioning GKE cluster.
View the installed components in the GCP Console.
* In the
[Kubernetes Engine](https://console.cloud.google.com/kubernetes)
section, you will see a new cluster ${CLUSTER} with 3 `n1-standard-1` nodes.
* Under
[Workloads](https://console.cloud.google.com/kubernetes/workload),
you will see all the default Kubeflow and pipeline components.
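The same checks can be made from the command line (a quick sketch, assuming `kubectl` is already pointed at `${CLUSTER}`; the grep pattern is only illustrative):
```
# The cluster starts with 3 n1-standard-1 nodes
kubectl get nodes
# Spot-check the default Kubeflow and pipeline components
kubectl get pods --all-namespaces | grep -E 'ml-pipeline|katib|modeldb'
```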
Source the environment file and activate the conda environment for pipelines:
```
source kubeflow-demo-simple-pipeline.env
source activate kfp
```
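To confirm the SDK is usable in the activated environment, a quick check (assuming the pipelines SDK was installed as the `kfp` Python package during setup):
```
# Both commands should succeed inside the "kfp" conda environment
python -c "import kfp"
pip show kfp
```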
## 2. Run a simple pipeline
Show the file `gpu-example-pipeline.py` as an example of a simple pipeline.
Compile it to create a .tar.gz file:
```
./gpu-example-pipeline.py
```
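If compilation succeeds, the packaged workflow should appear next to the script:
```
# This is the archive uploaded to the pipelines UI in the next step
ls -l gpu-example-pipeline.py.tar.gz
```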
View the pipelines UI locally by forwarding a port to the ml-pipeline-ui pod:
```
PIPELINES_POD=$(kubectl get po -l app=ml-pipeline-ui | \
grep ml-pipeline-ui | \
head -n 1 | \
cut -d " " -f 1 )
kubectl port-forward ${PIPELINES_POD} 8080:3000
```
In the browser, navigate to `localhost:8080` and create a new pipeline by
uploading `gpu-example-pipeline.py.tar.gz`. Select the pipeline and click
_Create experiment_. Use all suggested defaults.
View the effects of autoprovisioning by observing the number of nodes increase.
Select _Experiments_ from the left-hand side, then _Runs_. Click on the
experiment run to view the graph and watch it execute.
View the container logs for the training step and take note of the low accuracy (~0.113).
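Logs are easiest to read in the pipelines UI, but they can also be pulled with `kubectl` (a sketch; the pod name is a placeholder because step pods get generated names, and the `main` container name assumes the default Argo workflow layout):
```
# Find the pod that ran the training step (adjust the grep to your run name)
kubectl get pods | grep gpu-example
# Fetch its logs; replace the placeholder with the actual pod name
kubectl logs <training-step-pod-name> -c main
```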
## 3. Perform hyperparameter tuning
In order to determine parameters that result in higher accuracy, use Katib
to execute a Study, which defines a search space for performing training with a
range of different parameters.
Create a Study by applying an
[example file](https://github.com/kubeflow/examples/blob/master/demos/simple_pipeline/gpu-example-katib.yaml)
to the cluster:
```
kubectl apply -f gpu-example-katib.yaml
```
This creates a StudyJob object. To view it:
```
kubectl get studyjob
kubectl describe studyjobs gpu-example
```
To view the Katib UI, connect to the modeldb-frontend pod:
```
KATIB_POD=$(kubectl get po -l app=modeldb,component=frontend | \
grep modeldb-frontend | \
head -n 1 | \
cut -d " " -f 1 )
kubectl port-forward ${KATIB_POD} 8081:3000
```
In the browser, navigate to `localhost:8081/katib` and click on the
gpu-example project. In the _Explore Visualizations_ section, select
_Optimizer_ in the _Group By_ dropdown, then click _Compare_.
While you're waiting, watch for autoprovisioning to occur: the GPU-requesting pods sit in Pending status until new nodes come up.
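One way to list them (a minimal sketch; `--field-selector` requires a reasonably recent `kubectl`):
```
# Pods waiting on GPU capacity show up as Pending
kubectl get pods --field-selector=status.phase=Pending
```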
View the creation of a new GPU node pool:
```
gcloud container node-pools list --cluster ${CLUSTER}
```
View the creation of new nodes:
```
kubectl get nodes
```
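To confirm that the new nodes actually expose GPU capacity (GKE GPU nodes normally advertise the `nvidia.com/gpu` resource):
```
# Each GPU node should list nvidia.com/gpu under Capacity/Allocatable
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```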
In the Katib UI, interact with the various graphs to determine which
combination of parameters results in the highest accuracy. Grouping by optimizer
type is one way to find consistently higher accuracies. Gather a set of
parameters to use in a new run of the pipeline.
## 4. Run a better pipeline
In the pipelines UI, clone the previous experiment run and update the arguments
to match the parameters for one of the runs with higher accuracies from the
Katib UI. Execute the pipeline and watch for the resulting accuracy, which
should be closer to 0.98.
Approximately 5 minutes after the last run completes, check the cluster nodes
to verify that GPU nodes have disappeared.
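The same commands used in step 3 can confirm the scale-down:
```
# The autoprovisioned GPU node pool and its nodes should be gone
gcloud container node-pools list --cluster ${CLUSTER}
kubectl get nodes
```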