# Kubeflow demo - Yelp restaurant reviews

This repository contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.

The base demo includes the following steps:

1. [Set up your environment](#1-set-up-your-environment)
1. [Run training on CPUs](#2-run-training-on-cpus)
1. [Run training on TPUs](#3-run-training-on-tpus)
1. [Create the serving and UI components](#4-create-the-serving-and-ui-components)
1. [Bring up a notebook](#5-bring-up-a-notebook)
1. [Run a simple pipeline](#6-run-a-simple-pipeline)
1. [Perform hyperparameter tuning](#7-perform-hyperparameter-tuning)
1. [Run a better pipeline](#8-run-a-better-pipeline)
1. [Cleanup](#9-cleanup)

## 1. Set up your environment

Follow the instructions in
[demo_setup/README.md](https://github.com/kubeflow/examples/blob/master/demos/yelp_demo/demo_setup/README.md)
to set up your environment and install Kubeflow with pipelines on an
auto-provisioning GKE cluster with support for GPUs and TPUs.

_Note: This was tested using the_
_[v0.3.4-rc.1](https://github.com/kubeflow/kubeflow/tree/v0.3.4-rc.1)_
_branch with a cherry-pick of_
_[#1955](https://github.com/kubeflow/kubeflow/pull/1955)._

View the installed components in the GCP Console:

* In the
[Kubernetes Engine](https://console.cloud.google.com/kubernetes)
section, you will see a new cluster ${CLUSTER} with 3 `n1-standard-1` nodes.
* Under
[Workloads](https://console.cloud.google.com/kubernetes/workload),
you will see all the default Kubeflow and pipeline components.

Source the environment file and activate the conda environment for pipelines:

```
source kubeflow-demo-base.env
source activate kfp
```
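
For reference, `kubeflow-demo-base.env` is what defines the `${CLUSTER}`,
`${DEMO_REPO}`, `${HOST}`, and GCS output variables used throughout this demo.
A minimal sketch of the kind of contents it is assumed to hold (the values
below, including the bucket path, are placeholders; the real file comes from
the setup instructions):

```
# Hypothetical sketch of kubeflow-demo-base.env; the actual file is produced
# during demo setup. Only variables referenced later in this README are shown.
export CLUSTER=kubeflow-demo                    # GKE cluster name
export DEMO_REPO=${HOME}/kubeflow-examples      # local clone of this repo
export GCS_TRAINING_OUTPUT_DIR_CPU=gs://your-bucket/t2tcpu/output
export GCS_TRAINING_OUTPUT_DIR_TPU=gs://your-bucket/t2ttpu/output
export HOST=your-bastion-host                   # for the optional SSH tunnel below
```
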
## 2. Run training on CPUs

Navigate to the ksonnet app directory created by `kfctl` and retrieve the
following files for the t2tcpu & t2ttpu jobs:

```
cd ks_app
cp ${DEMO_REPO}/demo/components/t2t*pu.* components
cp ${DEMO_REPO}/demo/components/params.* components
```

Set parameter values for training:

```
ks param set t2tcpu outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_CPU}
```
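
To confirm the parameter took effect before applying anything, `ks param list`
prints the component's current parameter values (an optional sanity check):

```
ks param list t2tcpu
```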

Generate manifests and apply them to the cluster:

```
ks apply default -c t2tcpu
```
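
To inspect the manifests ksonnet generates without touching the cluster, use
`ks show` with the same `default` environment (optional):

```
ks show default -c t2tcpu
```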

View the new training pod and wait until it has a `Running` status:

```
kubectl get pod -l tf_job_name=t2tcpu
```
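
Rather than re-running the command by hand, the `-w` flag watches the pod as it
moves from `Pending` to `Running`:

```
kubectl get pod -l tf_job_name=t2tcpu -w
```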

View the logs to watch training commence:

```
kubectl logs -f t2tcpu-master-0 | grep INFO:tensorflow
```

## 3. Run training on TPUs

Set parameter values for training:

```
ks param set t2ttpu outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_TPU}
```

Kick off training:

```
ks apply default -c t2ttpu
```

Verify that a TPU is being provisioned by viewing the pod status. It should
remain in the `Pending` state for 3-4 minutes with the message
`Creating Cloud TPUs for pod default/t2ttpu-master-0`.

```
kubectl describe pod t2ttpu-master-0
```
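
To watch just the phase change instead of repeatedly describing the pod, a
small polling loop works (a convenience sketch, not part of the original demo):

```
# Print the pod phase every 15 seconds until it leaves Pending
while true; do
  kubectl get pod t2ttpu-master-0 -o jsonpath='{.status.phase}'; echo
  sleep 15
done
```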

Once it has a `Running` status, view the logs to watch training commence:

```
kubectl logs -f t2ttpu-master-0 | grep INFO:tensorflow
```

## 4. Create the serving and UI components

Retrieve the following files for the serving & UI components:

```
cp ${DEMO_REPO}/demo/components/serving.* components
cp ${DEMO_REPO}/demo/components/ui.* components
```

Create the serving and UI components:

```
ks apply default -c serving -c ui
```

Connect to the UI by forwarding a port to the ambassador service:

```
kubectl port-forward svc/ambassador 8080:80
```
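
Since later steps reuse this terminal, it can be convenient to run the
port-forward in the background instead (optional):

```
kubectl port-forward svc/ambassador 8080:80 >/dev/null 2>&1 &
```

Stop it later with `kill %1` or by closing the shell.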

Optional: if necessary, set up an SSH tunnel from your local laptop to the
compute instance connecting to GKE:

```
ssh ${HOST} -L 8080:localhost:8080
```

To show the naive version, navigate to
[localhost:8080/kubeflow_demo/](http://localhost:8080/kubeflow_demo/) in a browser.

To show the ML version, navigate to
[localhost:8080/kubeflow_demo/kubeflow](http://localhost:8080/kubeflow_demo/kubeflow) in a browser.

## 5. Bring up a notebook

Open a browser and connect to the Central Dashboard at
[localhost:8080/](http://localhost:8080/).
Show the TF-job dashboard, then click on JupyterHub.
Log in with any username and password combination and wait until the page
refreshes. Spawn a new pod with these resource requirements:

| Resource              | Value                                                                |
| --------------------- | -------------------------------------------------------------------- |
| Image                 | `gcr.io/kubeflow-images-public/tensorflow-1.7.0-notebook-gpu:v0.2.1` |
| CPU                   | 2                                                                    |
| Memory                | 48G                                                                  |
| Extra Resource Limits | `{"nvidia.com/gpu":2}`                                               |

It will take a while for the pod to spawn. While you're waiting, watch for
autoprovisioning to occur. View the Workload and Node status in the GCP console.

Once the notebook environment is available, open a new terminal and upload this
[Yelp notebook](notebooks/yelp.ipynb).

Ensure the kernel is set to Python 2, then execute the notebook.
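
To confirm that the two requested GPUs are visible inside the notebook pod, run
`nvidia-smi` from the terminal you opened (this assumes the GPU image and node
setup expose the NVIDIA utilities):

```
nvidia-smi
```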

## 6. Run a simple pipeline

Show the file `gpu-example-pipeline.py` as an example of a simple pipeline.

Compile it to create a .tar.gz file:

```
./gpu-example-pipeline.py
```
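
Running the script directly works because the pipeline file compiles itself
when executed. The KFP SDK's `dsl-compile` tool offers an equivalent path
(shown as an alternative; assumes the `kfp` conda environment from step 1 is
active):

```
dsl-compile --py gpu-example-pipeline.py --output gpu-example-pipeline.py.tar.gz
```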

View the pipelines UI locally by forwarding a port to the ml-pipeline-ui
service:

```
kubectl port-forward svc/ml-pipeline-ui 8081:80
```

In the browser, navigate to `localhost:8081` and create a new pipeline by
uploading `gpu-example-pipeline.py.tar.gz`. Select the pipeline and click
_Create experiment_. Use all suggested defaults.

View the effects of autoprovisioning by observing the number of nodes increase.

Select _Experiments_ from the left-hand side, then _Runs_. Click on the
experiment run to view the graph and watch it execute.

View the container logs for the training step and take note of the low
accuracy (~0.113).

## 7. Perform hyperparameter tuning

To determine which parameters result in higher accuracy, use Katib to execute
a Study, which defines a search space for performing training with a range of
different parameters.

Create a Study by applying an
[example file](https://github.com/kubeflow/katib/blob/master/examples/gpu-example.yaml)
to the cluster:

```
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/gpu-example.yaml
```

This creates a StudyJob object. To view it:

```
kubectl get studyjob
kubectl describe studyjobs gpu-example
```
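
The StudyJob's progress is recorded in the object's status, so dumping it as
YAML is a quick way to check on trials:

```
kubectl get studyjob gpu-example -o yaml
```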

To view the Katib UI, forward a port to the katib-ui service:

```
kubectl port-forward svc/katib-ui 8082:80
```

In the browser, navigate to `localhost:8082/katib` and click on the
gpu-example project. In the _Explore Visualizations_ section, select
_Optimizer_ in the _Group By_ dropdown, then click _Compare_.

View the creation of a new GPU node pool:

```
gcloud container node-pools list --cluster ${CLUSTER}
```

View the creation of new nodes:

```
kubectl get nodes
```
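
To list only the auto-provisioned GPU nodes, filter on the accelerator label
that GKE attaches to them (label name assumed from standard GKE behavior):

```
kubectl get nodes -l cloud.google.com/gke-accelerator
```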

In the Katib UI, interact with the various graphs to determine which
combination of parameters results in the highest accuracy. Grouping by
optimizer type is one way to find consistently higher accuracies. Gather a set
of parameters to use in a new run of the pipeline.

## 8. Run a better pipeline

In the pipelines UI, clone the previous experiment run and update the arguments
to match the parameters for one of the higher-accuracy runs from the Katib UI.
Execute the pipeline and watch for the resulting accuracy, which should be
closer to 0.98.

Approximately 5 minutes after the last run completes, check the cluster nodes
to verify that the GPU nodes have disappeared.

## 9. Cleanup

From the application directory created by `kfctl`, issue a cleanup command:

```
kfctl delete k8s
```
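
To verify the teardown, re-run the checks from step 7 and watch the
auto-provisioned node pools drain away:

```
gcloud container node-pools list --cluster ${CLUSTER}
kubectl get nodes
```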

The cluster will scale back down to the default node pool, removing all nodes
created by node auto-provisioning (NAP).