+++
title = "Hyperparameter Tuning (Katib)"
description = "Using Katib to tune your model's hyperparameters on Kubernetes"
weight = 5
+++
The [Katib](https://github.com/kubeflow/katib) project is inspired by
[Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf).
Katib is a scalable and flexible hyperparameter tuning framework that is tightly
integrated with Kubernetes. It does not depend on any specific deep learning
framework (such as TensorFlow, MXNet, or PyTorch).

## Installing Katib
To run Katib jobs, you must install the required packages as shown in this
section.
In your ksonnet application's root directory, run the following commands:
```
export KF_ENV=default
ks env set ${KF_ENV} --namespace=kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
```
The `KF_ENV` environment variable represents a conceptual deployment environment
such as development, test, staging, or production, as defined by
ksonnet. For this example, we use the `default` environment.
You can read more about Kubeflow's use of ksonnet in the Kubeflow
[ksonnet component guide](/docs/components/ksonnet/).
### TFJob (tf-operator)
To install a TensorFlow job operator, run the following commands:
```
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/common
ks generate tf-job-operator tf-job-operator
ks apply ${KF_ENV} -c tf-job-operator
```
### PyTorch operator
To install a PyTorch job operator, run the following commands:
```
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${KF_ENV} -c pytorch-operator
```
### Katib
Then run the following commands to install Katib:
```
ks pkg install kubeflow/katib
ks generate katib katib
ks apply ${KF_ENV} -c katib
```
If you want to use Katib outside Google Kubernetes Engine (GKE) and you don't
have a StorageClass for dynamic volume provisioning in your cluster, you must
create a persistent volume (PV) for Katib's persistent volume claim (PVC) to
bind to.
This is the YAML file for the PV:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
```
After deploying the Katib package, run the following command to create the PV:
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```
## Running examples
After deploying Katib and the operators, you can run the following examples.
### Example using random algorithm
You can create a StudyJob for Katib by defining a StudyJob config file. See the
[random algorithm example](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/random-example.yaml
```
Running this command launches a StudyJob. The StudyJob runs a series of
training jobs that train models with different hyperparameters, and saves the
results.
The configuration for the study (hyperparameter feasible space, optimization
parameter, optimization goal, suggestion algorithm, and so on) is defined in
[random-example.yaml](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).
In this demo, hyperparameters are passed to the training container as
command-line arguments.
You can pass hyperparameters in another way (for example, as environment
variables) by changing the template defined in
`WorkerSpec.GoTemplate.RawTemplate`.
The template is written in [Go template](https://golang.org/pkg/text/template/) format.
This demo randomly generates values for three hyperparameters:

* Learning rate (`--lr`) - type: double
* Number of neural network layers (`--num-layers`) - type: int
* Optimizer (`--optimizer`) - type: categorical

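
On the training side, a script receives these hyperparameters as ordinary command-line flags. The following is a minimal sketch of such an entry point, assuming Python's `argparse`. The function name `parse_hyperparameters` and the default values are illustrative assumptions and are not taken from the Katib example image; the flag names and optimizer choices follow the StudyJob config.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the three hyperparameters that the random example passes as args."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Flag names match the parameter names in the StudyJob config.
    # The defaults are assumptions made for this sketch only.
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--num-layers", type=int, default=2, help="number of layers")
    parser.add_argument(
        "--optimizer",
        choices=["sgd", "adam", "ftrl"],
        default="sgd",
        help="optimizer name",
    )
    return parser.parse_args(argv)


# Arguments in the "name=value" form that the worker template produces.
example_argv = ["--lr=0.02", "--num-layers=3", "--optimizer=adam"]
args = parse_hyperparameters(example_argv)
# argparse exposes --num-layers as the attribute num_layers.
print(args.lr, args.num_layers, args.optimizer)
```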

Check the study status:

```
$ kubectl -n kubeflow describe studyjobs random-example
Name: random-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: <none>
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
  Creation Timestamp: 2019-01-18T16:30:46Z
  Finalizers:
    clean-studyjob-data
  Generation: 5
  Resource Version: 1777650
  Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
  UID: 687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Metricsnames:
    accuracy
  Objectivevaluename: Validation-accuracy
  Optimizationgoal: 0.88
  Optimizationtype: maximize
  Owner: crd
  Parameterconfigs:
    Feasible:
      Max: 0.03
      Min: 0.01
    Name: --lr
    Parametertype: double
    Feasible:
      Max: 5
      Min: 2
    Name: --num-layers
    Parametertype: int
    Feasible:
      List:
        sgd
        adam
        ftrl
    Name: --optimizer
    Parametertype: categorical
  Requestcount: 4
  Study Name: random-example
  Suggestion Spec:
    Request Number: 3
    Suggestion Algorithm: random
    Suggestion Parameters:
      Name: SuggestionCount
      Value: 0
  Worker Spec:
    Go Template:
      Raw Template: apiVersion: batch/v1
kind: Job
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: {{.WorkerID}}
        image: katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Status:
  Condition: Running
  Early Stopping Parameter Id:
  Last Reconcile Time: 2019-01-18T16:30:46Z
  Start Time: 2019-01-18T16:30:46Z
  Studyid: y456536bd1e0ad5e
  Suggestion Count: 1
  Suggestion Parameter Id: i31c2adcab54f891
  Trials:
    Trialid: ka897d189e024460
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: ma76ebe2b23fec02
    Trialid: v9ec0edbb16befd7
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: yc5053df337dbeec
    Trialid: be68860be22cfce3
    Workeridlist:
      Completion Time: <nil>
      Condition: Running
      Kind: Job
      Start Time: 2019-01-18T16:30:46Z
      Workerid: v095e6b93d87e9eb
Events: <none>
```
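
To make the output above more concrete, the sketch below draws random values from the feasible space listed under `Parameterconfigs` and turns them into `name=value` container arguments, much as the `range` loop in the raw template does. This is an illustration only: it assumes uniform, independent sampling, and it is not Katib's actual suggestion service. The names `FEASIBLE_SPACE`, `sample_trial`, and `to_container_args` are hypothetical.

```python
import random

# Feasible space copied from the Parameterconfigs section of the output above.
FEASIBLE_SPACE = [
    {"name": "--lr", "type": "double", "min": 0.01, "max": 0.03},
    {"name": "--num-layers", "type": "int", "min": 2, "max": 5},
    {"name": "--optimizer", "type": "categorical", "list": ["sgd", "adam", "ftrl"]},
]


def sample_trial(space, rng):
    """Draw one value per parameter, uniformly from its feasible range."""
    trial = []
    for param in space:
        if param["type"] == "double":
            value = rng.uniform(param["min"], param["max"])
        elif param["type"] == "int":
            value = rng.randint(param["min"], param["max"])  # both bounds inclusive
        elif param["type"] == "categorical":
            value = rng.choice(param["list"])
        else:
            raise ValueError(f"unsupported parameter type: {param['type']}")
        trial.append({"Name": param["name"], "Value": value})
    return trial


def to_container_args(trial):
    """Mimic the template's range loop: one "name=value" argument per parameter."""
    return [f"{p['Name']}={p['Value']}" for p in trial]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):  # the output above shows Request Number: 3
    print(to_container_args(sample_trial(FEASIBLE_SPACE, rng)))
```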
The demo starts a study and runs three trial jobs, each with different
hyperparameter values.
When the `Status.Condition` field in the output changes to *Completed*, the
StudyJob is finished.
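
If you want to check for completion from a script rather than by reading the `describe` output, something like the following sketch could work. It assumes that the StudyJob object, as returned by `kubectl -n kubeflow get studyjobs random-example -o json`, exposes the condition at `status.condition`. That path is inferred from the `describe` output and is an assumption here, as are the helper names `study_condition` and `is_finished`.

```python
import json


def study_condition(studyjob):
    """Return the condition string from a StudyJob object, or None if absent."""
    # Assumed field path: status.condition (inferred, not confirmed by the docs).
    return studyjob.get("status", {}).get("condition")


def is_finished(studyjob):
    """True once the StudyJob reports the Completed condition."""
    return study_condition(studyjob) == "Completed"


# Hypothetical, trimmed-down object for illustration only. A real object would
# come from the JSON output of kubectl.
sample = json.loads('{"kind": "StudyJob", "status": {"condition": "Running"}}')
print(study_condition(sample), is_finished(sample))
```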
### TensorFlow operator example
To run the TensorFlow operator example, you must first create a volume.
If you are using GKE and the default StorageClass, you must create this PVC:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
If you are not using GKE and you don't have a StorageClass for dynamic volume
provisioning in your cluster, you must create both a PVC and a PV:
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```
Now you can run the TensorFlow operator example:
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfjob-example.yaml
```
You can check the status of the study:
```
kubectl -n kubeflow describe studyjobs tfjob-example
```
### PyTorch example
Run the PyTorch operator example:
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
```
You can check the status of the study:
```
kubectl -n kubeflow describe studyjobs pytorchjob-example
```
## Monitoring results
You can monitor your results in the Katib UI. To access the Katib UI, you must
install Ambassador.
In your ksonnet application's root directory, run the following commands:
```
ks generate ambassador ambassador
ks apply ${KF_ENV} -c ambassador
```
Then port-forward the Ambassador service:

* For Kubernetes version 1.9 and later:

    ```
    kubectl port-forward svc/ambassador -n kubeflow 8080:80
    ```

* For Kubernetes version 1.8 and earlier:

    ```
    kubectl get pods -n kubeflow # Find one of the Ambassador pods
    kubectl port-forward [Ambassador pod] -n kubeflow 8080:80
    ```

Now you can access the Katib UI at this URL: `http://localhost:8080/katib/`.

## Cleanup
Delete the installed components:
```
ks delete ${KF_ENV} -c katib
ks delete ${KF_ENV} -c pytorch-operator
ks delete ${KF_ENV} -c tf-job-operator
```
If you created a PV for Katib, delete it:
```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```
If you created a PV and PVC for the TensorFlow operator, delete them:
```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```
If you deployed Ambassador, delete it:
```
ks delete ${KF_ENV} -c ambassador
```
## Metrics collector
Katib has a metrics collector that gathers metrics from the stdout of each
worker. Print each metric on its own line in the following format:
`{metrics name}={value}`. For example, when your objective value name
is `loss` and the metrics are `recall` and `precision`, your training container
should print output like this:
```
epoch 1:
loss=0.3
recall=0.5
precision=0.4
epoch 2:
loss=0.2
recall=0.55
precision=0.5
```
Katib collects every metric value that the worker logs.
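
A training script can produce this format with plain `print` calls. The sketch below builds the lines for each epoch and, purely to illustrate how such lines can be read back, parses them with a regular expression. The parser is not Katib's metrics collector, and the names `format_metrics`, `parse_metrics`, and `METRIC_LINE` are hypothetical.

```python
import re

# One "name=value" pair per line, for example "loss=0.3". Other lines are ignored.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z_][\w.-]*)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def format_metrics(epoch, metrics):
    """Build the stdout lines for one epoch in the name=value format."""
    lines = [f"epoch {epoch}:"]
    lines += [f"{name}={value}" for name, value in metrics.items()]
    return lines


def parse_metrics(lines):
    """Collect every name=value line into {name: [values, ...]}, in log order."""
    collected = {}
    for line in lines:
        match = METRIC_LINE.match(line)
        if match:
            collected.setdefault(match.group(1), []).append(float(match.group(2)))
    return collected


log = format_metrics(1, {"loss": 0.3, "recall": 0.5, "precision": 0.4})
log += format_metrics(2, {"loss": 0.2, "recall": 0.55, "precision": 0.5})
print("\n".join(log))  # the worker's stdout is where Katib looks for metrics
```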