+++
title = "Hyperparameter Tuning (Katib)"
description = "Using Katib to tune your model's hyperparameters on Kubernetes"
weight = 5
+++

The [Katib](https://github.com/kubeflow/katib) project is inspired by
[Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf).
Katib is a scalable and flexible hyperparameter tuning framework that is tightly
integrated with Kubernetes. It does not depend on any specific deep learning
framework (such as TensorFlow, MXNet, or PyTorch).

## Installing Katib

To run Katib jobs, you must install the required packages as shown in this
section.

In your ksonnet application's root directory, run the following commands:

```
export KF_ENV=default
ks env set ${KF_ENV} --namespace=kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
```

The `KF_ENV` environment variable represents a conceptual deployment environment
such as development, test, staging, or production, as defined by
ksonnet. For this example, we use the `default` environment.

You can read more about Kubeflow's use of ksonnet in the Kubeflow
[ksonnet component guide](/docs/components/ksonnet/).

### TFJob (tf-operator)

To install a TensorFlow job operator, run the following commands:

```
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/common
ks generate tf-job-operator tf-job-operator
ks apply ${KF_ENV} -c tf-job-operator
```

### PyTorch operator

To install a PyTorch job operator, run the following commands:

```
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${KF_ENV} -c pytorch-operator
```

### Katib

Then run the following commands to install Katib:

```
ks pkg install kubeflow/katib
ks generate katib katib
ks apply ${KF_ENV} -c katib
```

If you want to use Katib outside Google Kubernetes Engine (GKE) and you don't
have a StorageClass for dynamic volume provisioning in your cluster, you must
create a persistent volume (PV) that your persistent volume claim (PVC) can
bind to.

This is the YAML file for a PV:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
```

After deploying the Katib package, run the following command to create the PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```

## Running examples

After deploying everything, you can run some examples.

### Example using random algorithm

You can create a StudyJob for Katib by defining a StudyJob config file. See the
[random algorithm example](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/random-example.yaml
```

Running this command launches a StudyJob. The StudyJob runs a series of
training jobs to train models using different hyperparameters and saves the
results.

The configuration for the study (hyperparameter feasible space, optimization
parameter, optimization goal, suggestion algorithm, and so on) is defined in
[random-example.yaml](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).

In this demo, hyperparameters are passed to the training container as
command-line arguments. You can embed hyperparameters in another way (for
example, as environment variables) by changing the template defined in
`WorkerSpec.GoTemplate.RawTemplate`.
The template is written in [Go template](https://golang.org/pkg/text/template/) format.

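To make the two embedding styles concrete, here is a minimal Python sketch of what the demo's template does: it appends one `name=value` argument per hyperparameter to the container command. The sample values, the helper names `render_args` and `render_env`, and the environment variable naming scheme are assumptions made for illustration; in a real study Katib renders the Go template itself.

```python
# Hypothetical hyperparameter values; in a real study Katib supplies them.
hyperparameters = [
    {"name": "--lr", "value": "0.02"},
    {"name": "--num-layers", "value": "3"},
    {"name": "--optimizer", "value": "adam"},
]


def render_args(params):
    """Mimic the template's range loop: one "name=value" argument per entry."""
    return [f'{p["name"]}={p["value"]}' for p in params]


def render_env(params):
    """Alternative embedding: turn each flag into an environment variable.

    The naming scheme (strip dashes, upper-case) is an assumption of this sketch.
    """
    env = {}
    for p in params:
        key = p["name"].lstrip("-").replace("-", "_").upper()
        env[key] = p["value"]
    return env


# Fixed part of the command, as used by the demo's worker template.
base_command = [
    "python",
    "/mxnet/example/image-classification/train_mnist.py",
    "--batch-size=64",
]

print(base_command + render_args(hyperparameters))
print(render_env(hyperparameters))
```

Passing values as arguments keeps the training script unchanged if it already parses flags, while environment variables may suit containers whose entrypoint you cannot modify.
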
This demo randomly generates three hyperparameters:

* Learning rate (`--lr`) - type: double
* Number of neural network layers (`--num-layers`) - type: int
* Optimizer (`--optimizer`) - type: categorical

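As a rough illustration of random search over these three parameters, the following Python sketch draws one value per parameter from the feasible space used by the demo (learning rate between 0.01 and 0.03, two to five layers, and one of `sgd`, `adam`, or `ftrl`). It is not Katib's suggestion service; the `FEASIBLE_SPACE` structure, the `sample_trial` helper, and the treatment of bounds as inclusive are assumptions of this sketch.

```python
import random

# Feasible space from the demo; the dictionary layout is an assumption.
FEASIBLE_SPACE = [
    {"name": "--lr", "type": "double", "min": 0.01, "max": 0.03},
    {"name": "--num-layers", "type": "int", "min": 2, "max": 5},
    {"name": "--optimizer", "type": "categorical", "list": ["sgd", "adam", "ftrl"]},
]


def sample_trial(space, rng):
    """Draw one value per parameter, uniformly within its feasible range."""
    trial = {}
    for param in space:
        if param["type"] == "double":
            trial[param["name"]] = rng.uniform(param["min"], param["max"])
        elif param["type"] == "int":
            # randint includes both end points.
            trial[param["name"]] = rng.randint(param["min"], param["max"])
        elif param["type"] == "categorical":
            trial[param["name"]] = rng.choice(param["list"])
        else:
            raise ValueError(f"unsupported parameter type: {param['type']}")
    return trial


rng = random.Random(0)  # seeded so that repeated runs give the same draws
for _ in range(3):
    print(sample_trial(FEASIBLE_SPACE, rng))
```

Each printed dictionary corresponds to the set of arguments that one training job would receive.
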
Check the study status:

```
$ kubectl -n kubeflow describe studyjobs random-example
Name:         random-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-01-18T16:30:46Z
  Finalizers:
    clean-studyjob-data
  Generation:        5
  Resource Version:  1777650
  Self Link:         /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
  UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Metricsnames:
    accuracy
  Objectivevaluename:  Validation-accuracy
  Optimizationgoal:    0.88
  Optimizationtype:    maximize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      Max:          0.03
      Min:          0.01
    Name:           --lr
    Parametertype:  double
    Feasible:
      Max:          5
      Min:          2
    Name:           --num-layers
    Parametertype:  int
    Feasible:
      List:
        sgd
        adam
        ftrl
    Name:           --optimizer
    Parametertype:  categorical
  Requestcount:     4
  Study Name:       random-example
  Suggestion Spec:
    Request Number:        3
    Suggestion Algorithm:  random
    Suggestion Parameters:
      Name:   SuggestionCount
      Value:  0
  Worker Spec:
    Go Template:
      Raw Template:  apiVersion: batch/v1
kind: Job
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: {{.WorkerID}}
        image: katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Status:
  Condition:                    Running
  Early Stopping Parameter Id:
  Last Reconcile Time:          2019-01-18T16:30:46Z
  Start Time:                   2019-01-18T16:30:46Z
  Studyid:                      y456536bd1e0ad5e
  Suggestion Count:             1
  Suggestion Parameter Id:      i31c2adcab54f891
  Trials:
    Trialid:  ka897d189e024460
    Workeridlist:
      Completion Time:  <nil>
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         ma76ebe2b23fec02
    Trialid:            v9ec0edbb16befd7
    Workeridlist:
      Completion Time:  <nil>
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         yc5053df337dbeec
    Trialid:            be68860be22cfce3
    Workeridlist:
      Completion Time:  <nil>
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         v095e6b93d87e9eb
Events:                 <none>
```

The demo should start a study and run three jobs with different parameters.
When the `Status.Condition` field in the `kubectl describe` output changes to
*Completed*, the StudyJob is finished.

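If you prefer to check the condition from a script rather than reading the `describe` output, a sketch along the following lines may help. It assumes that the StudyJob object, when fetched as JSON, exposes the condition under `status.condition` (inferred from the `Status` / `Condition` fields in the output above), and the helper names are hypothetical. The `fetch_studyjob` function shells out to `kubectl` and is therefore only defined here, not called.

```python
import json
import subprocess


def fetch_studyjob(name, namespace="kubeflow"):
    """Return the StudyJob as a dict by calling `kubectl get ... -o json`."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "studyjobs", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def study_condition(studyjob):
    """Read the condition; the status.condition key path is an assumption."""
    return studyjob.get("status", {}).get("condition", "Unknown")


def is_finished(studyjob):
    """A StudyJob is treated as finished once its condition is Completed."""
    return study_condition(studyjob) == "Completed"


# Offline demonstration with a hand-written object instead of a live cluster.
sample = {"status": {"condition": "Running"}}
print(study_condition(sample), is_finished(sample))
```

Against a live cluster you would call `is_finished(fetch_studyjob("random-example"))` in a loop with a delay between polls.
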
### TensorFlow operator example

To run the TensorFlow operator example, you must set up a volume.

If you are using GKE and the default StorageClass, you must create this PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

If you are not using GKE and you don't have a StorageClass for dynamic volume
provisioning in your cluster, you must create a PVC and a PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```

Now you can run the TensorFlow operator example:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfjob-example.yaml
```

You can check the status of the study:

```
kubectl -n kubeflow describe studyjobs tfjob-example
```

### PyTorch example

This is an example for the PyTorch operator:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
```

You can check the status of the study:

```
kubectl -n kubeflow describe studyjobs pytorchjob-example
```

## Monitoring results

You can monitor your results in the Katib UI. To access the Katib UI, you must
install Ambassador.

In your ksonnet application's root directory, run the following commands:

```
ks generate ambassador ambassador
ks apply ${KF_ENV} -c ambassador
```

Then port-forward the Ambassador service:

* For Kubernetes version 1.9 and later:

    ```
    kubectl port-forward svc/ambassador -n kubeflow 8080:80
    ```

* For Kubernetes version 1.8 and earlier:

    ```
    kubectl get pods -n kubeflow  # Find one of the Ambassador pods
    kubectl port-forward [Ambassador pod] -n kubeflow 8080:80
    ```

Now you can access the Katib UI at this URL: `http://localhost:8080/katib/`.

## Cleanup

Delete the installed components:

```
ks delete ${KF_ENV} -c katib
ks delete ${KF_ENV} -c pytorch-operator
ks delete ${KF_ENV} -c tf-job-operator
```

If you created a PV for Katib, delete it:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```

If you created a PV and PVC for the TensorFlow operator, delete them:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```

If you deployed Ambassador, delete it:

```
ks delete ${KF_ENV} -c ambassador
```

## Metrics collector

Katib has a metrics collector that reads metrics from the stdout of each
worker. Your training container should print each metric on its own line in
the following format: `{metrics name}={value}`. For example, when your
objective value name is `loss` and the metrics are `recall` and `precision`,
your training container should print output like this:

```
epoch 1:
loss=0.3
recall=0.5
precision=0.4

epoch 2:
loss=0.2
recall=0.55
precision=0.5
```

Katib collects all of the logged metrics.

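The following Python sketch shows one way a training script could emit metrics in this format, together with a small parser that illustrates how such lines can be picked out of a log. The parser is an illustration only and is not Katib's collector; the helper names `format_metric` and `parse_metrics` and the regular expression are assumptions of this sketch.

```python
import re


def format_metric(name, value):
    """Build one metrics line in the {metrics name}={value} format."""
    return f"{name}={value}"


# One metric per line: a name, "=", and a decimal number (assumed syntax).
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z][\w-]*)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(log_text, names):
    """Collect every logged value for the requested metric names, in order."""
    collected = {name: [] for name in names}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in collected:
            collected[match.group(1)].append(float(match.group(2)))
    return collected


# Reproduce the two epochs from the example output.
log = "\n".join([
    "epoch 1:",
    format_metric("loss", 0.3),
    format_metric("recall", 0.5),
    format_metric("precision", 0.4),
    "",
    "epoch 2:",
    format_metric("loss", 0.2),
    format_metric("recall", 0.55),
    format_metric("precision", 0.5),
])
print(log)
print(parse_metrics(log, ["loss", "recall", "precision"]))
```

Lines such as `epoch 1:` do not match the pattern and are ignored, so ordinary progress messages can be mixed with metrics in the same output.
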