+++
title = "Hyperparameter Tuning (Katib)"
description = "Using Katib to tune your model's hyperparameters on Kubernetes"
weight = 5
+++

The [Katib](https://github.com/kubeflow/katib) project is inspired by
[Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf).
Katib is a scalable and flexible hyperparameter tuning framework and is tightly
integrated with Kubernetes. It does not depend on any specific deep learning
framework (such as TensorFlow, MXNet, or PyTorch).

## Installing Katib

To run Katib jobs, you must install the required packages as shown in this
section. In your ksonnet application's root directory, run the following
commands:

```
export KF_ENV=default
ks env set ${KF_ENV} --namespace=kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
```

The `KF_ENV` environment variable represents a conceptual deployment environment
such as development, test, staging, or production, as defined by ksonnet. This
example uses the `default` environment. You can read more about Kubeflow's use
of ksonnet in the Kubeflow
[ksonnet component guide](/docs/components/ksonnet/).

### TFJob (tf-operator)

To install a TensorFlow job operator, run the following commands:

```
ks pkg install kubeflow/tf-training
ks pkg install kubeflow/common
ks generate tf-job-operator tf-job-operator
ks apply ${KF_ENV} -c tf-job-operator
```

### PyTorch operator

To install a PyTorch job operator, run the following commands:

```
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${KF_ENV} -c pytorch-operator
```

### Katib

Then run the following commands to install Katib:

```
ks pkg install kubeflow/katib
ks generate katib katib
ks apply ${KF_ENV} -c katib
```

If you want to use Katib outside Google Kubernetes Engine (GKE) and your cluster
has no StorageClass for dynamic volume provisioning, you must create a
persistent volume (PV) for your persistent volume claim (PVC) to bind to.

This is the YAML file for a PV:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
```

After deploying the Katib package, run the following command to create the PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```

## Running examples

After deploying everything, you can run some examples.

### Example using random algorithm

You can create a StudyJob for Katib by defining a StudyJob config file. See the
[random algorithm example](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/random-example.yaml
```

Running this command launches a StudyJob. The StudyJob runs a series of training
jobs that train models using different hyperparameters, and saves the results.

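
To make "different hyperparameters" concrete, here is a minimal Python sketch of what a random suggestion step does conceptually: it draws one value per parameter from its feasible space and renders each as a `name=value` argument. The feasible space mirrors the one in the random example; `suggest_random`, `to_args`, and the uniform sampling are assumptions made for illustration, not Katib's actual implementation.

```python
import random

# Feasible space mirroring the random example's spec.
# The dictionary layout is hypothetical and chosen for this sketch only.
FEASIBLE_SPACE = [
    {"name": "--lr", "type": "double", "min": 0.01, "max": 0.03},
    {"name": "--num-layers", "type": "int", "min": 2, "max": 5},
    {"name": "--optimizer", "type": "categorical", "list": ["sgd", "adam", "ftrl"]},
]


def suggest_random(space, rng):
    """Draw one value per parameter from its feasible space (assumed uniform)."""
    trial = {}
    for param in space:
        if param["type"] == "double":
            trial[param["name"]] = rng.uniform(param["min"], param["max"])
        elif param["type"] == "int":
            # randint includes both endpoints
            trial[param["name"]] = rng.randint(param["min"], param["max"])
        elif param["type"] == "categorical":
            trial[param["name"]] = rng.choice(param["list"])
        else:
            raise ValueError(f"unknown parameter type: {param['type']}")
    return trial


def to_args(trial):
    """Render a trial as command-line arguments, one "name=value" per parameter."""
    return [f"{name}={value}" for name, value in trial.items()]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same trials
    for _ in range(3):
        print(to_args(suggest_random(FEASIBLE_SPACE, rng)))
```

Each printed list corresponds to the extra arguments one training job would receive, for example a `--lr` value between 0.01 and 0.03.
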
The configurations for the study (hyperparameter feasible space, optimization
parameter, optimization goal, suggestion algorithm, and so on) are defined in
[random-example.yaml](https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/random-example.yaml).

In this demo, hyperparameters are embedded as args. You can embed
hyperparameters in another way (for example, as environment variables) by using
the template defined in `WorkerSpec.GoTemplate.RawTemplate`. The template is
written in [Go template](https://golang.org/pkg/text/template/) format.

This demo randomly generates 3 hyperparameters:

* Learning rate (`--lr`) - type: double
* Number of NN layers (`--num-layers`) - type: int
* Optimizer (`--optimizer`) - type: categorical

Check the study status:

```
$ kubectl -n kubeflow describe studyjobs random-example
Name:         random-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-01-18T16:30:46Z
  Finalizers:
    clean-studyjob-data
  Generation:        5
  Resource Version:  1777650
  Self Link:         /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
  UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Metricsnames:
    accuracy
  Objectivevaluename:  Validation-accuracy
  Optimizationgoal:    0.88
  Optimizationtype:    maximize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      Max:          0.03
      Min:          0.01
    Name:           --lr
    Parametertype:  double
    Feasible:
      Max:          5
      Min:          2
    Name:           --num-layers
    Parametertype:  int
    Feasible:
      List:
        sgd
        adam
        ftrl
    Name:           --optimizer
    Parametertype:  categorical
  Requestcount:     4
  Study Name:       random-example
  Suggestion Spec:
    Request Number:        3
    Suggestion Algorithm:  random
    Suggestion Parameters:
      Name:   SuggestionCount
      Value:  0
  Worker Spec:
    Go Template:
      Raw Template:  apiVersion: batch/v1
kind: Job
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: {{.WorkerID}}
        image: katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Status:
  Condition:                    Running
  Early Stopping Parameter Id:
  Last Reconcile Time:          2019-01-18T16:30:46Z
  Start Time:                   2019-01-18T16:30:46Z
  Studyid:                      y456536bd1e0ad5e
  Suggestion Count:             1
  Suggestion Parameter Id:      i31c2adcab54f891
  Trials:
    Trialid:  ka897d189e024460
    Workeridlist:
      Completion Time:
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         ma76ebe2b23fec02
    Trialid:            v9ec0edbb16befd7
    Workeridlist:
      Completion Time:
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         yc5053df337dbeec
    Trialid:            be68860be22cfce3
    Workeridlist:
      Completion Time:
      Condition:        Running
      Kind:             Job
      Start Time:       2019-01-18T16:30:46Z
      Workerid:         v095e6b93d87e9eb
Events:
```

The demo should start a study and run three jobs with different parameters.
When `Status.Condition` changes to *Completed*, the StudyJob is finished.

### TensorFlow operator example

To run the TensorFlow operator example, you must install a volume.

If you are using GKE and the default StorageClass, you must create this PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

If you are not using GKE and your cluster has no StorageClass for dynamic volume
provisioning, you must create a PVC and a PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```

Now you can run the TensorFlow operator example:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfjob-example.yaml
```

You can check the status of the study:

```
kubectl -n kubeflow describe studyjobs tfjob-example
```

### PyTorch example

This is an example for the PyTorch operator:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
```

You can check the status of the study:

```
kubectl -n kubeflow describe studyjobs pytorchjob-example
```

## Monitoring results

You can monitor your results in the Katib UI. To access the Katib UI, you must
install Ambassador. In your ksonnet application's root directory, run the
following commands:

```
ks generate ambassador ambassador
ks apply ${KF_ENV} -c ambassador
```

Then port-forward the Ambassador service:

* For Kubernetes version 1.9 and later:

    ```
    kubectl port-forward svc/ambassador -n kubeflow 8080:80
    ```

* For Kubernetes version 1.8 and earlier:

    ```
    kubectl get pods -n kubeflow  # Find one of the Ambassador pods
    kubectl port-forward [Ambassador pod] -n kubeflow 8080:80
    ```

Now you can access the Katib UI at this URL: ```http://localhost:8080/katib/```.

## Cleanup

Delete the installed components:

```
ks delete ${KF_ENV} -c katib
ks delete ${KF_ENV} -c pytorch-operator
ks delete ${KF_ENV} -c tf-job-operator
```

If you created a PV for Katib, delete it:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha1/pv/pv.yaml
```

If you created a PV and PVC for the TensorFlow operator, delete them:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
```

If you deployed Ambassador, delete it:

```
ks delete ${KF_ENV} -c ambassador
```

## Metrics collector

Katib has a metrics collector that gathers metrics from the stdout of each
worker. Print metrics in the following format: `{metrics name}={value}`. For
example, when your objective value name is `loss` and the metrics are `recall`
and `precision`, your training container should print output like this:

```
epoch 1:
loss=0.3
recall=0.5
precision=0.4

epoch 2:
loss=0.2
recall=0.55
precision=0.5
```

Katib collects all of the logged metrics.
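
As an illustration of the `{metrics name}={value}` format, the following Python sketch shows how a training script might print its metrics and how such output could be parsed back. The helper names `format_metrics` and `parse_metrics` and the regular expression are hypothetical and written for this example; they do not reproduce the parsing rules of Katib's collector.

```python
import re

# Matches "name=value" pairs where the value is a plain decimal number.
# This pattern is an assumption for the sketch; it does not handle
# scientific notation such as 1e-3.
METRIC_PATTERN = re.compile(r"([A-Za-z_][\w\-]*)=(-?\d+(?:\.\d+)?)")


def format_metrics(epoch, metrics):
    """Build one log line with each metric written as name=value."""
    pairs = " ".join(f"{name}={value}" for name, value in metrics.items())
    return f"epoch {epoch}: {pairs}"


def parse_metrics(line, wanted):
    """Extract the wanted metrics from one log line as floats."""
    found = {}
    for name, value in METRIC_PATTERN.findall(line):
        if name in wanted:
            found[name] = float(value)
    return found


if __name__ == "__main__":
    # A training loop would call this once per epoch with real values.
    line = format_metrics(1, {"loss": 0.3, "recall": 0.5, "precision": 0.4})
    print(line)
    print(parse_metrics(line, {"loss", "recall", "precision"}))
```

Keeping the metric names in the printed output identical to the objective value name and metrics names in the StudyJob spec is what lets a collector of this kind associate values with the right metric.
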