+++
title = "MPI Training"
description = "Instructions for using MPI for training"
weight = 25
+++

This guide walks you through using MPI for training.

## Installation

If you haven't already done so, follow the [Getting Started Guide](https://www.kubeflow.org/docs/started/getting-started/) to deploy Kubeflow.

An alpha version of MPI support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.

You can check whether the MPI Job custom resource is installed by running:

```
kubectl get crd
```

The output should include `mpijobs.kubeflow.org` like the following:

```
NAME                                           AGE
...
mpijobs.kubeflow.org                           4d
...
```

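In a script, the same check can be automated. The sketch below assumes a hypothetical helper named `has_mpijob_crd` (it is not part of Kubeflow or `kubectl`) that reads the `kubectl get crd` listing on standard input and succeeds only when a line begins with `mpijobs.kubeflow.org`. The call against a live cluster is left commented out because it needs cluster access.

```shell
# Hypothetical helper (not part of Kubeflow): succeeds when the CRD listing
# read from standard input contains the MPIJob custom resource definition.
has_mpijob_crd() {
  # -q: print nothing, report the result through the exit status only.
  # The name must start the line and be followed by whitespace or end of line.
  grep -Eq '^mpijobs\.kubeflow\.org([[:space:]]|$)'
}

# Usage against a live cluster:
# kubectl get crd | has_mpijob_crd && echo "MPIJob CRD is installed"
```
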
If it is not listed, you can install it with ksonnet:

```
cd ${KSONNET_APP}
ks pkg install kubeflow/mpi-job
ks generate mpi-operator mpi-operator
ks apply ${ENVIRONMENT} -c mpi-operator
```

Alternatively, you can deploy the operator with default settings, without ksonnet, by running the following from the root of the `mpi-operator` repo:

```shell
kubectl create -f deploy/
```

## Creating an MPI Job

You can create an MPI job by defining an `MPIJob` config file. See the [TensorFlow benchmark example](https://github.com/kubeflow/mpi-operator/blob/master/examples/tensorflow-benchmarks.yaml) for a config file that launches a multi-node TensorFlow benchmark training job. You can adapt the config file to your requirements.

```
cat examples/tensorflow-benchmarks.yaml
```

Deploy the `MPIJob` resource to start training:

```
kubectl create -f examples/tensorflow-benchmarks.yaml
```

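If you do not have the repository checked out, a minimal manifest can be written by hand. The sketch below is reconstructed from the `spec` fields that appear in the sample `kubectl get -o yaml` output in this guide, not copied from the example file, so the real `tensorflow-benchmarks.yaml` may set additional fields. The output file name `mpijob-sketch.yaml` is an arbitrary, hypothetical choice.

```shell
# Write a minimal MPIJob manifest. Field values mirror the sample output shown
# in the monitoring section; the file name is a hypothetical choice.
cat > mpijob-sketch.yaml <<'EOF'
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-16
spec:
  gpus: 16            # number of GPUs for the job, as in the sample output
  template:
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
EOF

# Submit it to the cluster (commented out because it needs cluster access):
# kubectl create -f mpijob-sketch.yaml
```
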
## Monitoring an MPI Job

Once the `MPIJob` resource is created, you should see pods created to match the specified number of GPUs. You can also monitor the job from the `status` section of the resource. Here is sample output from a job that completed successfully:

```
kubectl get -o yaml mpijobs tensorflow-benchmarks-16
```

```
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  clusterName: ""
  creationTimestamp: 2019-01-07T20:32:12Z
  generation: 1
  name: tensorflow-benchmarks-16
  namespace: default
  resourceVersion: "185051397"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mpijobs/tensorflow-benchmarks-16
  uid: 8dc8c044-127d-11e9-a419-02420bbe29f3
spec:
  gpus: 16
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
        resources: {}
status:
  launcherStatus: Succeeded
```

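When only the final state matters, the `launcherStatus` field can be pulled out of that YAML. The sketch below assumes a hypothetical helper named `launcher_status` (not part of the operator) that reads the resource YAML on standard input and prints the field's value; `kubectl`'s `-o jsonpath` output option is an alternative way to select a single field.

```shell
# Hypothetical helper: print the value of launcherStatus from MPIJob YAML
# read on standard input (for example "Succeeded").
launcher_status() {
  # Strip the key and surrounding whitespace, print only matching lines.
  sed -n 's/^[[:space:]]*launcherStatus:[[:space:]]*//p'
}

# Usage against a live cluster (commented out because it needs cluster access):
# kubectl get -o yaml mpijobs tensorflow-benchmarks-16 | launcher_status
```
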
Training runs for 100 steps and takes a few minutes on a GPU cluster. Once the job starts, you can follow the training progress in the logs of the `launcher` pod:

```
PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks-16,mpi_role_type=launcher -o name)
kubectl logs -f ${PODNAME}
```

```
TensorFlow:  1.10
Model:       resnet101
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             64 per device
Num batches: 100
Num epochs:  0.01
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod

...

40      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.146
40      images/sec: 132.1 +/- 0.0 (jitter = 0.1)        9.182
50      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.071
50      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.210
60      images/sec: 132.2 +/- 0.0 (jitter = 0.2)        9.180
60      images/sec: 132.2 +/- 0.0 (jitter = 0.2)        9.055
70      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.005
70      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.096
80      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.231
80      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.197
90      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.201
90      images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.089
100     images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.183
----------------------------------------------------------------
total images/sec: 264.26
----------------------------------------------------------------
100     images/sec: 132.1 +/- 0.0 (jitter = 0.2)        9.044
----------------------------------------------------------------
total images/sec: 264.26
----------------------------------------------------------------
```

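To record the headline throughput without reading the whole log, the `total images/sec` line can be filtered out. The sketch below assumes a hypothetical helper named `total_images_per_sec` (not part of the benchmark image) and assumes the log keeps the line format shown above. Because each worker prints its own summary, the helper prints only the first match.

```shell
# Hypothetical helper: print the first "total images/sec" figure found in
# benchmark log text read on standard input.
total_images_per_sec() {
  # Fields on a matching line are: "total", "images/sec:", "<value>".
  awk '/^total images\/sec:/ && !seen { print $3; seen = 1 }'
}

# Usage against a live cluster (commented out because it needs cluster access):
# kubectl logs "${PODNAME}" | total_images_per_sec
```
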
## Docker Images

Docker images are built and pushed automatically to [mpioperator on Docker Hub](https://hub.docker.com/u/mpioperator). You can use the following Dockerfiles to build the images yourself:

* [mpi-operator](https://github.com/kubeflow/mpi-operator/blob/master/Dockerfile)
* [kubectl-delivery](https://github.com/kubeflow/mpi-operator/blob/master/cmd/kubectl-delivery/Dockerfile)