Kunming Qu 0d49548b3a Mpi example (#690)
* mpi horovod example on kubeflow

* add readme
2019-12-09 17:49:29 -08:00

Kubeflow MPI Horovod example

This example deploys the MPI operator into a Kubeflow cluster and runs a distributed training example on GPUs.

Steps

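  • Set the GCP project and the GKE cluster name:
export PROJECT=
export CLUSTER=
  • Create a GPU node pool (2 nodes, each with 4 NVIDIA Tesla K80 GPUs):
gcloud container node-pools create gpu-pool-mpi --accelerator=type=nvidia-tesla-k80,count=4 --cluster=$CLUSTER --project=$PROJECT --machine-type=n1-standard-8 --num-nodes=2
  • Deploy the MPI operator:
kustomize build mpi-job/mpi-operator/base/ | kubectl apply -f -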
  • Deploy the MPI example job:
kubectl apply -f mpi-job.yaml -n kubeflow
  • Once the launcher pod is up and running, follow its logs:
POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=tf-resnet50-horovod-job,mpi_role_type=launcher -o name)
kubectl -n kubeflow logs -f ${POD_NAME}
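
The launcher pod may not exist yet right after the job is applied, in which case POD_NAME above ends up empty. The following is a minimal sketch, not part of the original example, that wraps the same commands in small shell functions: one checks that PROJECT and CLUSTER were filled in, one polls for the launcher pod, and one waits for it to become Ready before streaming logs. The function names, the retry count, the delay, and the 600s timeout are all assumptions chosen for illustration.

```shell
# Sketch only: helper functions around the commands in the steps above.
# Function names, retry counts and timeouts are illustrative assumptions.

# Fail early if PROJECT or CLUSTER was left empty.
require_gcp_vars() {
  if [ -z "${PROJECT:-}" ] || [ -z "${CLUSTER:-}" ]; then
    echo "error: set PROJECT and CLUSTER before running gcloud" >&2
    return 1
  fi
}

# Print the launcher pod name (as returned by "-o name"), polling until it exists.
# Args: job name, number of attempts, seconds between attempts.
find_launcher_pod() {
  local job="${1:-tf-resnet50-horovod-job}"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local pod=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # Same label selector as in the step above; keep only the first match.
    pod=$(kubectl -n kubeflow get pods \
      -l "mpi_job_name=${job},mpi_role_type=launcher" -o name | head -n 1)
    if [ -n "$pod" ]; then
      printf '%s\n' "$pod"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "error: no launcher pod found for ${job}" >&2
  return 1
}

# Wait for the launcher pod to be Ready, then stream its logs.
follow_launcher_logs() {
  local pod
  pod=$(find_launcher_pod "$@") || return 1
  kubectl -n kubeflow wait --for=condition=Ready "$pod" --timeout=600s || return 1
  kubectl -n kubeflow logs -f "$pod"
}
```

After sourcing these functions, `require_gcp_vars` can be run before the gcloud command, and `follow_launcher_logs` after `kubectl apply -f mpi-job.yaml -n kubeflow`, in place of the last two commands.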