Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)

Rong Ou 9994bd02db Initial code push		2018-06-01 15:52:20 -07:00
cmd	Initial code push	2018-06-01 15:52:20 -07:00
deploy	Initial code push	2018-06-01 15:52:20 -07:00
examples	Initial code push	2018-06-01 15:52:20 -07:00
hack	Initial code push	2018-06-01 15:52:20 -07:00
pkg	Initial code push	2018-06-01 15:52:20 -07:00
.dockerignore	Initial code push	2018-06-01 15:52:20 -07:00
.gitignore	Initial code push	2018-06-01 15:52:20 -07:00
Gopkg.lock	Initial code push	2018-06-01 15:52:20 -07:00
Gopkg.toml	Initial code push	2018-06-01 15:52:20 -07:00
LICENSE	Initial code push	2018-06-01 15:52:20 -07:00
OWNERS	Add initial approvers and reviewers to OWNERS (#1)	2018-06-01 15:47:04 -07:00
README.md	Initial code push	2018-06-01 15:52:20 -07:00

MPI Operator

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

Build

Check out the code:

mkdir -p ${GOPATH}/src/github.com/kubeflow
cd ${GOPATH}/src/github.com/kubeflow
git clone https://github.com/kubeflow/mpi-operator.git
cd mpi-operator

Build and push the mpi-operator Docker image:

docker build -t rongou/mpi-operator:0.1.0 -f cmd/mpi-operator/Dockerfile .
docker push rongou/mpi-operator:0.1.0

Build and push the kubectl-delivery Docker image:

docker build -t rongou/kubectl-delivery:0.1.0 -f cmd/kubectl-delivery/Dockerfile .
docker push rongou/kubectl-delivery:0.1.0
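
The two build-and-push steps above differ only in the image name, so they can be wrapped in a small helper when pushing to your own registry. This is a minimal sketch, not part of the repository: the REGISTRY and VERSION variables and the image_ref and build_and_push functions are hypothetical names, and the defaults simply mirror the tags used above. If you push under a different name, the manifests in deploy/ may need to reference it; check them before deploying.

```shell
#!/bin/sh
# Hypothetical helper; REGISTRY and VERSION are assumptions, defaulting to
# the values used in the commands above.
REGISTRY="${REGISTRY:-rongou}"
VERSION="${VERSION:-0.1.0}"

# Compose "<registry>/<name>:<version>" for a component name.
image_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$VERSION"
}

# Build the image from cmd/<name>/Dockerfile and push it.
# Must be run from the root of the mpi-operator checkout.
build_and_push() {
  docker build -t "$(image_ref "$1")" -f "cmd/$1/Dockerfile" . &&
    docker push "$(image_ref "$1")"
}
```

With the functions defined, `build_and_push mpi-operator` followed by `build_and_push kubectl-delivery` would reproduce both steps.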

Deploy

Create the operator's resources from the manifests in the deploy directory:

kubectl create -f deploy/

Test

Build and push the Horovod Docker image (this takes a while):

docker build -t rongou/horovod https://github.com/uber/horovod.git
docker push rongou/horovod

Build and push the tensorflow_benchmarks Docker image:

docker build -t rongou/tensorflow_benchmarks examples/tensorflow-benchmarks
docker push rongou/tensorflow_benchmarks

Launch a multi-node TensorFlow benchmark training job:

kubectl create -f examples/tensorflow-benchmarks.yaml

Once all pods are running, the training logs are available from the launcher pod.
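
One way to follow those logs is to look up the launcher pod by name and stream its output. The sketch below assumes the launcher pod's name contains "-launcher" and that the job runs in the current namespace; neither is stated here, so confirm the actual name with `kubectl get pods` first. The function names are hypothetical.

```shell
#!/bin/sh
# Pick the first pod name containing "-launcher" from a list on stdin.
# The "-launcher" naming pattern is an assumption, not confirmed above.
find_launcher() {
  grep -e '-launcher' | head -n 1
}

# Stream the launcher pod's logs from the current namespace.
tail_launcher_logs() {
  pod=$(kubectl get pods -o name | find_launcher)
  if [ -z "$pod" ]; then
    echo "no launcher pod found yet" >&2
    return 1
  fi
  kubectl logs -f "$pod"
}
```

Calling `tail_launcher_logs` after the job has been created streams the output until the launcher exits or you interrupt it.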