Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
Rong Ou 9994bd02db Initial code push 2018-06-01 15:52:20 -07:00

MPI Operator

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

Build

Check out the code:

mkdir -p ${GOPATH}/src/github.com/kubeflow
cd ${GOPATH}/src/github.com/kubeflow
git clone https://github.com/kubeflow/mpi-operator.git
cd mpi-operator

Build and push the mpi-operator Docker image:

docker build -t rongou/mpi-operator:0.1.0 -f cmd/mpi-operator/Dockerfile .
docker push rongou/mpi-operator:0.1.0

Build and push the kubectl-delivery Docker image:

docker build -t rongou/kubectl-delivery:0.1.0 -f cmd/kubectl-delivery/Dockerfile .
docker push rongou/kubectl-delivery:0.1.0
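
The two image builds above differ only in the component name, so they can be scripted to avoid copy-paste slips. The following is a minimal sketch, not part of the repository: `REPO`, `TAG`, and the `build_and_push` function are hypothetical names, and it assumes each component keeps its Dockerfile at `cmd/<component>/Dockerfile`, as in the commands above.

```shell
# Hypothetical helper, not part of the repository.
# REPO and TAG are assumed names; override them for your own registry.
REPO="${REPO:-rongou}"
TAG="${TAG:-0.1.0}"

# Build and push one component image, e.g. mpi-operator or kubectl-delivery.
# The push only runs if the build succeeds.
build_and_push() {
  component="$1"
  image="${REPO}/${component}:${TAG}"
  docker build -t "${image}" -f "cmd/${component}/Dockerfile" . &&
    docker push "${image}"
}

# Usage (run from the repository root):
#   build_and_push mpi-operator
#   build_and_push kubectl-delivery
```

Deriving the image name once per component means the tag passed to `docker build` and the one passed to `docker push` cannot drift apart.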

Deploy

kubectl create -f deploy/

Test

Build and push the horovod Docker image (this takes a while):

docker build -t rongou/horovod https://github.com/uber/horovod.git
docker push rongou/horovod

Build and push the tensorflow_benchmarks Docker image:

docker build -t rongou/tensorflow_benchmarks examples/tensorflow-benchmarks
docker push rongou/tensorflow_benchmarks

Launch a multi-node tensorflow benchmark training job:

kubectl create -f examples/tensorflow-benchmarks.yaml

Once all the pods are running, the training logs are available from the launcher pod.
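
As a sketch of how the logs might be retrieved: the launcher pod's exact name depends on the job, so the helper below simply looks for a pod whose name contains `launcher`. That naming pattern and the `find_launcher` function are assumptions for illustration, not something this README documents.

```shell
# Hypothetical helper: print the first pod whose name contains "launcher".
# Assumes the launcher pod's name includes the string "launcher".
find_launcher() {
  kubectl get pods -o name | grep launcher | head -n 1
}

# Usage (requires a running job in the current namespace):
#   kubectl logs -f "$(find_launcher)"
```

If the pattern matches nothing, `kubectl get pods` lists all pods so the launcher can be picked out by hand.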