Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)

MPI Operator

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

Deploy

kubectl create -f deploy/
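
This creates the operator's manifests from the deploy/ directory (the custom resource definition, the RBAC rules, and the operator itself). Before submitting jobs, you can confirm the operator came up with something like the following; the mpi-operator name used in the grep is an assumption, so adjust it to match the manifests in deploy/:

kubectl get crd
kubectl get pods --all-namespaces | grep mpi-operator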

Test

Launch a multi-node TensorFlow benchmark training job:

kubectl create -f examples/tensorflow-benchmarks.yaml
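
For reference, here is a minimal sketch of what such an MPIJob manifest might look like. The field names follow the v1alpha1 examples in this repo (gpus was added in #16 and is no longer allocated to the launcher), but the API group/version, GPU count, and image name below are assumptions; consult examples/tensorflow-benchmarks.yaml for the authoritative version:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  gpus: 8                     # assumed total GPUs across workers; the launcher gets none (#16)
  template:
    spec:
      containers:
      - name: tensorflow-benchmarks
        image: mpioperator/tensorflow-benchmarks:latest   # assumed image name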

Once all the pods are running, the training logs are available from the launcher pod.
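
For example (the -launcher suffix below is an assumption about how the operator names the launcher pod; use kubectl get pods to find the actual name):

kubectl get pods
kubectl logs -f tensorflow-benchmarks-launcher   # replace with the actual launcher pod name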