Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)

MPI Operator

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

Deploy

kubectl create -f deploy/
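
This creates the operator's manifests from the deploy/ directory (the custom resource definition, the RBAC rules, and the operator itself). Before submitting jobs, you can confirm the operator came up with something like the following; the mpi-operator name used in the grep is an assumption, so adjust it to match the manifests in deploy/:

kubectl get crd
kubectl get pods --all-namespaces | grep mpi-operator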

Test

Launch a multi-node TensorFlow benchmark training job:

kubectl create -f examples/tensorflow-benchmarks.yaml
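
For reference, here is a minimal sketch of what such an MPIJob manifest might look like. The field names follow the v1alpha1 examples in this repo (gpus was added in #16 and is no longer allocated to the launcher), but the API group/version, GPU count, and image name below are assumptions; consult examples/tensorflow-benchmarks.yaml for the authoritative version:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  gpus: 8                     # assumed total GPUs across workers; the launcher gets none (#16)
  template:
    spec:
      containers:
      - name: tensorflow-benchmarks
        image: mpioperator/tensorflow-benchmarks:latest   # assumed image name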

Once all the pods are running, the training logs are available from the launcher pod.
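
For example (the -launcher suffix below is an assumption about how the operator names the launcher pod; use kubectl get pods to find the actual name):

kubectl get pods
kubectl logs -f tensorflow-benchmarks-launcher   # replace with the actual launcher pod name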