# MPI Operator

The MPI Operator makes it easy to run allreduce-style distributed training.

## Deploy

Create the operator from the manifests in `deploy/`:

```sh
kubectl create -f deploy/
```
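To confirm the operator came up, you can check for its custom resource definition and pod. A minimal sketch, assuming the manifests install into the current namespace and use `mpijobs`/`mpi-operator` in the resource and pod names (both assumptions, not confirmed by this README):

```sh
# Verify the MPIJob custom resource definition was registered
# (assumes the CRD's plural name contains "mpijobs").
kubectl get crd | grep mpijobs

# Check that the operator pod is running
# (the "mpi-operator" pod-name prefix is an assumption).
kubectl get pods | grep mpi-operator
```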
## Test

Launch a multi-node TensorFlow benchmarks training job:

```sh
kubectl create -f examples/tensorflow-benchmarks.yaml
```
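After submitting, you can watch the job and its pods come up. A minimal sketch, assuming the CRD registers an `mpijobs` resource and that the example manifest names the job `tensorflow-benchmarks` (an assumption based on the file name):

```sh
# List MPIJob resources (assumes the resource plural is "mpijobs").
kubectl get mpijobs

# Watch the launcher and worker pods being created for the job.
kubectl get pods -w | grep tensorflow-benchmarks
```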
Once the job is running, the training logs are available in the launcher pod.
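For example, to stream the training output, something like the following should work; the `-launcher` pod-name suffix is an assumption about how the operator names the launcher pod:

```sh
# Find the launcher pod for the job and follow its logs
# (the "-launcher" suffix is assumed, not confirmed by this README).
kubectl logs -f $(kubectl get pods -o name | grep tensorflow-benchmarks | grep launcher)
```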