trainer/examples/pytorch/image-classification
Andrey Velichkevich 4b0c2943bc
feat(sdk): Support MPI-based TrainJobs (#2545)
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

* Fix example

* Add Runtime Trainer object

* Update for new Runtime object

* Implement get_runtime API

* Fix Torch example

* Remove unused consts

* Update func args

* Update SDK constants

* Change to 16Gi

* Fix container name for MPI

* Keep launcher container for MPI

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-20 18:08:02 +00:00
mnist.ipynb feat(sdk): Support MPI-based TrainJobs (#2545) 2025-03-20 18:08:02 +00:00