Commit Graph

5 Commits

Author SHA1 Message Date
Andrey Velichkevich 4b0c2943bc
feat(sdk): Support MPI-based TrainJobs (#2545)
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove unused consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-20 18:08:02 +00:00
Shao Wang 32ee3c7212
KEP-2401: Refactor current `train()` API (#2513)
* fix(sdk): rename Trainer to CustomTrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove validate_trainer().

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove lora related code.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove get_lora_config()

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): fix import error in __init__.py

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(example): update the image-classification example.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): delete remaining lora related code.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): modify args description in CustomTrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): add parameter type in CustomTrainer dataclass.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): update args in CustomTrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-13 02:18:13 +00:00
Andrey Velichkevich 9e785750d0
chore(test): Add E2E tests for Kubeflow Trainer (#2470)
* Add e2e tests for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add timeout for papermill

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add output as part of make command

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add k8s version to setup cluster

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Kind k8s version

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix 1.29 version

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Create script to run Notebook

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Download dataset when local_rank=0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update test/e2e/e2e_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor Go e2e tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Bump k8s to 1.29.14

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Install Kind from go mod

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix path for Kind package

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Go e2e

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Reduce number of CPUs
Export Notebook as artifact

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Print logs due to flaky test

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix artifact path

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* docker pull image

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix path

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add k8s version to output name

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove install Kind cmd

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-05 04:04:07 +00:00
Andrey Velichkevich 3060332931
Update the naming conventions for Kubeflow Trainer (#2415)
* Update the naming conventions for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix webhooks

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix paths for webhooks

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update go test cmd

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Rename kubeflowv1 to trainer pkg

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-02-06 13:48:30 +00:00
Antonin Stefanutti ee11629194
KEP-2170: Add PyTorch DDP MNIST training example (#2387)
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
2025-02-01 22:16:33 +00:00