Andrey Velichkevich
|
4b0c2943bc
|
feat(sdk): Support MPI-based TrainJobs (#2545)
* feat(sdk): Support MPI-based TrainJobs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor list_runtimes
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix example
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Runtime Trainer object
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update for new Runtime object
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement get_runtime API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Torch example
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove unused consts
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update func args
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update SDK constants
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Change to 16Gi
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix container name for MPI
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Keep launcher container for MPI
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
2025-03-20 18:08:02 +00:00 |
Shao Wang
|
32ee3c7212
|
KEP-2401: Refactor current `train()` API (#2513)
* fix(sdk): rename Trainer to CustomTrainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove validate_trainer().
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove lora related code.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove get_lora_config()
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): fix import error in __init__.py
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(example): update the image-classification example.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): delete remaining lora related code.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): modify args description in CustomTrainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): add parameter type in CustomTrainer dataclass.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): update args in CustomTrainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
|
2025-03-13 02:18:13 +00:00 |
Andrey Velichkevich
|
9e785750d0
|
chore(test): Add E2E tests for Kubeflow Trainer (#2470)
* Add e2e tests for Kubeflow Trainer
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add timeout for papermill
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add output as part of make command
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add k8s version to setup cluster
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Kind k8s version
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix 1.29 version
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Create script to run Notebook
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Download dataset when local_rank=0
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update test/e2e/e2e_test.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor Go e2e tests
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Bump k8s to 1.29.14
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Install Kind from go mod
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix path for Kind package
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Go e2e
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Reduce number of CPUs
Export Notebook as artifact
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Print logs due to flaky test
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix artifact path
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* docker pull image
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix path
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add k8s version to output name
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove install Kind cmd
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
|
2025-03-05 04:04:07 +00:00 |
Andrey Velichkevich
|
3060332931
|
Update the naming conventions for Kubeflow Trainer (#2415)
* Update the naming conventions for Kubeflow Trainer
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix webhooks
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix paths for webhooks
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update go test cmd
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Rename kubeflowv1 to trainer pkg
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
2025-02-06 13:48:30 +00:00 |
Antonin Stefanutti
|
ee11629194
|
KEP-2170: Add PyTorch DDP MNIST training example (#2387)
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
|
2025-02-01 22:16:33 +00:00 |