trainer/pkg/runtime/core
Shao Wang 040b34e1e6
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587)
* chore(plugin): Add torchtune-related constants & update current torch plugin.
* chore(plugin): Add EnforceMLPolicy for torchtune.
* chore(plugin): Add UTs in torch plugin.
* fix(test): fix error in torch plugin UTs.
* chore(plugin): Choose recipe according to numNodes & numProcPerNode & Args.
* chore(sdk): Add PretrainedModel enum type.
* chore(plugin): Add torchtune config arg.
* chore(test): add UT for single-device full fine-tuning with torchtune.
* chore(test): Add test for multi-nodes full fine-tuning with torchtune.
* chore(test): Update torch validate UTs.
* fix(lint): fix lint error.
* fix(sdk): remove pretrained model enum type in sdk.
* fix(plugin): retrieve model name from runtimeRef.
* fix(lint): fix typo.
* fix(plugin): make some adjustments according to the review.
* fix(sdk): remove runtime in get_trainer_crd_from_builtin_trainer.
* fix(plugin): pass PET_ env variables in torch plugin for torchtune.
* fix(plugin): add env validation for torchtune.
* fix(plugin): update comments.
* fix(plugins): fix the implementation according to the review.
* test(plugins): fix UT error in torch plugin.
* fix: fix UT and e2e tests error.
* fix: remove debug info.
* fix(test): add args in UTs related to torchtune.
* fix(test): update torchtune related args.
* fix(test): Add a UT for multi-node mode check in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-04-29 13:21:03 +00:00
File                            Last commit                                                                     Date
clustertrainingruntime.go       Fix issue with fetching clustertrainingruntime for validations (#2564)          2025-03-24 15:14:06 +00:00
clustertrainingruntime_test.go  feat(sdk): Support MPI-based TrainJobs (#2545)                                  2025-03-20 18:08:02 +00:00
core.go                         Add dependencies to RuntimeRegistrar (#2476)                                    2025-03-06 02:13:13 +00:00
registry.go                     Add dependencies to RuntimeRegistrar (#2476)                                    2025-03-06 02:13:13 +00:00
trainingruntime.go              Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557)              2025-03-24 11:05:06 +00:00
trainingruntime_test.go         KEP-2401: Complement torch plugin to support torchtune config mutation (#2587)  2025-04-29 13:21:03 +00:00