mirror of https://github.com/kubeflow/trainer.git
* chore(plugin): Add torchtune-related constants & update current torch plugin.
* chore(plugin): Add EnforceMLPolicy for torchtune.
* chore(plugin): Add UTs in torch plugin.
* fix(test): fix error in torch plugin UTs.
* chore(plugin): Choose recipe according to numNodes & numProcPerNode & Args.
* chore(sdk): Add PretrainedModel enum type.
* chore(plugin): Add torchtune config arg.
* chore(test): add UT for single-device full fine-tuning with torchtune.
* chore(test): Add test for multi-nodes full fine-tuning with torchtune.
* chore(test): Update torch validate UTs.
* fix(lint): fix lint error.
* fix(sdk): remove pretrained model enum type in sdk.
* fix(plugin): retrieve model name from runtimeRef.
* fix(lint): fix typo.
* fix(plugin): make some adjustments according to the review.
* fix(sdk): remove runtime in get_trainer_crd_from_builtin_trainer.
* fix(plugin): pass PET_ env variables in torch plugin for torchtune.
* fix(plugin): add env validation for torchtune.
* fix(plugin): update comments.
* fix(plugins): fix the implementation according to the review.
* test(plugins): fix UT error in torch plugin.
* fix: fix UT and e2e tests error.
* fix: remove debug info.
* fix(test): add args in UTs related to torchtune.
* fix(test): update torchtune related args.
* fix(test): Add a UT for multi-node mode check in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Name | ||
---|---|---|
clustertrainingruntime.go | ||
clustertrainingruntime_test.go | ||
core.go | ||
registry.go | ||
trainingruntime.go | ||
trainingruntime_test.go |