Antonin Stefanutti
c333826023
Remove TrainJobCreated condition (#2621)
* Remove TrainJobCreated condition
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Update KEP-2170 proposal
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Remove Created condition from SDK
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Default TrainJob status to Created unconditionally
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Set Failed condition on TrainJob runtime creation errors
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Emit a warning event upon TrainJob resources reconcile error
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Update TrainJob resources creation failed event
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Truncate event message to the maximum length limit
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Update state diagram in KEP-2170
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Append ellipsis to event message if it's truncated
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
---------
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
2025-05-02 17:59:05 +00:00
Shao Wang
9a2036dc93
fix(doc): tidy up KEP-2401. (#2594)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-04-11 22:04:05 +00:00
Yuki Iwai
3781eda0e6
Add PodNetwork plugin to KEP-2170 Job Pipeline Framework description (#2578)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-30 22:35:30 +00:00
Shao Wang
0b83eeb892
KEP-2401: Add `TorchTuneConfig` to `train()` API (#2522)
* feat(sdk): add TorchTuneConfig.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* rebase(sdk): rebase on the newest master branch.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add args description in train().
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add description for train() func.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): split train() according to trainer and fine_tuning_config
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): update the launching command and args.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add get_args_using_torchtune_config.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): fix some wrong description in train()
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): update the description of fine_tuning_config.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove numProcPernode in train().
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add TorchTuneConfig in train()
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): add torchtune logic in train()
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(sdk): add BuiltinTrainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(sdk): add BuiltinTrainer logic.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add description for initializer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): update description of runtime
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): address unresolved merge error.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove duplicated fields.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): fix train() description according to the review.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add get_trainer_crd_from_custom_trainer
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add get_trainer_crd_from_builtin_trainer and refactor train() API.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(sdk): add Loss enum class.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update loss type in KEP-2401
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove BuiltinTrainer in train().
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): add enum type for dtype.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update KEP according to the type update of dtype.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update the type description in the table.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): update dtype validation in utils.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): update dtype override according to the review.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-26 12:23:04 +00:00
Shao Wang
4e6199c948
feat(doc): add Runtime API design in KEP-2401. (#2501)
* feat(doc): add Runtime API design in KEP-2401.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix typo error.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update the implementation history.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): rename model to pretrained_model.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update runtime class according to the review.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update the runtimes design according to PR #2545
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): update train() API according to PR #2545
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update runtime_ref field.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-25 02:28:56 +00:00
Andrey Velichkevich
4b0c2943bc
feat(sdk): Support MPI-based TrainJobs (#2545)
* feat(sdk): Support MPI-based TrainJobs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor list_runtimes
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix example
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Runtime Trainer object
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update for new Runtime object
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement get_runtime API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Torch example
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove unused consts
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update func args
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update SDK constants
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Change to 16Gi
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix container name for MPI
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Keep launcher container for MPI
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-20 18:08:02 +00:00
Shao Wang
e7609928f3
fix(doc): Update `train()` API in KEP-2401 (#2536)
* fix(doc): update train API.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update example for train API.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix error in train() API example.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update func type in CustomTrainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix error in runtime_ref type.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix some typos.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* Update docs/proposals/2401-llm-trainer-v2/README.md
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
* Update docs/proposals/2401-llm-trainer-v2/README.md
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
* Update docs/proposals/2401-llm-trainer-v2/README.md
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
* Update docs/proposals/2401-llm-trainer-v2/README.md
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-18 02:35:20 +00:00
Andrey Velichkevich
b323b44cac
feat(controller): Refactor the Initializer APIs of TrainJob (#2523)
* feat(controller): Refactor the Initializer APIs of TrainJob
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix go unit test
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix integration test
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-17 07:15:51 +00:00
Shao Wang
b89ce8491f
KEP-2401: Kubeflow LLM Trainer V2 (#2410)
* doc: add initial doc for KEP-2401.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update motivation.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add llm lifecycle picture.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add goals and non-goals.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add alternatives.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add proposal chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add multiple frameworks support section in design chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add data preprocess section in design chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add fine-tuning config section in design details chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove all trailing whitespace.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update llm-trainer-v2-workflow img.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update goals and non-goals.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: move torchrun proposal to alternatives.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: move torchrun design to alternatives.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: move some fine-tuning config not supported by torchtune to alternatives.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: move torchtune sections to proposal and design chapters.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update proposal & move FSDP config to alternatives.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update fine-tuning config & unify lora/qlora/dora.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update fine-tuning config & fix doc according to comments.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add model & dataset initialization / model exporting.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add dataset preprocess/tokenizer chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: fix some errors in doc.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update chapter name.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add type in the diagrams.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add optimizer and scheduler config.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix some errors in doc.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add initial parameter override.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: update config override.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: fix some errors in doc.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): add CustomTrainingConfig dataclass.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): integrate torchtune mutation logic into torch plugin.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): split torchtune config chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): add two options for SDK & separate LoRA chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add an example to show parameters mutation.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add detailed design on mutation in torch plugin.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add dir structure for option 1.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add dir structure for option 2.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* doc: add Test Plans chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove device parameter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix typo error.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix code line format.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update error in proposal example & add num_nodes and resources_per_node to TorchtuneConfig.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update manifests dir in option 1.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): split complement torch plugin chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): move option 1 (reserving recipe and config) to alternatives & reorganize structures.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update goals & add description in propagate torchtune settings in SDK.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): complete map section in SDK.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): add maintaining ClusterTrainingRuntime chapter.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update recipe selection.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove some CTRs & only reserve llama family.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): rename TorchtuneConfig to TorchTuneConfig.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove name prefix in CTRs.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update TrainJob and CTR example.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix some typos & address comments.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update complement torch plugin section.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): add gemma2 mistral qwen2_5 back.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update implementation history.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove the name prefix in CTRs.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update typo according to the review.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): add webhook section.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(doc): add webhook func description.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update item format.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): add the lifecycle of LLM fine-tuning with torchtune.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove diagram description.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): reorg and update the doc according to the review.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix some typos.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): fix some format error.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update implementation history.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): rename CTRs' file name.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): remove detailed design.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-11 11:29:39 +00:00
Andrey Velichkevich
9ac32413c3
feat(controller): Integrate DependsOn API (#2484)
* feat(controller): Integrate DependsOn API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Use go for unit test
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update Makefile
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update Makefile
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix integration test
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix e2e
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Exit 1 if e2e fails
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-07 01:11:34 +00:00
Yuki Iwai
3ec8f0705f
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design (#2439)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-03 21:51:51 +00:00
Anish Asthana
e7e35d12f5
Add 'KEP Usage' KEP and template link (#2423)
Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
2025-02-15 00:23:37 +00:00
Shao Wang
62e958fa8c
KEP-2170: Change API Group Name to `trainer.kubeflow.org` (#2413)
* fix(apis): change the group of API to trainer.kubeflow.org.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore(manifests): update crds in manifests using make manifests.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore: change the apis dir name to trainer.kubeflow.org and update reference.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore: execute make generate.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix: remove remaining kubeflow.org dirs.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove outdated docs & update models reference.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix: rename apis dir to trainer.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* chore: execute make generate.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): remove outdated docs & update models reference.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(sdk): update model reference in code.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
* fix(doc): update api group in KEP-2170.
Signed-off-by: Electronic-Waste <2690692950@qq.com>
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-02-05 13:34:37 +00:00
Andrey Velichkevich
0c30f5cd30
KEP-2170: Update V2 KEP with MPI Runtime info (#2345)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-12-16 03:21:08 +00:00
Yuki Iwai
9e46f9d422
KEP-2170: Add the TrainJob state transition design (#2298)
* KEP-2170: Add the TrainJob state transition design
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Replace actual jobs with TrainJob
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Expand the Creation Failed reasons
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename Completed to Complete
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-11-02 21:31:14 +00:00
Andrey Velichkevich
2e1e125d7c
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)
* KEP-2170: Implement Initializer builder in the JobSet plugin
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update the SDK models
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove Info from Initializer builder
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update manifests
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update pkg/constants/constants.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Use var for envs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Remove check manifests from GitHub actions
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Move consts to JobSet plugin
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-11-01 21:07:14 +00:00
Yuki Iwai
a655a9045b
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-24 19:26:31 +00:00
Yuki Iwai
ab6938c864
KEP-2170: Decouple JobSet from TrainJob (#2296)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-23 14:50:30 +00:00
Andrey Velichkevich
6965c1a924
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Rename RuntimeRef in runtime framework
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-10-17 18:29:20 +00:00
Andrey Velichkevich
2cc5dfed46
Update README and out-of-date docs (#2252)
* Update README and out-of-date docs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Move KEPs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Revert Jax KEP table
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix readme text
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-09-10 10:18:20 +00:00
Andrey Velichkevich
13c3ee8354
KEP-2170: Update Training V2 APIs in the KEP (#2240)
* KEP-2170: Update Training V2 APIs in the KEP
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update docs/proposals/2170-kubeflow-training-v2/README.md
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update PodSpecOverride API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update managedBy comment
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-30 18:46:33 +00:00
Yuki Iwai
725b09e300
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-09 15:16:38 +00:00
Yuki Iwai
94140ed1d3
KEP-2170: Make API specification more restricting (#2198)
* Fix formatting issues
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Clarify trainingRuntimeRef
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update the managedBy specifications
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use 'kubeflow.org/trainjob-controller' instead of 'training-operator.kubeflow.org/trainjob-controller'
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* The ClusterTrainingRuntime is used in the runtimeRef as a default
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Move apiVersion for the TrainingRuntime to alternative section
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-09 11:58:38 +00:00
Andrey Velichkevich
53341c9e9d
KEP-2170: Kubeflow Training V2 API (#2171)
* KEP-2170: Kubeflow Training V2 API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix some comments
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add user roles diagram
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Move diagrams after design
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update diagram
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor Model and Dataset configs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update runtime timelines
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Address readability comments
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Explanation for Trainer
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update LLM Fine-Tuning Diagram
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix Llama model name
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add goal for integration with Kueue
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add links for Job run policies
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add some alternatives
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix more API types
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix empty number of nodes
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Rename to Coscheduling
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Change parameters to env
Add runLauncherAsNode parameter
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update PodSpecOverride with scheduling directives
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix TrainingRuntime field
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor PodGroupSpec APIs
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add note about scheduler name
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add initial TrainJob status field
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix pre-commit
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-08-06 16:56:39 +00:00
Sandipan Panda
ee736a76ba
Update JAX integration proposal (#2165)
Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
2024-07-15 16:09:55 +00:00
Sandipan Panda
bcba864ee2
JAX Integration Enhancement Proposal (#2125)
Kubeflow Enhancement Proposal: Integrate JAX with Kubeflow Training Operator
Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
2024-07-12 10:30:17 +00:00
Andrey Velichkevich
0b6a30cd34
[SDK] Fix Worker and Master templates for PyTorchJob (#1988)
2024-01-16 19:09:19 +00:00
deepanker13
39f8b2202b
Train/Fine-tune API Proposal for LLMs (#1945)
* added train api proposal
* feedback changes
* Update docs/proposals/train_api_proposal.md
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update docs/proposals/train_api_proposal.md
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update docs/proposals/train_api_proposal.md
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* proposal review changes
* Update docs/proposals/train_api_proposal.md
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review changes
* name fix
* review changes
* review changes
* review changes
* adding goal/nongoal header
* adding more non goals
* review changes
* adding br tags
* review changes
---------
Co-authored-by: Johnu George <johnu.george@nutanix.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2023-12-05 16:35:06 +00:00