28 Commits

Author SHA1 Message Date
Antonin Stefanutti c333826023
Remove TrainJobCreated condition (#2621)
* Remove TrainJobCreated condition

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update KEP-2170 proposal

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Remove Created condition from SDK

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Default TrainJob status to Created unconditionally

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Set Failed condition on TrainJob runtime creation errors

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Emit a warning event upon TrainJob resources reconcile error

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update TrainJob resources creation failed event

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Truncate event message to the maximum length limit

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update state diagram in KEP-2170

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Append ellipsis to event message if it's truncated

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
2025-05-02 17:59:05 +00:00
Shao Wang 9a2036dc93
fix(doc): tidy up KEP-2401. (#2594)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-04-11 22:04:05 +00:00
Yuki Iwai 3781eda0e6
Add PodNetwork plugin to KEP-2170 Job Pipeline Framework description (#2578)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-30 22:35:30 +00:00
Shao Wang 0b83eeb892
KEP-2401: Add `TorchTuneConfig` to `train()` API (#2522)
* feat(sdk): add TorchTuneConfig.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* rebase(sdk): rebase on the newest master branch.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add args description in train().

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add description for train() func.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): split train() according to trainer and fine_tuning_config

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): update the launching command and args.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add get_args_using_torchtune_config.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): fix some wrong description in train()

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): update the description of fine_tuning_config.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove numProcPernode in train().

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add TorchTuneConfig in train()

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): add torchtune logic in train()

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat(sdk): add BuiltinTrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat(sdk): add BuiltinTrainer logic.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add description for initializer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): update description of runtime

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): address unresolved merge error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove duplicated fields.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): fix train() description according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add get_trainer_crd_from_custom_trainer

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add get_trainer_crd_from_builtin_trainer and refactor train() API.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): add Loss enum class.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update loss type in KEP-2401

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove BuiltinTrainer in train().

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): add enum type for dtype.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update KEP according to the type update of dtype.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update the type description in the table.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): update dtype validation in utils.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): update dtype override according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-26 12:23:04 +00:00
Shao Wang 4e6199c948
feat(doc): add Runtime API design in KEP-2401. (#2501)
* feat(doc): add Runtime API design in KEP-2401.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update the implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename model to pretrained_model.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update runtime class according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update the runtimes design according to PR #2545

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): update train() API according to PR #2545

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update runtime_ref field.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-25 02:28:56 +00:00
Andrey Velichkevich 4b0c2943bc
feat(sdk): Support MPI-based TrainJobs (#2545)
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove unused consts

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update func args

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-20 18:08:02 +00:00
Shao Wang e7609928f3
fix(doc): Update `train()` API in KEP-2401 (#2536)
* fix(doc): update train API.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update example for train API.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix error in train() API example.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update func type in CustomTrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix error in runtime_ref type.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some typos.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* Update docs/proposals/2401-llm-trainer-v2/README.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>

* Update docs/proposals/2401-llm-trainer-v2/README.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>

* Update docs/proposals/2401-llm-trainer-v2/README.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>

* Update docs/proposals/2401-llm-trainer-v2/README.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Shao Wang <77665902+Electronic-Waste@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-18 02:35:20 +00:00
Andrey Velichkevich b323b44cac
feat(controller): Refactor the Initializer APIs of TrainJob (#2523)
* feat(controller): Refactor the Initializer APIs of TrainJob

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix go unit test

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix integration test

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-17 07:15:51 +00:00
Shao Wang b89ce8491f
KEP-2401: Kubeflow LLM Trainer V2 (#2410)
* doc: add initial doc for KEP-2401.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update motivation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add llm lifecycle picture.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add goals and non-goals.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add proposal chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add multiple frameworks support section in design chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add data preprocess section in design chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add fine-tuning config section in design details chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove all trailing whitespaces.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update llm-trainer-v2-workflow img.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update goals and non-goals.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: remove torchrun proposal to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move torchrun design to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move some fine-tuning config not support by torchtune to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move torchtune sections to proposal and design chapters.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update proposal & move FSDP config to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update fine-tuning config & unify lora/qlora/dora.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update fine-tuning config & fix doc according to comments.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add model & dataset initialization / model exporting.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dataset preprocess/tokenizer chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update chapter name.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add type in the diagrams.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add optimizer and scheduler config.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add initial parameter override.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update config override.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add CustomTrainingConfig dataclass.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): integrate torchtune mutation logic into torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): split torchtune config chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add two options for SDK & separate LoRA chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add an example to show parameters mutation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add detailed design on mutation in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dir structure for option 1.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dir structure for option 2.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add Test Plans chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove device parameter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix code line format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update error in proposal example & add num_nodes and resources_per_node to TorchtuneConfig.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update manifests dir in option 1.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): split complement torch plugin chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): move option 1 (reserving recipe and config) to alternatives & reorganize structures.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update goals & add description in propagate torchtune settings in SDK.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): complete map section in SDK.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add maintaining ClusterTrainingRuntime chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update recipe selection.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove some CTRs & only reserve llama family.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename TorchtuneConfig to TorchTuneConfig.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove name prefix in CTRs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update TrainJob and CTR example.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some typos & address comments.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update complement torch plugin section.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add gemma2 mistral qwen2_5 back.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove the name prefix in CTRs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update typo according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add webhook section.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add webhook func description.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update item format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add the lifecycle of LLM fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove diagram description.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): reorg and update the doc according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some typos.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some format error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename CTRs' file name.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove detailed design.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-03-11 11:29:39 +00:00
Andrey Velichkevich 9ac32413c3
feat(controller): Integrate DependsOn API (#2484)
* feat(controller): Integrate DependsOn API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use go for unit test

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update Makefile

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update Makefile

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix integration test

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix e2e

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Exit 1 if e2e fails

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-07 01:11:34 +00:00
Yuki Iwai 3ec8f0705f
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design (#2439)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-03 21:51:51 +00:00
Anish Asthana e7e35d12f5
Add 'KEP Usage' KEP and template link (#2423)
Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
2025-02-15 00:23:37 +00:00
Shao Wang 62e958fa8c
KEP-2170: Change API Group Name to `trainer.kubeflow.org` (#2413)
* fix(apis): change the group of API to trainer.kubeflow.org.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(manifests): update crds in manifests using make manifests.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: change the apis dir name to trainer.kubeflow.org and update reference.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: execute make generate.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove remaining kubeflow.org dirs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove outdated docs & update models reference.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: rename apis dir to ttrainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: execute make generate.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove outdated docs & update models reference.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): update model reference in code.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update api group in KEP-2170.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-02-05 13:34:37 +00:00
Andrey Velichkevich 0c30f5cd30
KEP-2170: Update V2 KEP with MPI Runtime info (#2345)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-12-16 03:21:08 +00:00
Yuki Iwai 9e46f9d422
KEP-2170: Add the TrainJob state transition design (#2298)
* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-11-02 21:31:14 +00:00
Andrey Velichkevich 2e1e125d7c
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)
* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update manifests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use var for envs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-11-01 21:07:14 +00:00
Yuki Iwai a655a9045b
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-24 19:26:31 +00:00
Yuki Iwai ab6938c864
KEP-2170: Decouple JobSet from TrainJob (#2296)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-23 14:50:30 +00:00
Andrey Velichkevich 6965c1a924
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-10-17 18:29:20 +00:00
Andrey Velichkevich 2cc5dfed46
Update README and out-of-date docs (#2252)
* Update README and out-of-date docs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move KEPs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Revert Jax KEP table

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix readme text

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-09-10 10:18:20 +00:00
Andrey Velichkevich 13c3ee8354
KEP-2170: Update Training V2 APIs in the KEP (#2240)
* KEP-2170: Update Training V2 APIs in the KEP

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update docs/proposals/2170-kubeflow-training-v2/README.md

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update PodSpecOverride API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update managedBy comment

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-30 18:46:33 +00:00
Yuki Iwai 725b09e300
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-09 15:16:38 +00:00
Yuki Iwai 94140ed1d3
KEP-2170: Make API specification more restricting (#2198)
* Fix formatting issues

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make trainingRuntimeRef more clarify

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update the managedBy specifications

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use 'kubeflow.org/trainjob-controller' instead of 'training-operator.kubeflow.org/trainjob-controller'

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* The ClusterTrainingRuntime is used in the runtimeRef as a default

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Move apiVersion for the TrainingRuntime to alternative section

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-08-09 11:58:38 +00:00
Andrey Velichkevich 53341c9e9d
KEP-2170: Kubeflow Training V2 API (#2171)
* KEP-2170: Kubeflow Training V2 API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix some comments

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add user roles diagram

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move diagrams after design

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update diagram

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor Model and Dataset configs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update runtime timelines

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Address readability comments

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Explanation for Trainer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update LLM Fine-Tuning Diagram

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Llama model name

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add goal for integration with Kueue

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add links for Job run policies

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add some alternatives

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix more API types

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix empty number of nodes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Rename to Coscheduling

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Change parameters to env

Add runLauncherAsNode parameter

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update PodSpecOverride with scheduling directives

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix TrainingRuntime field

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Refactor PodGroupSpec APIs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add note about scheduler name

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add initial TrainJob status field

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-08-06 16:56:39 +00:00
Sandipan Panda ee736a76ba
Update JAX integration proposal (#2165)
Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
2024-07-15 16:09:55 +00:00
Sandipan Panda bcba864ee2
JAX Integration Enhancement Proposal (#2125)
Kubeflow Enhancement Proposal: Integrate JAX with Kubeflow Training Operator

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
2024-07-12 10:30:17 +00:00
Andrey Velichkevich 0b6a30cd34
[SDK] Fix Worker and Master templates for PyTorchJob (#1988)
2024-01-16 19:09:19 +00:00
deepanker13 39f8b2202b
Train/Fine-tune API Proposal for LLMs (#1945)
* added train api proposal

* feedback changes

* Update docs/proposals/train_api_proposal.md

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update docs/proposals/train_api_proposal.md

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update docs/proposals/train_api_proposal.md

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* proposal review changes

* Update docs/proposals/train_api_proposal.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* review changes

* name fix

* review changes

* review changes

* review changes

* adding goal/nongoal header

* adding more non goals

* review changes

* adding br tags

* review changes

---------

Co-authored-by: Johnu George <johnu.george@nutanix.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2023-12-05 16:35:06 +00:00