Commit Graph

8 Commits

Author SHA1 Message Date
Yuki Iwai 998eaff199
Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-24 11:05:06 +00:00
Yuki Iwai d94538afc8
Construct Trainer based on trainer.kubeflow.org/trainjob-ancestor-step label (#2548)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-20 14:01:57 +00:00
Yuki Iwai 356aebe8a7
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-18 15:55:44 +00:00
Yuki Iwai 532e737ffc
Migrate InfoOptions.podSpecReplias and info.Scheduler.TotalRequests to info.TemplateSpec.PodSet (#2524)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

# Conflicts:
#	pkg/runtime/core/trainingruntime.go
#	pkg/runtime/runtime.go
2025-03-16 17:09:49 +00:00
Akshay Chitneni 250c1166f8
KEP-2170: Adding validation webhook for v2 trainjob (#2307)
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
2025-03-16 03:41:51 +00:00
Yuki Iwai f64bdf2cc6
Implement MPI Plugin for OpenMPI (#2493)
* Implement MPI Plugin for OpenMPI

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Directly pass the JobSetApplyconfiguration to RuntimeInfo

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make repeated string as constants

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use numNodes=1 as default mpi_distributed ClusterTrainingRuntime

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove unused errors

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename runLauncherAsWorker to runLauncherAsNode

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix unintended constants usage for ModelMountPath

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename SecretDataComparer to MPISecretDataComparer

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Add TODO statement to deprecated env wrappers.

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-03-13 05:00:56 +00:00
Antonin Stefanutti 3c6c90f231
KEP-2170: Use SSA to reconcile TrainJob components (#2431)
* KEP-2170: Use SSA to reconcile TrainJob components

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Enable Unstructured caching in controller manager config

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix PodGroup apply configuration

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* API to apply config conversion util functions

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Only add namespace to TrainingRuntime object key

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix EnvVar upsert

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update unit tests

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix JobSet resource requirements

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Resolve build issues with launcher job

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use apply config for MPI ConfigMap and Secret

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* ComponentBuilderPlugin now returns an array

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use plain apply configurations instead of unstructured

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use apply config in EnforceMLPolicy plugins

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Do not update JobSets that are not suspended

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Address review feedback

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Do not update PodGroup if TrainJob is not suspended

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Remove obsolete TODO

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
2025-02-26 08:39:09 +00:00
Andrey Velichkevich 3060332931
Update the naming conventions for Kubeflow Trainer (#2415)
* Update the naming conventions for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix webhooks

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix paths for webhooks

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update go test cmd

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Rename kubeflowv1 to trainer pkg

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-02-06 13:48:30 +00:00