Yuki Iwai
|
998eaff199
|
Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
|
2025-03-24 11:05:06 +00:00 |
Yuki Iwai
|
d94538afc8
|
Construct Trainer based on trainer.kubeflow.org/trainjob-ancestor-step label (#2548)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
|
2025-03-20 14:01:57 +00:00 |
Yuki Iwai
|
356aebe8a7
|
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
|
2025-03-18 15:55:44 +00:00 |
Yuki Iwai
|
532e737ffc
|
Migrate InfoOptions.podSpecReplias and info.Scheduler.TotalRequests to info.TemplateSpec.PodSet (#2524)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
# Conflicts:
# pkg/runtime/core/trainingruntime.go
# pkg/runtime/runtime.go
|
2025-03-16 17:09:49 +00:00 |
Akshay Chitneni
|
250c1166f8
|
KEP-2170: Adding validation webhook for v2 trainjob (#2307)
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
|
2025-03-16 03:41:51 +00:00 |
Yuki Iwai
|
f64bdf2cc6
|
Implement MPI Plugin for OpenMPI (#2493)
* Implement MPI Plugin for OpenMPI
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Directly pass the JobSetApplyconfiguration to RuntimeInfo
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Make repeated string as constants
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use numNodes=1 as default mpi_distributed ClusterTrainingRuntime
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Remove unused errors
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename runLauncherAsWorker with runLauncherAsNode
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix unintended constants usage for ModelMountPath
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename SecretDataComparer with MPISecretDataComparer
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add TODO STATEMENT to deprecated env wrappers.
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
|
2025-03-13 05:00:56 +00:00 |
Antonin Stefanutti
|
3c6c90f231
|
KEP-2170: Use SSA to reconcile TrainJob components (#2431)
* KEP-2170: Use SSA to reconcile TrainJob components
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Enable Unstructured caching in controller manager config
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Fix PodGroup apply configuration
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* API to apply config conversion util functions
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Only add namespace to TrainingRuntime object key
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Fix EnvVar upsert
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Update unit tests
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Fix JobSet resource requirements
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Resolve build issues with launcher job
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Use apply config for MPI ConfigMap and Secret
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* ComponentBuilderPlugin now returns an array
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Use plain apply configurations instead of unstructured
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Use apply config in EnforceMLPolicy plugins
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Do not update JobSets that are not suspended
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Address review feedback
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Do not update PodGroup if TrainJob is not suspended
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Remove obsolete TODO
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
---------
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
|
2025-02-26 08:39:09 +00:00 |
Andrey Velichkevich
|
3060332931
|
Update the naming conventions for Kubeflow Trainer (#2415)
* Update the naming conventions for Kubeflow Trainer
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix webhooks
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Fix paths for webhooks
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update go test cmd
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Rename kubeflowv1 to trainer pkg
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
---------
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
2025-02-06 13:48:30 +00:00 |