wang-mask
d659ed29c4
Merge edbde8caeb into b1c5c9d060
2025-09-24 21:03:30 +08:00
Yuki Iwai
d164ea463d
Upgrade golangci-lint v1 to v2 ( #714 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-09-02 01:32:01 +00:00
Yuki Iwai
6c4f285eba
Use Kubernetes v1.33 ( #710 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-31 05:22:59 +00:00
Yuki Iwai
c63710108d
Upgrade K8s version to v1.32.7 ( #708 )
...
* Upgrade K8s module version to v1.32
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Upgrade KIND version to v0.29.0
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Introduce scheduler-plugins RBAC workaround
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-25 21:28:19 +00:00
Yuki Iwai
2d901d0db6
Propagate ClusterDomain from server to reconciler ( #707 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-15 00:39:13 +00:00
Yuki Iwai
e02dfe57f4
Support cluster domain for MPI HostFile ( #704 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-14 23:52:13 +00:00
Yuki Iwai
e90c17619a
Enable publishNotReadyAddresses when runLauncherAsWorker is enabled ( #703 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-14 19:00:14 +00:00
GonzaloSaez
535665c0f2
Fix missing ReplicaIndexLabel when using RunLauncherAsWorker ( #690 )
...
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
2025-04-14 20:40:22 +00:00
GonzaloSaez
7f94988ab1
Fix crash in podgroup when runLauncherAsWorker is true ( #669 )
...
* Fix crash in podgroup when runLauncherAsWorker is true
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
* Address comments
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
---------
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
2025-01-16 13:29:34 +00:00
Rotem Elad
cbe4f8aab9
Expose job controller's workqueue rate limiting configs ( #674 )
...
* Expose controller workqueue config via options
Signed-off-by: Rotem Elad <rotem.elad@run.ai>
* Fix double hyphen typo
Signed-off-by: Rotem Elad <rotem.elad@run.ai>
* Generate
Signed-off-by: Rotem Elad <rotem.elad@run.ai>
---------
Signed-off-by: Rotem Elad <rotem.elad@run.ai>
2025-01-13 01:47:07 +00:00
Yuki Iwai
c738a83b18
Fix the 'printf: non-constant format string in call to fmt.Errorf (govet)' lint errors ( #666 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-16 01:42:18 +00:00
Yuki Iwai
5caa9d5029
Reuse the core kubernetes API reason for the BackoffLimitExceeded ( #667 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-16 01:16:18 +00:00
Carlos Eduardo Arango Gutierrez
a869150953
Bump to k8s 1.31 ( #664 )
...
* Bump to k8s 1.31
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Bump sigs.k8s.io/controller-runtime to v0.19.0
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Bump golangci-lint to v1.61.0
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* change queue from RateLimitingInterface to TypedRateLimitingInterface[any]
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Update kubectl url
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
---------
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-10-15 20:25:18 +00:00
Michał Szadkowski
c29c37ca7e
Introduce ManagedBy field in RunPolicy ( #650 )
...
* Introduce ManagedBy field to RunPolicy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make mpi-operator a default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add validation for ManagedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make use of ManagedBy in reconciliation process
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Regenerate code after adding managedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add e2e tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update after code review
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Remove default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add optional tag
Replace backoff and consistently with sleep
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Create common util package for integration and e2e tests with sleep/wait constants
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
---------
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-10 17:16:10 +00:00
Yuki Iwai
8d806df31c
Introduce resource multiplication ( #634 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 19:22:28 +00:00
Yuki Iwai
4d5156d07a
Replace original pointer methods with ptr libs ( #635 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 16:17:28 +00:00
Chitsing KUI
942a20afa6
Fix: no overwrite when run launcher as worker ( #628 )
...
* no overwrite when run launcher as worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* add ut for rm nv env for launcher-as-worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
---------
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-03-04 15:49:08 +00:00
Chitsing KUI
f92b9c7e74
Deprecated pointer, use ptr instead ( #627 )
...
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-02-27 13:28:00 +00:00
Chitsing KUI
a6c2da887d
run worker process in launcher pod ( #612 )
...
* run worker in launcher pod; fix DCO issue
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* use ptr.Deref
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* update manifest
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* more Deref
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* create one service for both launcher and worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
---------
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-02-26 15:17:58 +00:00
Chitsing KUI
a1ff84cbeb
Fix: add ns filter to podLister ( #622 )
2024-02-06 07:46:44 +00:00
emsixteeen
4c9ac06bf1
Fails mpi-operator early if access to list or watch objects is denied ( #619 )
2024-02-05 19:22:17 +00:00
wang-mask
edbde8caeb
fix the condition
...
Signed-off-by: wang-mask <2018091609006@std.uestc.edu.cn>
2024-01-26 19:56:32 +08:00
Yuki Iwai
3c7fad663a
Upgrade K8s dependencies to v0.27.4 ( #584 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-07 22:17:56 +00:00
dragon-fly
e1590ce61e
merge kubeflow/common.v1 to mpi-operator ( #571 )
...
* merge kubeflow/common.v1 to mpi-operator
Signed-off-by: lowang_bh <lhui_wang@163.com>
java gen Python SDK
Signed-off-by: lowang_bh <lhui_wang@163.com>
* update make generate and fix comment issues
Signed-off-by: lowang_bh <lhui_wang@163.com>
* Update pkg/apis/kubeflow/v2beta1/types.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* merge from master to solve conflict
Signed-off-by: lowang-bh <lhui_wang@163.com>
* change reference link to training-operator project
Signed-off-by: lowang-bh <lhui_wang@163.com>
---------
Signed-off-by: lowang_bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-07-08 19:52:53 +00:00
xhejtman
f8d815cdf4
Run workers first and wait for them ( #484 )
...
* Real rebase of waitforworkers option
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix generated API
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix format
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add docs
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix typo
* Add tests for waitforworkers
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add missing err test
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix cleanpodpolicy
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Remove debug
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix tests
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Rework api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix generated api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* One more fix of api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Swagger fix
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix readme
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix readme again
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add comments
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add kubebuilder annotations
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix manifests
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
---------
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
2023-06-26 18:37:14 +00:00
Mateusz Kubica
21f326d1d2
MPICH support ( #562 )
...
* Add support for MPICH
* Fix CI errors
* Temporary: manual trigger
* Fix file name
* Add an empty line at the end of the file
* Fix formatting
* Revert "Temporary: manual trigger"
This reverts commit 15164a8b70 .
* fix formatting
* Regenerate the mpi-operator.yaml
* Adding an empty line at the end of Dockerfiles
* Share the same entrypoint for Intel and MPICH
* share hostfile generation between Intel and MPICH
* Add validation test for MPICH
* Fix formatting
* Don't over engineer the tests - be explicit
* add non-root tests for IntelMPI and MPICH
2023-06-16 17:57:36 +00:00
dragon-fly
caa1112993
add volcano gang-scheduler pg min resource calculation ( #566 )
...
* add volcano gang-scheduler pg min resource calculation
Signed-off-by: lowang_bh <lhui_wang@163.com>
* use priorityclass lister
Signed-off-by: lowang_bh <lhui_wang@163.com>
* Update pkg/controller/podgroup.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: lowang_bh <lhui_wang@163.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-06-16 15:04:37 +00:00
Yuki Iwai
fda0532ba1
Fix a bug that the PodGroupCtrl can not list priorityclass ( #561 )
...
* Fix a potential null pointer error
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix a bug that the PodGroupCtrl can not list priorityclass
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Refactoring setups for gang-scheduling
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-06-08 17:11:59 +00:00
Yuki Iwai
03bba1ff48
Fix the logic to calculate minResources ( #543 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-04-04 20:17:02 +00:00
Yuki Iwai
5583ba9f7d
Update podgroups once schedulingPolicy of MPIJobs is changed ( #542 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-31 16:23:58 +00:00
Yuki Iwai
2495860427
Support the coscheduling plugin of scheduler-plugins ( #538 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-29 02:27:12 +00:00
Yuki Iwai
b302019be7
Respect SchedulingPolicy ( #520 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-28 15:37:42 +00:00
Yuki Iwai
c21942d1e2
Add slots to hostfile ( #523 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-10 19:10:02 +00:00
Yuki Iwai
0fac25de60
Remove duplicated imports ( #524 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-10 18:21:02 +00:00
Yuki Iwai
efbba01f8d
Clean up manifests ( #510 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-08 17:52:33 +00:00
Michał Woźniak
92e491e6e9
Support suspend semantics for MPIJob ( #511 )
...
* Implement Suspend semantics for MPIJob
# Conflicts:
# pkg/apis/kubeflow/v2beta1/types.go
# pkg/controller/mpi_job_controller.go
# pkg/controller/mpi_job_controller_status.go
# pkg/controller/mpi_job_controller_test.go
# test/integration/mpi_job_controller_test.go
* Changes
- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after worker pods cleanup on
suspend - this allows the MPIJob update to continue in the same sync
# Conflicts:
# pkg/controller/mpi_job_controller.go
2023-02-03 15:44:02 +00:00
Yuki Iwai
4c8b4fc2e4
Use local copy of JobStatus by mpi-operator ( #514 )
...
* Use local copy of JobStatus by mpi-operator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* address comments
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-03 14:44:01 +00:00
Michał Woźniak
0b32af39c3
Use local copy of RunPolicy by MPI-operator ( #513 )
...
* Use local copy of RunPolicy by MPI-operator
Steps performed:
- copy the `RunPolicy` from common to `types.go`
- fix compilation errors by using the local RunPolicy definition
- run `make generate`
- run `make all`
- regenerate openapi_generated.go by `./hack/python-sdk/gen-sdk.sh` (with commented out rollback)
* Copy SchedulingPolicy and CleanPodPolicy for RunPolicy
2023-01-31 17:46:30 +00:00
Yuki Iwai
cd83424f65
Rename Go module name to 'github.com/kubeflow/mpi-operator' ( #506 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-25 16:28:53 +00:00
Yuki Iwai
dc36350d99
Move mpi-operator v2 to the top of the repository ( #496 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-01-11 17:03:15 +00:00