Yuki Iwai
d164ea463d
Upgrade golangci-lint v1 to v2 ( #714 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-09-02 01:32:01 +00:00
Yuki Iwai
6c4f285eba
Use Kubernetes v1.33 ( #710 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-31 05:22:59 +00:00
Yuki Iwai
c63710108d
Upgrade K8s version to v1.32.7 ( #708 )
...
* Upgrade K8s module version to v1.32
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Upgrade KIND version to v0.29.0
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Introduce scheduler-plugins RBAC workaround
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-25 21:28:19 +00:00
GonzaloSaez
c50eb45b18
Fix E2E Intel MPI integ tests ( #676 )
...
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
2025-01-10 18:55:37 +00:00
Carlos Eduardo Arango Gutierrez
a869150953
Bump to k8s 1.31 ( #664 )
...
* Bump to k8s 1.31
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Bump sigs.k8s.io/controller-runtime to v0.19.0
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Bump golangci-lint to v1.61.0
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* change queue from RateLimitingInterface to TypedRateLimitingInterface[any]
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* Update kubectl url
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
---------
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-10-15 20:25:18 +00:00
Yuki Iwai
10f9e20b89
Upgrade the k8s dependency versions to 1.30 ( #657 )
...
* Upgrade the k8s dependency versions to 1.30
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Generate codes
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update testing version
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-12 02:24:10 +00:00
Michał Szadkowski
1794cc0d44
Adjust the comment for managedBy ( #656 )
...
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-11 09:49:10 +00:00
Michał Szadkowski
c29c37ca7e
Introduce ManagedBy field in RunPolicy ( #650 )
...
* Introduce ManagedBy field to RunPolicy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make mpi-operator a default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add validation for ManagedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make use of ManagedBy in reconciliation process
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Regenerate code after adding managedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add e2e tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update after code review
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Remove default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add optional tag
Replace backoff and consistently with sleep
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Create common util package for integration and e2e tests with sleep/wait constants
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
---------
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-10 17:16:10 +00:00
Yuki Iwai
ae7c738d43
Upgrade K8s dependencies to v1.29 ( #633 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 18:51:27 +00:00
Yuki Iwai
4d5156d07a
Replace original pointer methods with ptr libs ( #635 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 16:17:28 +00:00
Chitsing KUI
a6c2da887d
run worker process in launcher pod ( #612 )
...
* run worker in launcher pod; fix DCO issue
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* use ptr.Deref
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* update manifest
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* more Deref
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* create one service for both launcher and worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
---------
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-02-26 15:17:58 +00:00
Chitsing KUI
23c802b151
update auto gen file year to verify generate ( #623 )
2024-02-05 15:23:17 +00:00
dragon-fly
e1590ce61e
merge kubeflow/common.v1 to mpi-operator ( #571 )
...
* merge kubeflow/common.v1 to mpi-operator
Signed-off-by: lowang_bh <lhui_wang@163.com>
java gen Python SDK
Signed-off-by: lowang_bh <lhui_wang@163.com>
* update make generate and fix comment issues
Signed-off-by: lowang_bh <lhui_wang@163.com>
* Update pkg/apis/kubeflow/v2beta1/types.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* merge from master to solve conflict
Signed-off-by: lowang-bh <lhui_wang@163.com>
* change reference link to training-operator project
Signed-off-by: lowang-bh <lhui_wang@163.com>
---------
Signed-off-by: lowang_bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-07-08 19:52:53 +00:00
xhejtman
f8d815cdf4
Run workers first and wait for them ( #484 )
...
* Real rebase of waitforworkers option
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix generated API
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix format
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add docs
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix typo
* Add tests for waitforworkers
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add missing err test
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix cleanpodpolicy
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Remove debug
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix tests
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Rework api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix generated api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* One more fix of api
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Swagger fix
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix readme
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix readme again
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add comments
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Add kubebuilder annotations
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
* Fix manifests
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
---------
Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
2023-06-26 18:37:14 +00:00
Mateusz Kubica
21f326d1d2
MPICH support ( #562 )
...
* Add support for MPICH
* Fix CI errors
* Temporary: manual trigger
* Fix file name
* Add an empty line at the end of the file
* Fix formatting
* Revert "Temporary: manual trigger"
This reverts commit 15164a8b70.
* fix formatting
* Regenerate the mpi-operator.yaml
* Adding an empty line at the end of Dockerfiles
* Share the same entrypoint for Intel and MPICH
* share hostfile generation between Intel and MPICH
* Add validation test for MPICH
* Fix formatting
* Don't over engineer the tests - be explicit
* add non-root tests for IntelMPI and MPICH
2023-06-16 17:57:36 +00:00
Yuki Iwai
ccf2756f74
Commonize function newCleanPodPolicy() ( #557 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-14 13:04:30 +00:00
Yuki Iwai
2495860427
Support the coscheduling plugin of scheduler-plugins ( #538 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-29 02:27:12 +00:00
Yuki Iwai
b302019be7
Respect SchedulingPolicy ( #520 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-28 15:37:42 +00:00
Yuki Iwai
efbba01f8d
Clean up manifests ( #510 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-08 17:52:33 +00:00
Yuki Iwai
5f1914bfb2
Validate MPIJob name with the DNS 1035 label ( #517 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-04 00:51:58 +00:00
Michał Woźniak
92e491e6e9
Support suspend semantics for MPIJob ( #511 )
...
* Implement Suspend semantics for MPIJob
# Conflicts:
# pkg/apis/kubeflow/v2beta1/types.go
# pkg/controller/mpi_job_controller.go
# pkg/controller/mpi_job_controller_status.go
# pkg/controller/mpi_job_controller_test.go
# test/integration/mpi_job_controller_test.go
* Changes
- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after worker pods cleanup on suspend; this allows the MPIJob update to continue in the same sync
# Conflicts:
# pkg/controller/mpi_job_controller.go
2023-02-03 15:44:02 +00:00
Yuki Iwai
4c8b4fc2e4
Use local copy of JobStatus by mpi-operator ( #514 )
...
* Use local copy of JobStatus by mpi-operator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* address comments
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-03 14:44:01 +00:00
Michał Woźniak
0b32af39c3
Use local copy of RunPolicy by MPI-operator ( #513 )
...
* Use local copy of RunPolicy by MPI-operator
Steps performed:
- copy the `RunPolicy` from common to `types.go`
- fix compilation errors by using the local RunPolicy definition
- run `make generate`
- run `make all`
- regenerate openapi_generated.go by `./hack/python-sdk/gen-sdk.sh` (with commented out rollback)
* Copy SchedulingPolicy and CleanPodPolicy for RunPolicy
2023-01-31 17:46:30 +00:00
Yuki Iwai
05ac6addc0
Upgrade Kubernetes dependencies ( #502 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-26 18:13:09 +00:00
Yuki Iwai
cd83424f65
Rename Go module name to 'github.com/kubeflow/mpi-operator' ( #506 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-25 16:28:53 +00:00
Yuki Iwai
dc36350d99
Move mpi-operator v2 to the top of the repository ( #496 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-01-11 17:03:15 +00:00
Yuki Iwai
c131315192
Remove MPI Operator V1 ( #492 )
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 13:40:56 +00:00
Gang Pu
b88edad03a
Generate sdk for v2 ( #434 )
...
* Generate sdk for v2
* Refine the version parameters of sdk generator
* add example for v2beta1
* make runPolicy optional
* 1: Ignore some generated files that are not needed
2: Add gitattributes file
2021-11-30 12:44:30 +00:00
Gang Pu
285cb98d59
Bump dependency versions to align with v2 ( #441 )
2021-11-22 14:33:37 +00:00
Gang Pu
a334c4c2b8
Remove v1alpha1 and v1alpha2 apis/controllers ( #438 )
2021-11-22 04:50:36 +00:00
Wang Zhang
680cd4db0f
Add python sdk and auto-generate script ( #357 )
2021-05-13 20:20:43 -04:00
Naveen
19173091b0
Go fmt changes that caused the git tree to be dirty ( #302 )
...
These go fmt changes caused the git tree to be dirty.
2020-12-15 15:20:20 -08:00
Tim Deng
07bbb45de9
add support for using Intel MPI(2019.7) and MVAPICH2 ( #283 )
...
* + support for IntelMPI and MPICH
+ local minikube test pass
+ add new Spec "mpiDistribution"
@ 2020/7/27
* * fix ineffectual assignment
* change email address
* * update variable name
* * fix some spelling and naming problems
* + add more notes
* + auto filter prefix parameters
* * fix some spelling problem
* update notes about hostfile generating
* + mpich-format hostfile split
* + generate hosts for hydra to resolve hostname
* * update notes
* * fix sh script
+ move hosts sending and merging here
* use special type instead of string
* * check return value
* * update options' name
* + add unit test for generateHosts
* ^ fixed lint reported errors
2020-08-03 04:27:40 -07:00
Lei Xue
445cb4887d
Convert launcher job and statefulset worker to pod ( #203 )
...
* convert job and statefulset to pod
* fix issues
* remove duplicate yaml for v1
* modify unit test
* fix Dockerfile multiple version issue
2020-05-05 07:49:10 -07:00
Abhilash Pallerlamudi
ab8518d375
use volcano scheduler ( #242 )
...
* use volcano scheduler
Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>
* Trigger CI
2020-04-29 07:08:25 -07:00
Yuan Tang
94a21577df
Update to modify v1 API instead ( #221 )
...
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-04-18 20:29:36 -07:00
Yuan Tang
da48dfba6c
Add initial v1 controller and APIs ( #225 )
...
* Initial v1 skeleton
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Add v1 pkg/client
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* v1alpha2 -> v1
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* A couple fixes
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Fix
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Fix lint issues
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Add v1 deployment yaml
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Fix versions
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-04-16 06:46:46 -07:00
Lei Xue
9838cd4dea
add openapi_generated code ( #211 )
2020-04-09 05:33:44 -07:00
Abhilash Pallerlamudi
8c8d0d3002
update common api to latest ( #208 )
...
Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>
2020-03-26 09:48:27 -07:00
Yuan Tang
4e73e5ec38
Add RunPolicy to MPIJobSpec that reuses kubeflow/common spec ( #178 )
...
* Reuse RunPolicy from kubeflow/common spec
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Update codegen
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
* Temporary fix
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-01-28 13:23:40 -08:00
Wei Yan
5253329e1f
Expose main container name as a configurable field ( #174 )
...
* Expose main container name as a configurable field
* Move main container config to job level
* remove unnecessary configs
2020-01-23 09:21:26 -05:00
Yuan Tang
b58a10d99c
Update codegen to fix Travis CI ( #164 )
...
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-01-13 11:57:38 -08:00
Ce Gao
815f720982
fix: Avoid nil pointer exception ( #148 )
...
Signed-off-by: Ce Gao <gaoce@caicloud.io>
2019-10-10 02:30:53 -07:00
Abhilash Pallerlamudi
d5d605c0c6
Move JobStatus to common apis ( #139 )
...
Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>
2019-08-19 11:41:32 -07:00
XsWack
a598596b6b
add more CI check ( #118 )
...
* add more CI check
* add vendor
* fix verify codegen
* fix CI error
bugfix
2019-06-26 05:55:17 -07:00
Fei Xu
a656d97708
When MpiJob finished, delete podgroup and set worker count to 0 ( #112 )
...
* add delete Podgroup
* switch kube-batch v1alpha2 to v1alpha1
2019-06-11 07:32:23 -07:00
Fei Xu
d8025e9c38
delete launcherOnMaster field ( #116 )
2019-06-11 07:06:22 -07:00
Fei Xu
4e7701d552
add default-gen ( #114 )
2019-06-10 06:57:11 -07:00
Fei Xu
b619d48e80
Add leader election ( #110 )
...
* add leader elector
* run dep ensure
2019-05-29 20:20:22 -07:00
zhujl1991
6c50d22631
fix_lint ( #107 )
2019-04-16 05:45:08 -07:00