61 Commits

Author SHA1 Message Date
Yuki Iwai d164ea463d
Upgrade golangci-lint v1 to v2 (#714)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-09-02 01:32:01 +00:00
Yuki Iwai 6c4f285eba
Use Kubernetes v1.33 (#710)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-31 05:22:59 +00:00
Yuki Iwai c63710108d
Upgrade K8s version to v1.32.7 (#708)
* Upgrade K8s module version to v1.32

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Upgrade KIND version to v0.29.0

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Introduce scheduler-plugins RBAC workaround

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-25 21:28:19 +00:00
GonzaloSaez c50eb45b18
Fix E2E Intel MPI integ tests (#676)
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
2025-01-10 18:55:37 +00:00
Carlos Eduardo Arango Gutierrez a869150953
Bump to k8s 1.31 (#664)
* Bump to k8s 1.31

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Bump sigs.k8s.io/controller-runtime to v0.19.0

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Bump golangci-lint to v1.61.0

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* change queue from RateLimitingInterface to TypedRateLimitingInterface[any]

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Update kubectl url

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

---------

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-10-15 20:25:18 +00:00
Yuki Iwai 10f9e20b89
Upgrade the k8s dependency versions to 1.30 (#657)
* Upgrade the k8s dependency versions to 1.30

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Generate code

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update testing version

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-12 02:24:10 +00:00
Michał Szadkowski 1794cc0d44
Adjust the comment for managedBy (#656)
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-11 09:49:10 +00:00
Michał Szadkowski c29c37ca7e
Introduce ManagedBy field in RunPolicy (#650)
* Introduce ManagedBy field to RunPolicy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Make mpi-operator a default value for ManagedBy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add validation for ManagedBy field

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Make use of ManagedBy in reconciliation process

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Regenerate code after adding managedBy field

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add e2e tests

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Update after code review

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Update tests

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Remove default value for ManagedBy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add optional tag
Replace backoff and consistently with sleep

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Create common util package for integration and e2e tests with sleep/wait constants

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

---------

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-10 17:16:10 +00:00
Yuki Iwai ae7c738d43
Upgrade K8s dependencies to v1.29 (#633)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 18:51:27 +00:00
Yuki Iwai 4d5156d07a
Replace original pointer methods with ptr libs (#635)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 16:17:28 +00:00
Chitsing KUI a6c2da887d
run worker process in launcher pod (#612)
* run worker in launcher pod; fix DCO issue

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* use ptr.Deref

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* update manifest

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* more Deref

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* create one service for both launcher and worker

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

---------

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-02-26 15:17:58 +00:00
Chitsing KUI 23c802b151
update auto gen file year to verify generate (#623)
2024-02-05 15:23:17 +00:00
dragon-fly e1590ce61e
merge kubeflow/common.v1 to mpi-operator (#571)
* merge kubeflow/common.v1 to mpi-operator

Signed-off-by: lowang_bh <lhui_wang@163.com>

java gen Python SDK

Signed-off-by: lowang_bh <lhui_wang@163.com>

* update make generate and fix comment issues

Signed-off-by: lowang_bh <lhui_wang@163.com>

* Update pkg/apis/kubeflow/v2beta1/types.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* merge from master to solve conflict

Signed-off-by: lowang-bh <lhui_wang@163.com>

* change reference link to training-operator project

Signed-off-by: lowang-bh <lhui_wang@163.com>

---------

Signed-off-by: lowang_bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-07-08 19:52:53 +00:00
xhejtman f8d815cdf4
Run workers first and wait for them (#484)
* Real rebase of waitforworkers option

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix generated API

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix format

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add docs

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix typo

* Add tests for waitforworkers

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add missing err test

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix cleanpodpolicy

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Remove debug

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix tests

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Rework api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix generated api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* One more fix of api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Swagger fix

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix readme

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix readme again

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add comments

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add kubebuilder annotations

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix manifests

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

---------

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
2023-06-26 18:37:14 +00:00
Mateusz Kubica 21f326d1d2
MPICH support (#562)
* Add support for MPICH

* Fix CI errors

* Temporary: manual trigger

* Fix file name

* Add an empty line at the end of the file

* Fix formatting

* Revert "Temporary: manual trigger"

This reverts commit 15164a8b70.

* fix formatting

* Regenerate the mpi-operator.yaml

* Adding an empty line at the end of Dockerfiles

* Share the same entrypoint for Intel and MPICH

* share hostfile generation between Intel and MPICH

* Add validation test for MPICH

* Fix formatting

* Don't over engineer the tests - be explicit

* add non-root tests for IntelMPI and MPICH
2023-06-16 17:57:36 +00:00
Yuki Iwai ccf2756f74
Commonize function newCleanPodPolicy() (#557)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-14 13:04:30 +00:00
Yuki Iwai 2495860427
Support the coscheduling plugin of scheduler-plugins (#538)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-29 02:27:12 +00:00
Yuki Iwai b302019be7
Respect SchedulingPolicy (#520)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-28 15:37:42 +00:00
Yuki Iwai efbba01f8d
Clean up manifests (#510)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-08 17:52:33 +00:00
Yuki Iwai 5f1914bfb2
Validate MPIJob name with the DNS 1035 label (#517)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-04 00:51:58 +00:00
Michał Woźniak 92e491e6e9
Support suspend semantics for MPIJob (#511)
* Implement Suspend semantics for MPIJob

# Conflicts:
#	pkg/apis/kubeflow/v2beta1/types.go
#	pkg/controller/mpi_job_controller.go
#	pkg/controller/mpi_job_controller_status.go
#	pkg/controller/mpi_job_controller_test.go
#	test/integration/mpi_job_controller_test.go

* Changes
- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after worker pods cleanup on
suspend; this allows the MPIJob update to continue in the same sync

# Conflicts:
#	pkg/controller/mpi_job_controller.go
2023-02-03 15:44:02 +00:00
Yuki Iwai 4c8b4fc2e4
Use local copy of JobStatus by mpi-operator (#514)
* Use local copy of JobStatus by mpi-operator

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* address comments

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-03 14:44:01 +00:00
Michał Woźniak 0b32af39c3
Use local copy of RunPolicy by MPI-operator (#513)
* Use local copy of RunPolicy by MPI-operator

Steps performed:
- copy the `RunPolicy` from common to `types.go`
- fix compilation errors by using the local RunPolicy definition
- run `make generate`
- run `make all`
- regenerate openapi_generated.go by `./hack/python-sdk/gen-sdk.sh` (with commented out rollback)

* Copy SchedulingPolicy and CleanPodPolicy for RunPolicy
2023-01-31 17:46:30 +00:00
Yuki Iwai 05ac6addc0
Upgrade Kubernetes dependencies (#502)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-26 18:13:09 +00:00
Yuki Iwai cd83424f65
Rename Go module name to 'github.com/kubeflow/mpi-operator' (#506)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-25 16:28:53 +00:00
Yuki Iwai dc36350d99
Move mpi-operator v2 to the top of the repository (#496)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-01-11 17:03:15 +00:00
Yuki Iwai c131315192
Remove MPI Operator V1 (#492)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 13:40:56 +00:00
Gang Pu b88edad03a
Generate sdk for v2 (#434)
* Generate sdk for v2

* Refine the version parameters of sdk generator

* add example for v2beta1

* make runPolicy optional

* 1: Ignore some generated files that are not needed
2: Add gitattributes file
2021-11-30 12:44:30 +00:00
Gang Pu 285cb98d59
Bump dependency versions to align with v2 (#441)
2021-11-22 14:33:37 +00:00
Gang Pu a334c4c2b8
Remove v1alpha1 and v1alpha2 apis/controllers (#438)
2021-11-22 04:50:36 +00:00
Wang Zhang 680cd4db0f
Add python sdk and auto-generate script (#357)
2021-05-13 20:20:43 -04:00
Naveen 19173091b0
Go fmt changes that caused the git tree to be dirty (#302)
These go fmt changes caused the git tree to be dirty.
2020-12-15 15:20:20 -08:00
Tim Deng 07bbb45de9
add support for using Intel MPI(2019.7) and MVAPICH2 (#283)
* + support for IntelMPI and MPICH
+ local minikube test pass
+ add new Spec "mpiDistribution"
@ 2020/7/27

* * fix ineffectual assignment
* change email address

* * update variable name

* * fix some spelling and naming problems

* + add more notes

* + auto filter prefix parameters

* * fix some spelling problem
* update notes about hostfile generating

* + mpich-format hostfile split

* + generate hosts for hydra to resolve hostname

* * update notes

* * fix sh script
+ move hosts sending and merging here
* use special type instead of string

* * check return value

* * update options' name

* + add unit test for generateHosts

* ^ fixed lint reported errors
2020-08-03 04:27:40 -07:00
Lei Xue 445cb4887d
Convert launcher job and statefulset worker to pod (#203)
* convert job and statefulset to pod

* fix issues

* remove duplicate yaml for v1

* modify unit test

* fix Dockerfile multiple version issue
2020-05-05 07:49:10 -07:00
Abhilash Pallerlamudi ab8518d375
use volcano scheduler (#242)
* use volcano scheduler

Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>

* Trigger CI
2020-04-29 07:08:25 -07:00
Yuan Tang 94a21577df
Update to modify v1 API instead (#221)
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-04-18 20:29:36 -07:00
Yuan Tang da48dfba6c
Add initial v1 controller and APIs (#225)
* Initial v1 skeleton

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Add v1 pkg/client

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* v1alpha2 -> v1

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* A couple fixes

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Fix

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Fix lint issues

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Add v1 deployment yaml

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Fix versions

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-04-16 06:46:46 -07:00
Lei Xue 9838cd4dea
add openapi_generated code (#211)
2020-04-09 05:33:44 -07:00
Abhilash Pallerlamudi 8c8d0d3002
update common api to latest (#208)
Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>
2020-03-26 09:48:27 -07:00
Yuan Tang 4e73e5ec38
Add RunPolicy to MPIJobSpec that reuses kubeflow/common spec (#178)
* Reuse RunPolicy from kubeflow/common spec

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Update codegen

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

* Temporary fix

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-01-28 13:23:40 -08:00
Wei Yan 5253329e1f
Expose main container name as a configurable field (#174)
* Expose main container name as a configurable field

* Move main container config to job level

* remove unnecessary configs
2020-01-23 09:21:26 -05:00
Yuan Tang b58a10d99c
Update codegen to fix Travis CI (#164)
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
2020-01-13 11:57:38 -08:00
Ce Gao 815f720982
fix: Avoid nil pointer exception (#148)
Signed-off-by: Ce Gao <gaoce@caicloud.io>
2019-10-10 02:30:53 -07:00
Abhilash Pallerlamudi d5d605c0c6
Move JobStatus to common apis (#139)
Signed-off-by: Abhilash Pallerlamudi <stp.abhi@gmail.com>
2019-08-19 11:41:32 -07:00
XsWack a598596b6b
add more CI check (#118)
* add more CI check

* add vendor

* fix verify codegen

* fix CI error

bugfix
2019-06-26 05:55:17 -07:00
Fei Xu a656d97708
When MpiJob finished, delete podgroup and set worker count to 0 (#112)
* add delete Podgroup

* switch kube-batch v1alpha2 to v1alpha1
2019-06-11 07:32:23 -07:00
Fei Xu d8025e9c38
delete launcherOnMaster field (#116)
2019-06-11 07:06:22 -07:00
Fei Xu 4e7701d552
add default-gen (#114)
2019-06-10 06:57:11 -07:00
Fei Xu b619d48e80
Add leader election (#110)
* add leader elector

* run dep ensure
2019-05-29 20:20:22 -07:00
zhujl1991 6c50d22631
fix_lint (#107)
2019-04-16 05:45:08 -07:00