Commit Graph

34 Commits

Author SHA1 Message Date
Yuki Iwai 6c4f285eba
Use Kubernetes v1.33 (#710)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-31 05:22:59 +00:00
Yuki Iwai c63710108d
Upgrade K8s version to v1.32.7 (#708)
* Upgrade K8s module version to v1.32

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Upgrade KIND version to v0.29.0

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Introduce scheduler-plugins RBAC workaround

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-08-25 21:28:19 +00:00
Vikas Saxena 82c9ad303d
New fix kustomize5 warnings (#700)
* fixed kustomize warnings in base

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed kustomize warnings in standalone

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed kustomize warnings in kubeflow

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

---------

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>
2025-05-13 08:48:21 +00:00
Carlos Eduardo Arango Gutierrez a869150953
Bump to k8s 1.31 (#664)
* Bump to k8s 1.31

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Bump sigs.k8s.io/controller-runtime to v0.19.0

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Bump golangci-lint to v1.61.0

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* change queue from RateLimitingInterface to TypedRateLimitingInterface[any]

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Update kubectl url

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

---------

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-10-15 20:25:18 +00:00
Yuki Iwai 10f9e20b89
Upgrade the k8s dependency versions to 1.30 (#657)
* Upgrade the k8s dependency versions to 1.30

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Generate codes

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update testing version

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-10-12 02:24:10 +00:00
Michał Szadkowski 1794cc0d44
Adjust the comment for managedBy (#656)
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-11 09:49:10 +00:00
Michał Szadkowski c29c37ca7e
Introduce ManagedBy field in RunPolicy (#650)
* Introduce ManagedBy field to RunPolicy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Make mpi-operator a default value for ManagedBy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add validation for ManagedBy field

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Make use of ManagedBy in reconciliation process

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Regenerate code after adding managedBy field

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add e2e tests

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Update after code review

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Update tests

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Remove default value for ManagedBy

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Add optional tag
Replace backoff and consistently with sleep

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

* Create common util package for integration and e2e tests with sleep/wait constants

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>

---------

Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
2024-10-10 17:16:10 +00:00
Yuki Iwai ae7c738d43
Upgrade K8s dependencies to v1.29 (#633)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-17 18:51:27 +00:00
Yuki Iwai a1453cf9a6
Upgrade golang and controller-gen (#637)
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Co-authored-by: Aldo Culquicondor <acondor@google.com>
2024-04-17 14:24:27 +00:00
Chitsing KUI a6c2da887d
run worker process in launcher pod (#612)
* run worker in launcher pod; fix DCO issue

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* use ptr.Deref

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* update manifest

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* more Deref

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

* create one service for both launcher and worker

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>

---------

Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
2024-02-26 15:17:58 +00:00
Yuki Iwai 3c7fad663a
Upgrade K8s dependencies to v0.27.4 (#584)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-07 22:17:56 +00:00
xhejtman f8d815cdf4
Run workers first and wait for them (#484)
* Real rebase of waitforworkers option

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix generated API

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix format

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add docs

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix typo

* Add tests for waitforworkers

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add missing err test

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix cleanpodpolicy

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Remove debug

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix tests

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Rework api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix generated api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* One more fix of api

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Swagger fix

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix readme

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix readme again

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add comments

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Add kubebuilder annotations

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

* Fix manifests

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>

---------

Signed-off-by: Lukas Hejtmanek <xhejtman@gmail.com>
2023-06-26 18:37:14 +00:00
Mateusz Kubica 21f326d1d2
MPICH support (#562)
* Add support for MPICH

* Fix CI errors

* Temporary: manual trigger

* Fix file name

* Add an empty line at the end of the file

* Fix formatting

* Revert "Temporary: manual trigger"

This reverts commit 15164a8b70.

* fix formatting

* Regenerate the mpi-operator.yaml

* Adding an empty line at the end of Dockerfiles

* Share the same entrypoint for Intel and MPICH

* share hostfile generation between Intel and MPICH

* Add validation test for MPICH

* Fix formatting

* Don't over engineer the tests - be explicit

* add non-root tests for IntelMPI and MPICH
2023-06-16 17:57:36 +00:00
Yuki Iwai 2da8e05048
Bumping controller-gen to fix unknown field error (#559)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-06-01 18:30:08 +00:00
Yuki Iwai 2495860427
Support the coscheduling plugin of scheduler-plugins (#538)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-29 02:27:12 +00:00
Yuki Iwai 11a2940e45
Stop using the kustomize vars feature (#533)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-28 16:52:02 +00:00
Yuki Iwai b302019be7
Respect SchedulingPolicy (#520)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-28 15:37:42 +00:00
Yuki Iwai 10b5f46921
Remove unnecessary permissions (#522)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-09 14:03:06 +00:00
Yuki Iwai efbba01f8d
Clean up manifests (#510)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-08 17:52:33 +00:00
Michał Woźniak 92e491e6e9
Support suspend semantics for MPIJob (#511)
* Implement Suspend semantics for MPIJob

# Conflicts:
#	pkg/apis/kubeflow/v2beta1/types.go
#	pkg/controller/mpi_job_controller.go
#	pkg/controller/mpi_job_controller_status.go
#	pkg/controller/mpi_job_controller_test.go
#	test/integration/mpi_job_controller_test.go

* Changes
- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after worker pods cleanup on
suspend - this allows the MPIJob update to continue in the same sync

# Conflicts:
#	pkg/controller/mpi_job_controller.go
2023-02-03 15:44:02 +00:00
Yuki Iwai 382da780a7
Add scheduling.volcano.sh to ClusterRole (#512)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-29 02:14:03 +00:00
Yuki Iwai 05ac6addc0
Upgrade Kubernetes dependencies (#502)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-26 18:13:09 +00:00
Yuki Iwai c131315192
Remove MPI Operator V1 (#492)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 13:40:56 +00:00
Aldo Culquicondor d61992f91d
E2E test for v2 controller (#399)
2021-08-14 08:12:04 -07:00
Aldo Culquicondor b4b62cc302
Pass runPolicy fields to the launcher Job (#392)
* Add runPolicy to MPIJob.spec

* Pass runPolicy fields to the launcher Job
2021-08-13 07:50:54 -07:00
Aldo Culquicondor 3ba33750b5
Manage launcher through k8s Job (#391)
* Ensure restart policy is Never or OnFailure

Always doesn't make sense for Jobs

* Manage launcher through k8s Job

Still tracking Running status of the job pods.

* Add launcher Pod failed reason
2021-08-12 20:38:54 -07:00
Aldo Culquicondor 990bf1c39d
Add support for Intel MPI (#389)
* Add support for Intel MPI

Adds the field .spec.mpiImplementation, defaults to OpenMPI

The Intel implementation requires a Service fronting the launcher.

* Add an example image that uses Intel MPI
2021-08-03 11:23:41 -07:00
Aldo Culquicondor 9ce646773a
Allow running MPI applications as non-root (#383)
* Allow running MPI applications as non-root

Adds the spec field sshAuthMountPath for MPIJob.
The init script sets the permissions and ownership based on the securityContext of the launcherPod

* Add pure MPI sample that runs as non-root
2021-07-26 22:35:11 -07:00
Aldo Culquicondor 7b6c1bfe22
Upgrade to apiextensions.k8s.io/v1 (#379)
2021-07-23 14:06:33 -04:00
Aldo Culquicondor 70a866ee52
Downgrade v2 API to v2beta1 (#378)
To leave the path open for improving the API without having to release a v3.
2021-07-16 11:29:46 -04:00
Aldo Culquicondor 6afa62ca0b
Add integration tests for v2 controller (#375)
* Do inter-pod communication through SSH

The controller generates keys and mounts them to the containers. The container images must know how to place the credentials and set file permissions.

* Use init-container instead of entrypoint

* Fix scheme for recorder and defaults

* Add integration tests for v2 controller
2021-07-15 06:43:51 -07:00
Aldo Culquicondor b453a9b395
Add development options to build system (#362)
and fix CRD and ClusterRole on kustomize
2021-05-22 18:52:30 -07:00
Yannis Zarkadas b367aa5588
MPI Operator: Consolidate manifests (#340)
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
2021-03-17 23:11:16 -07:00
Yannis Zarkadas c39111e53d
MPI Operator: Move manifests development upstream (#326)
* manifests: Move manifests development upstream

As part of the work of wg-manifests for 1.3
(https://github.com/kubeflow/manifests/issues/1735), we are moving manifests
development in upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.

This commit copies the manifests for application `MPI Operator`
from path `apps/mpi-job/upstream` of kubeflow/manifests to path
`manifests` of the upstream repo (https://github.com/kubeflow/mpi-operator).

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* README: Update README to point to new manifests location

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
2021-03-02 06:27:48 -08:00