* Introduce ManagedBy field to RunPolicy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make mpi-operator a default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add validation for ManagedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Make use of ManagedBy in reconciliation process
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Regenerate code after adding managedBy field
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add e2e tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update after code review
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Update tests
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Remove default value for ManagedBy
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Add optional tag
Replace backoff and consistently with sleep
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
* Create common util package for integration and e2e tests with sleep/wait constants
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
---------
Signed-off-by: Michal Szadkowski <michal_szadkowski@epam.com>
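The ManagedBy entries above can be illustrated with a minimal sketch. It assumes that an unset field means the operator reconciles the job (consistent with "Remove default value for ManagedBy") and that any other controller name causes the operator to skip it. The `RunPolicy` struct is a stripped-down stand-in, and `operatorControllerName` and `managedByThisOperator` are hypothetical names, not the operator's actual code.

```go
package main

import "fmt"

// operatorControllerName is an assumed identifier for this operator;
// the constant used in the real code base may differ.
const operatorControllerName = "kubeflow.org/mpi-operator"

// RunPolicy is a reduced stand-in that keeps only the optional ManagedBy field.
type RunPolicy struct {
	ManagedBy *string
}

// managedByThisOperator reports whether the operator should reconcile the job.
// A nil ManagedBy is treated as "managed by this operator" because the field
// has no default value.
func managedByThisOperator(p RunPolicy) bool {
	return p.ManagedBy == nil || *p.ManagedBy == operatorControllerName
}

func main() {
	own := operatorControllerName
	external := "example.com/other-controller" // hypothetical external manager

	policies := []RunPolicy{{}, {ManagedBy: &own}, {ManagedBy: &external}}
	for _, p := range policies {
		if !managedByThisOperator(p) {
			fmt.Println("skip reconciliation: managed by", *p.ManagedBy)
			continue
		}
		fmt.Println("reconcile")
	}
}
```

Keeping the field a pointer lets the check distinguish "not set" from an explicit value, which is why the sketch does not rely on a default.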
* Do not overwrite when running launcher as worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* Add unit test for removing NVIDIA env vars for launcher-as-worker
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
---------
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
* Add support for MPICH
* Fix CI errors
* Temporary: manual trigger
* Fix file name
* Add an empty line at the end of the file
* Fix formatting
* Revert "Temporary: manual trigger"
This reverts commit 15164a8b70.
* fix formatting
* Regenerate the mpi-operator.yaml
* Adding an empty line at the end of Dockerfiles
* Share the same entrypoint for Intel and MPICH
* share hostfile generation between Intel and MPICH
* Add validation test for MPICH
* Fix formatting
* Don't over engineer the tests - be explicit
* add non-root tests for IntelMPI and MPICH
* Fix a potential null pointer error
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix a bug where the PodGroupCtrl cannot list priorityclass
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Refactoring setups for gang-scheduling
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Implement Suspend semantics for MPIJob
* Changes
- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after cleaning up worker pods on suspend, so the MPIJob update can continue in the same sync
* Use local copy of RunPolicy by MPI-operator
Steps performed:
- copy the `RunPolicy` from common to `types.go`
- fix compilation errors by using the local RunPolicy definition
- run `make generate`
- run `make all`
- regenerate openapi_generated.go by `./hack/python-sdk/gen-sdk.sh` (with commented out rollback)
* Copy SchedulingPolicy and CleanPodPolicy for RunPolicy