272 Commits

Author SHA1 Message Date
Yuki Iwai 05ac6addc0
Upgrade Kubernetes dependencies (#502)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-26 18:13:09 +00:00
Yuki Iwai cd83424f65
Rename Go module name to 'github.com/kubeflow/mpi-operator' (#506)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-25 16:28:53 +00:00
Yuki Iwai 6d0d42ceba
Remove the openapi-generator-cli.jar (#499)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-12 02:40:59 +00:00
Yuki Iwai 85ed6442ca
Upgrade Go version to v1.19 (#497)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-12 02:00:59 +00:00
Yuki Iwai dc36350d99
Move mpi-operator v2 to the top of the repository (#496)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-01-11 17:03:15 +00:00
Yuki Iwai 52f0b81c48
Fix documentation (#495)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 17:19:56 +00:00
Yuki Iwai 6079247133
Remove kubectl-delivery (#494)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 17:18:56 +00:00
Yuki Iwai c131315192
Remove MPI Operator V1 (#492)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 13:40:56 +00:00
adilhusain-s b0815ad90c
Adding mpi-operator workflow to release multi-arch docker image (#489)
Signed-off-by: adilhusain-s <Adilhusain.Shaikh@ibm.com>
2023-01-05 15:48:11 +00:00
Aldo Culquicondor c9454356e2
Upgrade golangci-lint (#485) 2022-12-20 14:16:18 +00:00
davidLif 5be2a42bf5
Update README.md - filtering labels (#475)
In v2, the labels mpi_job_name and mpi_role_type have been changed to training.kubeflow.org/job-name and training.kubeflow.org/job-role.
2022-08-31 18:24:55 +00:00
Carlos Eduardo Arango Gutierrez 993b010e05
Enhance CONTRIBUTING.md (#466)
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-04-06 15:18:45 -04:00
Carlos Eduardo Arango Gutierrez e267b015ae
Bump Go to 1.17 (#458)
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-03-08 15:05:29 +00:00
Carlos Eduardo Arango Gutierrez bb5e538085
Build CRDs using kubebuilder (#452)
Fixes #408

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-03-02 18:22:03 +00:00
Shaowei Su 8a1cf898d9
Add namespace and `svc` suffix for host configmap (#454)
* add ns and svc suffix

* indent

Co-authored-by: shaowei su <shaowei.su@airbnb.com>
2022-02-15 15:53:41 +00:00
Carlos Eduardo Arango Gutierrez c7ca541451
Fix broken E2E test (#455)
* Fix broken E2E test

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* Add missing dependencies

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-02-12 00:51:10 +00:00
Carlos Eduardo Arango Gutierrez 3f808b1c59
Organize examples folder by api compatibility (#451)
* MV base Dockerfile to build folder; it is not an example

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* Consolidate tensorflow-benchmarks under v2beta1

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* Move pi demo under v2beta1

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* MV mxnet examples under examples/v1

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* MV horovod and tensorflow examples under the compatible API

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>

* Update Makefile after reorg of examples folder

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-02-07 23:42:42 +00:00
Peng Gao d7fc50603a
Fix pods map lock (#446)
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2021-12-09 02:46:55 +00:00
Gang Pu b88edad03a
Generate sdk for v2 (#434)
* Generate sdk for v2

* Refine the version parameters of sdk generator

* add example for v2beta1

* make runPolicy optional

* 1: Ignore some generated files that are not needed
2: Add gitattributes file
2021-11-30 12:44:30 +00:00
Gang Pu 285cb98d59
Bump dependency versions to align with v2 (#441) 2021-11-22 14:33:37 +00:00
Gang Pu a334c4c2b8
Remove v1alpha1 and v1alpha2 apis/controllers (#438) 2021-11-22 04:50:36 +00:00
HeGaoYuan 8943cf734d
remove unnecessary namespace field of ClusterRoleBinding in deploy yaml file (#431) 2021-10-08 06:53:23 -07:00
HeGaoYuan 18b1822f4a
typo on proposals/elastic-horovod.md (#430) 2021-10-08 04:24:23 -07:00
Dmitry Kartsev c5c0c3ef99
Add support for using existing ServiceAccount for Launcher pod (#394) (#395)
* Do not create a SA, Role and RoleBinding when --use-launcher-pod-spec-serviceaccount=true;
  instead, use the SA configured in spec.mpiReplicaSpecs.Launcher.template.spec.serviceAccountName
2021-09-20 11:16:00 -07:00
Aldo Culquicondor fee9913c6c
Set ClusterFirstWithHostNet DNS policy when the Pods use host network. (#428)
* Configure SSH port for base image

Use 2222 by default.

This should make it easier to use host networks, as generally the port 22 is taken by the host's sshd.

* Set ClusterFirstWithHostNet DNS policy

when the Pods use host network.

This allows resolving the worker and launcher hostnames without needing to include the namespace or cluster domain.
2021-09-14 07:06:33 -07:00
Aldo Culquicondor 08324e728d
Add readiness probe to Intel MPI jobs (#425)
This improves the reliability of MPI Jobs.

The readiness probe ensures that sshd is up and running before the hostname is resolvable.
2021-09-08 10:22:07 -07:00
Aldo Culquicondor db6930dcd5
Bundle all controller versions in the image (#421) 2021-09-02 11:34:20 -07:00
Aldo Culquicondor b9141c0540
Preparing release of v0.3.0 (#414)
Also
- Updated Makefile to use new version
- extra notes for developers
2021-09-01 08:04:45 -07:00
xhejtman 5fca3284a0
Set OnFailure default restart policy for launcher (#420)
* Add separate restart policy

Add separate restart policy for launcher with OnFailure default

* Set default restart policy

Set default restart policy for launcher to OnFailure

* Fix go tests
2021-08-30 11:55:25 -07:00
Aldo Culquicondor 470d9821d7
Add base images and make PI samples inherit from it (#419) 2021-08-27 13:34:06 -07:00
Aldo Culquicondor 8f5bbd8203
Mount SSH Secret directly on main container (#416)
Remove the init container for faster startup.

Possible by disabling StrictModes in sshd_config.
2021-08-26 15:42:06 -07:00
Aldo Culquicondor 0bccdb9672
Fix intel MPI E2E test image (#417)
Print last launcher logs when E2E test fails
2021-08-25 12:27:45 -07:00
Aldo Culquicondor a566d1d180
Add compiled manifest for v2beta1 (#411)
Used `kubectl kustomize manifests/overlays/standalone` and removed the unnecessary ConfigMap
2021-08-22 19:34:56 -07:00
Aldo Culquicondor c73ef6b0b1
Use fully-qualified label names from common (#409) 2021-08-19 19:01:54 -07:00
Aldo Culquicondor 24bbfe7c27
Increase unit coverage of v2 controller (#406) 2021-08-17 19:13:37 -07:00
Aldo Culquicondor a84e8a2381
Increase E2E wait timeout (#405)
To reduce flakiness
2021-08-17 07:51:42 -07:00
Aldo Culquicondor d7044775b2
Add alculquicondor as reviewer (#404) 2021-08-17 10:17:45 -04:00
Aldo Culquicondor 85aefc60c8
Remove ability to run ranks in launcher (#398) 2021-08-16 13:42:42 -07:00
Aldo Culquicondor bb76ce1b4c
Add E2E tests for failure, root and Intel (#403) 2021-08-16 13:18:42 -07:00
Aldo Culquicondor d61992f91d
E2E test for v2 controller (#399) 2021-08-14 08:12:04 -07:00
Aldo Culquicondor b4b62cc302
Pass runPolicy fields to the launcher Job (#392)
* Add runPolicy to MPIJob.spec

* Pass runPolicy fields to the launcher Job
2021-08-13 07:50:54 -07:00
Aldo Culquicondor 3ba33750b5
Manage launcher through k8s Job (#391)
* Ensure restart policy is Never or OnFailure

Always doesn't make sense for Jobs

* Manage launcher through k8s Job

Still tracking Running status of the job pods.

* Add launcher Pod failed reason
2021-08-12 20:38:54 -07:00
Aldo Culquicondor b9dbbc5750
Fix Discovery script for intel (#397)
Slots are handled through an environment variable instead.
2021-08-11 13:19:01 -07:00
Aldo Culquicondor 990bf1c39d
Add support for Intel MPI (#389)
* Add support for Intel MPI

Adds the field .spec.mpiImplementation, defaults to OpenMPI

The Intel implementation requires a Service fronting the launcher.

* Add an example image that uses Intel MPI
2021-08-03 11:23:41 -07:00
Aldo Culquicondor 50d7f24539
Optimize OpenMPI image size (#390) 2021-07-29 05:28:18 -04:00
Aldo Culquicondor 108a697fb3
Fix validation tests and account for invalid cleanPodPolicy (#387) 2021-07-28 00:49:10 -04:00
Aldo Culquicondor 9ce646773a
Allow running MPI applications as non-root (#383)
* Allow running MPI applications as non-root

Adds the spec field sshAuthMountPath for MPIJob.
The init script sets the permissions and ownership based on the securityContext of the launcherPod

* Add pure MPI sample that runs as non-root
2021-07-26 22:35:11 -07:00
Aldo Culquicondor 84604c807d
Validate that MPIJob produces valid hostnames (#385)
Hostnames must be valid DNS labels. This includes checking for invalid characters and a maximum length
2021-07-26 17:32:11 -07:00
Aldo Culquicondor 7b6c1bfe22
Upgrade to apiextensions.k8s.io/v1 (#379) 2021-07-23 14:06:33 -04:00
HeGaoYuan fe99cf04dc
fix comment typo related to statefulset (#382) 2021-07-21 03:39:50 -07:00