Yuki Iwai
05ac6addc0
Upgrade Kubernetes dependencies ( #502 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-26 18:13:09 +00:00
Yuki Iwai
cd83424f65
Rename Go module name to 'github.com/kubeflow/mpi-operator' ( #506 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-25 16:28:53 +00:00
Yuki Iwai
6d0d42ceba
Remove the openapi-generator-cli.jar ( #499 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-12 02:40:59 +00:00
Yuki Iwai
85ed6442ca
Upgrade Go version to v1.19 ( #497 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-12 02:00:59 +00:00
Yuki Iwai
dc36350d99
Move mpi-operator v2 to the top of the repository ( #496 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-01-11 17:03:15 +00:00
Yuki Iwai
52f0b81c48
Fix documentation ( #495 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 17:19:56 +00:00
Yuki Iwai
6079247133
Remove kubectl-delivery ( #494 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 17:18:56 +00:00
Yuki Iwai
c131315192
Remove MPI Operator V1 ( #492 )
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-01-06 13:40:56 +00:00
adilhusain-s
b0815ad90c
Adding mpi-operator workflow to release multi-arch docker image ( #489 )
Signed-off-by: adilhusain-s <Adilhusain.Shaikh@ibm.com>
2023-01-05 15:48:11 +00:00
Aldo Culquicondor
c9454356e2
Upgrade golangci-lint ( #485 )
2022-12-20 14:16:18 +00:00
davidLif
5be2a42bf5
Update README.md - filtering labels ( #475 )
In v2, the labels mpi_job_name and mpi_role_type have been changed to training.kubeflow.org/job-name and training.kubeflow.org/job-role, respectively.
2022-08-31 18:24:55 +00:00
Carlos Eduardo Arango Gutierrez
993b010e05
Enhance CONTRIBUTING.md ( #466 )
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-04-06 15:18:45 -04:00
Carlos Eduardo Arango Gutierrez
e267b015ae
Bump Go to 1.17 ( #458 )
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-03-08 15:05:29 +00:00
Carlos Eduardo Arango Gutierrez
bb5e538085
Build CRDs using kubebuilder ( #452 )
Fixes #408
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-03-02 18:22:03 +00:00
Shaowei Su
8a1cf898d9
Add namespace and `svc` suffix for host configmap ( #454 )
* add ns and svc suffix
* indent
Co-authored-by: shaowei su <shaowei.su@airbnb.com>
2022-02-15 15:53:41 +00:00
Carlos Eduardo Arango Gutierrez
c7ca541451
Fix broken E2E test ( #455 )
* Fix broken E2E test
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Add missing dependencies
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-02-12 00:51:10 +00:00
Carlos Eduardo Arango Gutierrez
3f808b1c59
Organize examples folder by api compatibility ( #451 )
* MV base Dockerfile to build folder, as it is not an example
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Consolidate tensorflow-benchmarks under v2beta1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Move pi demo under v2beta1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* MV mxnet examples under examples/v1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* MV horovod and tensorflow examples under the compatible API
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Update Makefile after reorg of examples folder
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
2022-02-07 23:42:42 +00:00
Peng Gao
d7fc50603a
Fix pods map lock ( #446 )
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2021-12-09 02:46:55 +00:00
Gang Pu
b88edad03a
Generate sdk for v2 ( #434 )
* Generate sdk for v2
* Refine the version parameters of sdk generator
* add example for v2beta1
* make runPolicy optional
* Ignore some generated files that are not needed
* Add gitattributes file
2021-11-30 12:44:30 +00:00
Gang Pu
285cb98d59
Bump dependency versions to align with v2 ( #441 )
2021-11-22 14:33:37 +00:00
Gang Pu
a334c4c2b8
Remove v1alpha1 and v1alpha2 apis/controllers ( #438 )
2021-11-22 04:50:36 +00:00
HeGaoYuan
8943cf734d
remove unnecessary namespace field of ClusterRoleBinding in deploy yaml file ( #431 )
2021-10-08 06:53:23 -07:00
HeGaoYuan
18b1822f4a
typo on proposals/elastic-horovod.md ( #430 )
2021-10-08 04:24:23 -07:00
Dmitry Kartsev
c5c0c3ef99
Add support for using existing ServiceAccount for Launcher pod ( #394 ) ( #395 )
* Do not create a SA, Role and RoleBinding when --use-launcher-pod-spec-serviceaccount=true,
instead, use SA configured in spec.mpiReplicaSpecs.Launcher.template.spec.serviceAccountName
2021-09-20 11:16:00 -07:00
Aldo Culquicondor
fee9913c6c
Set ClusterFirstWithHostNet DNS policy when the Pods use host network. ( #428 )
* Configure SSH port for base image
Use 2222 by default.
This should make it easier to use host networks, as generally the port 22 is taken by the host's sshd.
* Set ClusterFirstWithHostNet DNS policy
when the Pods use host network.
This allows resolving the worker and launcher hostnames without needing to include the namespace or cluster domain.
2021-09-14 07:06:33 -07:00
Aldo Culquicondor
08324e728d
Add readiness probe to Intel MPI jobs ( #425 )
This improves the reliability of MPI Jobs.
The readiness probe ensures that sshd is up and running before the hostname is resolvable.
2021-09-08 10:22:07 -07:00
Aldo Culquicondor
db6930dcd5
Bundle all controller versions in the image ( #421 )
2021-09-02 11:34:20 -07:00
Aldo Culquicondor
b9141c0540
Preparing release of v0.3.0 ( #414 )
Also
- Updated Makefile to use new version
- extra notes for developers
2021-09-01 08:04:45 -07:00
xhejtman
5fca3284a0
Set OnFailure default restart policy for launcher ( #420 )
* Add separate restart policy
Add separate restart policy for launcher with OnFailure default
* Set default restart policy
Set default restart policy for launcher to OnFailure
* Fix go tests
2021-08-30 11:55:25 -07:00
Aldo Culquicondor
470d9821d7
Add base images and make PI samples inherit from it ( #419 )
2021-08-27 13:34:06 -07:00
Aldo Culquicondor
8f5bbd8203
Mount SSH Secret directly on main container ( #416 )
Remove the init container for faster startup.
Possible by disabling StrictModes in sshd_config.
2021-08-26 15:42:06 -07:00
Aldo Culquicondor
0bccdb9672
Fix intel MPI E2E test image ( #417 )
Print last launcher logs when E2E test fails
2021-08-25 12:27:45 -07:00
Aldo Culquicondor
a566d1d180
Add compiled manifest for v2beta1 ( #411 )
Used `kubectl kustomize manifests/overlays/standalone` and removed the unnecessary ConfigMap
2021-08-22 19:34:56 -07:00
Aldo Culquicondor
c73ef6b0b1
Use fully-qualified label names from common ( #409 )
2021-08-19 19:01:54 -07:00
Aldo Culquicondor
24bbfe7c27
Increase unit coverage of v2 controller ( #406 )
2021-08-17 19:13:37 -07:00
Aldo Culquicondor
a84e8a2381
Increase E2E wait timeout ( #405 )
To reduce flakiness
2021-08-17 07:51:42 -07:00
Aldo Culquicondor
d7044775b2
Add alculquicondor as reviewer ( #404 )
2021-08-17 10:17:45 -04:00
Aldo Culquicondor
85aefc60c8
Remove ability to run ranks in launcher ( #398 )
2021-08-16 13:42:42 -07:00
Aldo Culquicondor
bb76ce1b4c
Add E2E tests for failure, root and Intel ( #403 )
2021-08-16 13:18:42 -07:00
Aldo Culquicondor
d61992f91d
E2E test for v2 controller ( #399 )
2021-08-14 08:12:04 -07:00
Aldo Culquicondor
b4b62cc302
Pass runPolicy fields to the launcher Job ( #392 )
* Add runPolicy to MPIJob.spec
* Pass runPolicy fields to the launcher Job
2021-08-13 07:50:54 -07:00
Aldo Culquicondor
3ba33750b5
Manage launcher through k8s Job ( #391 )
* Ensure restart policy is Never or OnFailure
Always doesn't make sense for Jobs
* Manage launcher through k8s Job
Still tracking Running status of the job pods.
* Add launcher Pod failed reason
2021-08-12 20:38:54 -07:00
Aldo Culquicondor
b9dbbc5750
Fix Discovery script for intel ( #397 )
Slots are handled through an environment variable instead.
2021-08-11 13:19:01 -07:00
Aldo Culquicondor
990bf1c39d
Add support for Intel MPI ( #389 )
* Add support for Intel MPI
Adds the field .spec.mpiImplementation, defaults to OpenMPI
The Intel implementation requires a Service fronting the launcher.
* Add an example image that uses Intel MPI
2021-08-03 11:23:41 -07:00
Aldo Culquicondor
50d7f24539
Optimize OpenMPI image size ( #390 )
2021-07-29 05:28:18 -04:00
Aldo Culquicondor
108a697fb3
Fix validation tests and account for invalid cleanPodPolicy ( #387 )
2021-07-28 00:49:10 -04:00
Aldo Culquicondor
9ce646773a
Allow running MPI applications as non-root ( #383 )
* Allow running MPI applications as non-root
Adds the spec field sshAuthMountPath for MPIJob.
The init script sets the permissions and ownership based on the securityContext of the launcherPod
* Add pure MPI sample that runs as non-root
2021-07-26 22:35:11 -07:00
Aldo Culquicondor
84604c807d
Validate that MPIJob produces valid hostnames ( #385 )
Hostnames must be valid DNS labels. This includes checking for invalid characters and a maximum length.
2021-07-26 17:32:11 -07:00
Aldo Culquicondor
7b6c1bfe22
Upgrade to apiextensions.k8s.io/v1 ( #379 )
2021-07-23 14:06:33 -04:00
HeGaoYuan
fe99cf04dc
fix comment typo related to statefulset ( #382 )
2021-07-21 03:39:50 -07:00