* Upgrade debian version to bookworm
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add explicit verification that all ranks reached the final phase in the pi example
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add support for MPICH
* Fix CI errors
* Temporary: manual trigger
* Fix file name
* Add an empty line at the end of the file
* Fix formatting
* Revert "Temporary: manual trigger"
This reverts commit 15164a8b70.
* fix formatting
* Regenerate the mpi-operator.yaml
* Add an empty line at the end of Dockerfiles
* Share the same entrypoint for Intel and MPICH
* share hostfile generation between Intel and MPICH
* Add validation test for MPICH
* Fix formatting
* Don't over-engineer the tests - be explicit
* add non-root tests for IntelMPI and MPICH
* MV base Dockerfiles to build folder, they are not examples
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Consolidate tensorflow-benchmarks under v2beta1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Move pi demo under v2beta1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* MV mxnet examples under examples/v1
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* MV horovod and tensorflow examples under the compatible API
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Update Makefile after reorg of examples folder
Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
* Configure SSH port for base image
Use 2222 by default.
This should make it easier to use host networks, as generally the port 22 is taken by the host's sshd.
* Set ClusterFirstWithHostNet DNS policy
when the Pods use host network.
This allows resolving the worker and launcher hostnames without needing to include the namespace or cluster domain.
* Add support for Intel MPI
Adds the field .spec.mpiImplementation, defaults to OpenMPI
The Intel implementation requires a Service fronting the launcher.
* Add an example image that uses Intel MPI
* Allow running MPI applications as non-root
Adds the spec field sshAuthMountPath for MPIJob.
The init script sets the permissions and ownership based on the securityContext of the launcherPod
* Add pure MPI sample that runs as non-root
* Do inter-pod communication through SSH
The controller generates keys and mounts them to the containers. The container images must know how to place the credentials and set file permissions.
* Use init-container instead of entrypoint
* Fix scheme for recorder and defaults
* Add integration tests for v2 controller
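
The host-network entries above (SSH port 2222 by default, ClusterFirstWithHostNet when Pods use the host network) can be illustrated with a minimal sketch. It is not the controller's code: the helper name `apply_host_network_defaults`, the `DEFAULT_SSH_PORT` constant, and the plain-dict Pod spec are hypothetical, and overriding any existing `dnsPolicy` is an assumption. Only the port number and the policy name come from the log.

```python
from typing import Any, Dict

# Default SSH port named in the log; port 22 is generally taken by the host's sshd.
# The constant name is hypothetical.
DEFAULT_SSH_PORT = 2222


def apply_host_network_defaults(pod_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of pod_spec with the DNS policy adjusted.

    Hypothetical helper: when the Pod uses the host network, set
    dnsPolicy to ClusterFirstWithHostNet so in-cluster names (worker
    and launcher hostnames) still resolve through cluster DNS.
    Assumption: any dnsPolicy already present is overridden.
    """
    result = dict(pod_spec)
    if result.get("hostNetwork", False):
        result["dnsPolicy"] = "ClusterFirstWithHostNet"
    return result


# A Pod on the host network gets the policy; a regular Pod is left untouched.
host_pod = apply_host_network_defaults({"hostNetwork": True, "containers": []})
plain_pod = apply_host_network_defaults({"containers": []})
```

A Pod spec without `hostNetwork` is returned unchanged, so Kubernetes applies its usual default DNS policy to it.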