# A scalable and robust operator

Authors: @alculquicondor, @ahg-g

- [Motivation](#motivation)
- [Goals](#goals)
- [Background](#background)
- [Design](#design)
- [Alternatives Considered](#alternatives-considered)
- [Appendix](#appendix-prototype-objects)

## Motivation

A scalable MPI setup on Kubernetes is important for:

- Running jobs from multiple users in a single cluster.
- High-performance computing jobs that can require a large number of workers.

A robust MPI setup should be tolerant of failures, implementing retries while
keeping track of failures.

## Goals

- Allow driver-to-worker control to scale by removing kube-apiserver from
  the communication channel.
- Reduce the complexity of the controller by relying on Kubernetes workload
  APIs for Pod creation and management.

## Background

The original design of the operator can be found as a
[proposal to the kubeflow community](https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md).
The latest release includes a v1alpha2 API and controller.
A v1 is under development. Something to highlight in this new version is the
replacement of the Job and StatefulSet by plain Pods, with the intent of
[tracking running Pods](https://github.com/kubeflow/mpi-operator/issues/201#issuecomment-827837831).

An MPIJob CRD describes the Job. Important fields include:
- The workers template.
- The number of workers.
- The launcher template, which should have an `mpirun` command.
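
For illustration, an abridged MPIJob is sketched below. The field layout
follows the v1alpha2 API; the image and executable names are placeholders,
and the values are not meant as a definitive schema.

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: example
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: '<USER_IMAGE>'
            # -np 2 matches 2 workers x 1 slot.
            command: ['mpirun', '-np', '2', '<USER_EXECUTABLE>']
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: '<USER_IMAGE>'
```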

The images are expected to have the MPI implementation binaries (such as
OpenMPI, Intel MPI or MPICH) and the user’s MPI executable.

A controller processes the MPIJob, starting a Job with the following steps
(sketches of the objects created in steps 1 and 2 follow this list):
1. Creates a ConfigMap, which contains:
   - A script `kubexec.sh` that wraps `kubectl exec` and is used in place of
     `ssh`. This script, before executing the command provided by `mpirun`,
     transfers a file containing a mapping of pod names to IPs and appends it
     to the worker’s `/etc/hosts`.

     Note: The v1 controller no longer copies the pod-to-IP mapping file. The
     OpenMPI implementation does the [routing](https://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3).
   - The `hostfile` for `mpirun`, listing the worker pod names and number of
     slots (which could be the number of CPUs or GPUs). This list is built
     programmatically.
2. Creates a ServiceAccount+Role+RoleBinding for the launcher, which allows it
   to:
   - get/list/watch on Pods
   - do pods/exec

   This allows the launcher Pod to obtain details of the worker Pods and
   start the process managers on them.
3. If configured, creates a Volcano PodGroup.
4. Creates a StatefulSet for workers (plain pods in v1). The Pod template
   includes:
   - mounts for the ConfigMap.
   - `sleep` as command.
5. Creates the launcher Job (plain pod in v1). The Pod template includes:
   - An init container, `kubectl-delivery`, described below.
   - Environment variables for:
     - replacing `ssh` with `kubexec.sh`
     - the `hostfile` location
   - Volumes for:
     - the ConfigMap
     - sharing files from `kubectl-delivery`
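
To make steps 1 and 2 concrete, here is a sketch of the ConfigMap, assuming
it is mounted at `/etc/mpi` and that `mpirun` is pointed at the script via
OpenMPI's `OMPI_MCA_plm_rsh_agent` variable. The object names and the exact
script body are illustrative, not the controller's literal output:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: <job-name>-config    # hypothetical naming scheme
data:
  kubexec.sh: |
    #!/bin/sh
    set -x
    # mpirun invokes this as: kubexec.sh <hostname> <command...>
    POD_NAME=$1
    shift
    # Run the command on the worker through the apiserver instead of ssh.
    kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
  hostfile: |
    <job-name>-worker-0 slots=1
    <job-name>-worker-1 slots=1
```

The Role from step 2 might look like the sketch below. Note how it pins
`resourceNames` to every worker pod, which is why the object grows with the
number of workers (see the Analysis section):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: <job-name>-launcher    # hypothetical name
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
  # One entry per worker pod in the job.
  resourceNames: ["<job-name>-worker-0", "<job-name>-worker-1"]
```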

The launcher Job, as previously mentioned, contains an init container:
`kubectl-delivery`. This is a Kubernetes controller that watches pods. It does
the following:

1. Copies kubectl from the image into the volume shared with the main
   container.
2. Waits for all Pods in the hostfile to be running.
3. Generates a file mapping pod name to IP, in `/etc/hosts` format.

To update the status of an MPIJob, the controller uses the status of the
launcher Job. That is, when the launcher Job fails or succeeds, its status is
copied to the MPIJob. In v1, it bases the status on the termination condition
of the Pod.
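
As an illustration, a terminal MPIJob status could look like the sketch
below; the condition layout follows the common Kubeflow JobCondition pattern,
and the exact field values are assumptions rather than output of the current
controller:

```yaml
status:
  conditions:
  - type: Succeeded
    status: "True"
    reason: MPIJobSucceeded    # hypothetical reason string
    message: 'launcher Job completed'
```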

### Analysis

The above architecture for MPI Jobs puts a lot of pressure on the
`kube-apiserver`. The load increases with the number of workers in a job
and with the number of jobs in a cluster.

The reasons for this are:
- Due to the use of `kubectl exec`, every worker spawn goes through
  kube-apiserver. `mpirun` starts a daemon in each worker
  (like [`orted`](https://www.open-mpi.org/doc/v3.0/man1/orted.1.php)).
  This process handles the worker-to-worker communication, which happens
  without the intervention of kube-apiserver. However, the `exec` connection
  stays up for control during the entirety of the job.
- The `kubectl-delivery` controller does a full cache sync to be able to
  watch Pods. This startup penalty increases with the number of pods in the
  cluster and has to be paid for every job. The API calls also cause
  additional stress on the apiserver.
- The launcher role has to include a list of all the pods in the job.
  Potentially, the object might not be able to accommodate jobs with an
  immense number of workers.

Another problem is that the v1 controller doesn’t implement launcher pod
retries, although there are plans to. So the MPIJob behaves like a plain Pod
in this version.

## Design

In order to address the problems of the existing architecture, we propose the
following changes:

- **The use of `ssh` instead of `kubectl exec`.**

  This avoids pod-to-pod communication going through the apiserver and
  doesn’t require giving execution permissions at the namespace level.
  This can be achieved as follows:
  - The controller generates a single key and shares it with the launcher and
    worker pods through a Secret.
  - The launcher and workers mount the Secret and set appropriate file
    permissions.
  - The workers run an SSH server instead of `sleep`.
  - When encrypted communication is not a requirement, users have the choice
    to use `rsh` for faster communication.

- **The use of stable hostnames and a headless Service for the workers.**
  - This removes the need to query Pod IPs, as the Pods can discover each
    other through DNS resolution. The hostfile can be generated statically by
    the controller using the stable hostnames.
  - Starting with k8s 1.22, we can use
    [Indexed Jobs with stable hostnames](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
    to delegate the pod management to Kubernetes (see the sketch after this
    list). Additionally, this will give us robust failure tracking so that we
    can give users control over retry limits.
    In the meantime, we can continue using plain Pods.
    - Caveat 1: The Job controller doesn’t report the number of running Pods.
      Instead, it reports active Pods, which include running and pending
      (scheduling or starting). But Kubernetes SIG Apps is
      [open to add a status field for running Pods](https://kubernetes.slack.com/archives/C18NZM5K9/p1619549430067400).
    - Caveat 2: Horovod supports elastic workers, but the Kubernetes Job
      doesn’t support changes to the completions field. This can be supported
      starting from 1.23. In the meantime, we can replicate the behavior by
      creating a new Job and doing Pod adoption.
  - For Intel MPI and MPICH, we also need a headless Service to front the
    launcher, because workers communicate back to the launcher using its
    hostname (see the sketch after this list).
- **Revert to using the Job API for the launcher.**
  - The Job controller handles retries when the launcher or any of the
    workers fail.
  - Caveat 1 also applies: the Job controller doesn’t report if the Pod is
    running. We can continue watching Pods in the meantime.
- With the above changes, **the following objects can be removed**:
  - The ServiceAccount+Role+RoleBinding for the launcher.
  - The `kubectl-delivery` init container in the launcher, as there is no
    need to obtain IPs, speeding up startup time.
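
Below is a sketch of the worker Job under this design, assuming a cluster
where Indexed Jobs with stable pod hostnames are available (k8s 1.22+). The
names mirror the prototype in the appendix, and the `sshd` command assumes
the user image ships an SSH server, as described above:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-workers
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed      # stable hostnames mpi-workers-0..mpi-workers-2
  template:
    spec:
      subdomain: mpi-workers   # must match the headless Service name
      restartPolicy: Never
      containers:
      - name: worker
        image: '<USER_IMAGE>'
        command: ['/usr/sbin/sshd', '-De']   # SSH server instead of sleep
```

With `subdomain` set to the headless Service name, the pods resolve as
`mpi-workers-0.mpi-workers`, and so on, matching the hostfile in the
appendix. For Intel MPI and MPICH, the extra headless Service fronting the
launcher could look like this (the label selector is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mpi-launcher
spec:
  clusterIP: None
  selector:
    app: mpi-launcher
```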

## Alternatives Considered

TBD from discussions

## Appendix: Prototype objects

The prototype uses a StatefulSet in place of an Indexed Job, as Indexed Jobs
are still an alpha feature in Kubernetes.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mpi-config
data:
  hostfile: |
    mpi-workers-0.mpi-workers slots=3
    mpi-workers-1.mpi-workers slots=3
    mpi-workers-2.mpi-workers slots=3
```

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ssh-auth
type: kubernetes.io/ssh-auth
data:
  ssh-privatekey: PRIVATE_KEY
  ssh-publickey: PUBLIC_KEY
```

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-launcher
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: driver
        image: '<USER_IMAGE>'
        args:
        - 'mpirun'
        - '-np'
        - '9'
        - '<USER_EXECUTABLE>'
        env:
        - name: 'OMPI_MCA_orte_keep_fqdn_hostnames'
          value: 'true'
        - name: 'OMPI_MCA_orte_default_hostfile'
          value: '/home/mpiuser/config/hostfile'
        volumeMounts:
        - name: ssh-auth
          mountPath: /mnt/ssh
          readOnly: true
        - name: mpi-config
          mountPath: /home/mpiuser/config
          readOnly: true
      volumes:
      - name: mpi-config
        configMap:
          name: mpi-config
      - name: ssh-auth
        secret:
          secretName: ssh-auth
          items:
          - key: ssh-privatekey
            path: id_rsa
          - key: ssh-publickey
            path: id_rsa.pub
          - key: ssh-publickey
            path: authorized_keys
```

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mpi-workers
spec:
  clusterIP: None
  selector:
    app: mpi-workers
```

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mpi-workers
spec:
  selector:
    matchLabels:
      app: mpi-workers
  serviceName: mpi-workers
  replicas: 3
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: mpi-workers
    spec:
      containers:
      - name: worker
        image: '<USER_IMAGE>'
        volumeMounts:
        - name: ssh-auth
          mountPath: /mnt/ssh
          readOnly: true
        - name: mpi-config
          mountPath: /home/mpiuser/config
          readOnly: true
      volumes:
      - name: mpi-config
        configMap:
          name: mpi-config
      - name: ssh-auth
        secret:
          secretName: ssh-auth
          items:
          - key: ssh-privatekey
            path: id_rsa
          - key: ssh-publickey
            path: id_rsa.pub
          - key: ssh-publickey
            path: authorized_keys
```