trainer: Update Kubeflow Trainer personas diagram (#4144)

* trainer: Update Kubeflow Trainer personas diagram

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update personas

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update Lifecycle Diagram

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update runtime guide

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add dependsOn

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Andrey Velichkevich 2025-07-16 16:42:39 +01:00 committed by GitHub
commit 0453687e2b (parent ab27782aee)
5 changed files with 158 additions and 107 deletions

(Image diffs suppressed: one SVG diagram added at 888 KiB, one removed at 880 KiB, and one updated from 492 KiB to 614 KiB.)

---

@@ -5,51 +5,69 @@ weight = 30
+++
## Overview
This guide explains how cluster administrators can manage `TrainingRuntime` and `ClusterTrainingRuntime` resources. It describes how to configure the `MLPolicy`, `PodGroupPolicy`, and `Template` APIs.
**Note**: **Runtimes** are blueprints that capture an optimal, ready-to-use configuration for running specific training tasks.
### What is ClusterTrainingRuntime
The ClusterTrainingRuntime is a cluster-scoped API in Kubeflow Trainer that allows platform administrators to manage templates for TrainJobs. A runtime is available across the entire Kubernetes cluster and can be reused by ML engineers in their TrainJobs. This simplifies running training jobs by providing standardized blueprints and ready-to-use environments.
### Example of ClusterTrainingRuntime
```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
  podGroupPolicy:
    coscheduling:
      scheduleTimeoutSeconds: 100
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
                      command:
                        - /bin/bash
                        - -c
                        - |
                          echo "Torch Distributed Runtime"
                          echo "--------------------------------------"
                          echo "Torch Default Runtime Env"
                          env | grep PET_
                          pip list
```
Referencing: in Kubeflow Trainer, a ClusterTrainingRuntime defines a reusable template for distributed training, specifying node count, processes, and scheduling policies. A TrainJob references this runtime via the `runtimeRef` field, linking to its `apiGroup`, `kind`, and `name`. This enables the TrainJob to use the runtime's configuration for consistent and modular training setups.
```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
  namespace: default
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    name: torch-distributed
    kind: ClusterTrainingRuntime
```
### What is TrainingRuntime
The TrainingRuntime is a namespace-scoped API in Kubeflow Trainer that allows platform administrators to manage templates for TrainJobs per namespace. It is ideal for teams or projects that need their own customized training setups, offering flexibility for decentralized control.
@@ -57,53 +75,64 @@ The TrainingRuntime is a namespace-scoped API in Kubeflow Trainer that allows pl
### Example of TrainingRuntime
```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: pytorch-team-runtime
  namespace: team-a
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: 4
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
                      command:
                        - /bin/bash
                        - -c
                        - |
                          echo "Torch Distributed Runtime"
                          echo "--------------------------------------"
                          echo "Torch Default Runtime Env"
                          env | grep PET_
                          pip list
```
Referencing: when using a TrainingRuntime, it must be defined in the same Kubernetes namespace as the TrainJob that references it.
```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
  namespace: team-a # Must be the same namespace as the TrainingRuntime
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    name: pytorch-team-runtime
    kind: TrainingRuntime
```
### What is MLPolicy
The `MLPolicy` API configures ML-specific parameters, for example the configuration for PyTorch Distributed or the MPI hostfile location.
To define an `MLPolicy` in a ClusterTrainingRuntime or TrainingRuntime:
```YAML
mlPolicy:
  numNodes: 3
```

@@ -112,53 +141,69 @@ mlPolicy:
#### Torch and MPI
- **Torch**: Configures distributed training for PyTorch. Use this policy to set options like the
  number of processes per node (`numProcPerNode`) for PyTorch distributed workloads.
- **MPI**: Configures distributed training using MPI. This policy allows you to specify options
  such as the number of processes per node and MPI implementation details, as sketched below.
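For instance, the launcher-specific options live under the `torch` or `mpi` keys of `MLPolicy`. The following is an illustrative sketch, not a definitive manifest; the field names are assumed to follow the API reference linked below, and a runtime sets only one of the two policies:

```YAML
# Torch variant: 3 training nodes, auto-detected processes per node.
mlPolicy:
  numNodes: 3
  torch:
    numProcPerNode: auto
---
# MPI variant (illustrative): 2 processes per node with the OpenMPI implementation.
mlPolicy:
  numNodes: 3
  mpi:
    numProcPerNode: 2
    mpiImplementation: OpenMPI
```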
For a complete list of available options and detailed API fields, refer to the [Kubeflow Trainer API reference](https://pkg.go.dev/github.com/kubeflow/trainer/v2/pkg/apis/trainer/v1alpha1#MLPolicy).
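Gang-scheduling for the runtime's Pods is configured through the separate `PodGroupPolicy` API, as shown in the ClusterTrainingRuntime example above. A minimal sketch using the coscheduling plugin (the timeout value is illustrative):

```YAML
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 100
```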
### What is Template
The `Template` API configures [the JobSet template](https://jobset.sigs.k8s.io/docs/overview/) used to execute the TrainJob. The Kubeflow Trainer controller manager creates the appropriate JobSet based on the `Template` and other configurations from the runtime (e.g., `MLPolicy`).
#### Template Configuration
For each job in `replicatedJobs`, you can provide detailed settings, like the container image,
commands, and resource requirements. Here is an example:
```YAML
replicatedJobs:
  - name: model-initializer
    template:
      metadata:
        labels:
          trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
      spec:
        template:
          spec:
            containers:
              - name: model-initializer
                image: ghcr.io/kubeflow/trainer/model-initializer
  - name: node
    dependsOn:
      - name: model-initializer
        status: Complete
    template:
      spec:
        template:
          spec:
            containers:
              - name: node
                image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
                command: ["python", "/path/to/train.py"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "4Gi"
                  limits:
                    nvidia.com/gpu: "1"
```
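In this example, `dependsOn` (part of the underlying JobSet API) gates job startup: the `node` job is created only after the `model-initializer` job reaches the `Complete` status.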
### Ancestor Label Requirements for ReplicatedJobs
When defining `replicatedJobs` such as `dataset-initializer`, `model-initializer`, and `node`,
it is important to ensure that each job template includes the necessary ancestor labels.
These labels are used by the Kubeflow Trainer controller to inject values from the TrainJob into
the underlying training job.
**Required Labels:**
- `trainer.kubeflow.org/trainjob-ancestor-step`: Specifies the role or step of the replicated job in the training workflow (e.g., `dataset-initializer`, `model-initializer`, or `trainer`).
**Example:**
```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
@@ -180,6 +225,9 @@ spec:
                    - name: dataset-initializer
                      image: ghcr.io/kubeflow/trainer/dataset-initializer
        - name: model-initializer
          dependsOn:
            - name: dataset-initializer
              status: Complete
          template:
            metadata:
              labels:
@@ -191,6 +239,9 @@ spec:
                    - name: model-initializer
                      image: ghcr.io/kubeflow/trainer/model-initializer
        - name: node
          dependsOn:
            - name: model-initializer
              status: Complete
          template:
            metadata:
              labels:
```

---

@@ -43,10 +43,10 @@ responsibilities:
Kubeflow Trainer documentation is organized around these user personas:
- [AI Practitioners](/docs/components/trainer/user-guides): ML engineers and data scientists who
  develop AI models using the Kubeflow Python SDK and TrainJob.
- [Platform Administrators](/docs/components/trainer/operator-guides): administrators and DevOps
  engineers responsible for managing Kubernetes clusters and Kubeflow Training Runtimes.
- [Contributors](/docs/components/trainer/contributor-guides): open source contributors working on
[Kubeflow Trainer project](https://github.com/kubeflow/trainer).
@@ -58,11 +58,11 @@ Watch the following KubeCon + CloudNativeCon 2024 talk which provides an overvie
## Why use Kubeflow Trainer
The Kubeflow Trainer supports key phases of the [AI lifecycle](/docs/started/architecture/#kubeflow-components-in-the-ml-lifecycle),
including model training and LLM fine-tuning, as shown in the diagram below:
<img src="/docs/components/trainer/images/ai-lifecycle-trainer.drawio.svg"
  alt="AI Lifecycle Trainer"
class="mt-3 mb-3 border rounded p-3 bg-white">
### Key Benefits
@@ -83,9 +83,9 @@ Fine-tune the latest LLMs on Kubernetes with ready-to-use Kubeflow LLM blueprint
- **Reduce GPU Cost**
  Kubeflow Trainer implements custom dataset and model initializers to reduce GPU cost by
  offloading I/O tasks to CPU workloads and to streamline asset initialization across distributed
  training nodes.
- **Seamless Kubernetes Integration**