mirror of https://github.com/kubeflow/website.git

trainer: Update Kubeflow Trainer personas diagram (#4144)

* Update Kubeflow Trainer personas diagram
* Update personas
* Update Lifecycle Diagram
* Update runtime guide
* Add dependsOn

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Commit 0453687e2b (parent ab27782aee)
weight = 30
+++
## Overview
This guide explains how cluster administrators should manage `TrainingRuntime` and `ClusterTrainingRuntime`. It describes how to configure `MLPolicy`, `PodGroupPolicy`, and `Template` APIs.
**Note**: Runtimes are blueprints that provide an optimized, ready-to-use configuration for running a specific training task.
### What is ClusterTrainingRuntime
The ClusterTrainingRuntime is a cluster-scoped API in Kubeflow Trainer that allows platform administrators to manage templates for TrainJobs. Runtimes can be deployed across the entire Kubernetes cluster and reused by ML engineers in their TrainJobs. It simplifies the process of running training jobs by providing standardized blueprints and ready-to-use environments.
### Example of ClusterTrainingRuntime

```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
  podGroupPolicy:
    coscheduling:
      scheduleTimeoutSeconds: 100
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
                      command:
                        - /bin/bash
                        - -c
                        - |
                          echo "Torch Distributed Runtime"

                          echo "--------------------------------------"
                          echo "Torch Default Runtime Env"
                          env | grep PET_

                          pip list
```

Referencing: In Kubeflow, a ClusterTrainingRuntime defines a reusable template for distributed
training, specifying the node count, processes, and scheduling policies. A TrainJob references this
runtime via the `runtimeRef` field, linking to its `apiGroup`, `kind`, and `name`. This enables the
TrainJob to use the runtime's configuration for consistent and modular training setups.

```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
  namespace: default
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    name: torch-distributed
    kind: ClusterTrainingRuntime
### What is TrainingRuntime
The TrainingRuntime is a namespace-scoped API in Kubeflow Trainer that allows platform administrators to manage templates for TrainJobs per namespace. It is ideal for teams or projects that need their own customized training setups, offering flexibility for decentralized control.
### Example of TrainingRuntime

```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: pytorch-team-runtime
  namespace: team-a
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: 4
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
                      command:
                        - /bin/bash
                        - -c
                        - |
                          echo "Torch Distributed Runtime"

                          echo "--------------------------------------"
                          echo "Torch Default Runtime Env"
                          env | grep PET_

                          pip list
```

Referencing: When using a TrainingRuntime, it must be in the same Kubernetes namespace as the
TrainJob that references it.

```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
  namespace: team-a # Only accessible to the namespace for which it is defined
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    name: pytorch-team-runtime
    kind: TrainingRuntime
```
### What is MLPolicy
The `MLPolicy` API configures ML-specific parameters, for example, the PyTorch distributed settings or the MPI hostfile location.
To define MLPolicy in ClusterTrainingRuntime or TrainingRuntime:
```YAML
mlPolicy:
  numNodes: 3
  torch:
    numProcPerNode: auto
```

#### Torch and MPI

- **Torch**: Configures distributed training for PyTorch. Use this policy to set options like the
  number of processes per node (`numProcPerNode`) for PyTorch distributed workloads.
- **MPI**: Configures distributed training using MPI. This policy allows you to specify options
  such as the number of processes per node and MPI implementation details.

For a complete list of available options and detailed API fields, refer to the [Kubeflow Trainer API reference](https://pkg.go.dev/github.com/kubeflow/trainer/v2/pkg/apis/trainer/v1alpha1#MLPolicy).
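
For example, an MPI-based runtime sets the `mpi` policy instead of `torch`. A minimal sketch of
such an `mlPolicy` (field names follow the `MLPolicy` API linked above; the values are
illustrative, not a recommended configuration):

```YAML
mlPolicy:
  # Two training nodes, each running four MPI processes
  numNodes: 2
  mpi:
    numProcPerNode: 4
    # Assumed implementation; check the API reference for supported values
    mpiImplementation: OpenMPI
```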
### What is Template
The `Template` API configures [the JobSet template](https://jobset.sigs.k8s.io/docs/overview/) used to execute the TrainJob. The Kubeflow Trainer controller manager creates the appropriate JobSet based on the `Template` and other runtime configurations (e.g. `MLPolicy`).

#### Template Configuration

For each job in `replicatedJobs`, you can provide detailed settings, such as the container image,
commands, and resource requirements:

```YAML
replicatedJobs:
  - name: model-initializer
    template:
      metadata:
        labels:
          trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
      spec:
        template:
          spec:
            containers:
              - name: model-initializer
                image: ghcr.io/kubeflow/trainer/model-initializer
  - name: node
    dependsOn:
      - name: model-initializer
        status: Complete
    template:
      spec:
        template:
          spec:
            containers:
              - name: node
                image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
                command: ["python", "/path/to/train.py"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "4Gi"
                  limits:
                    nvidia.com/gpu: "1"
```
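
The `dependsOn` field controls startup ordering between replicated jobs: a job is created only
after each job it depends on reaches the given status. Extending this to the full initializer
chain used elsewhere in this guide, the ordering can be sketched as (a fragment, not a complete
runtime):

```YAML
replicatedJobs:
  - name: dataset-initializer
  - name: model-initializer
    dependsOn:
      # Fetch the model only after the dataset download has completed
      - name: dataset-initializer
        status: Complete
  - name: node
    dependsOn:
      # Start training only after the model assets are in place
      - name: model-initializer
        status: Complete
```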
### Ancestor Label Requirements for ReplicatedJobs

When defining `replicatedJobs` such as `dataset-initializer`, `model-initializer`, and `node`,
it is important to ensure that each job template includes the necessary ancestor labels.
These labels are used by the Kubeflow Trainer controller to inject values from the TrainJob into
the underlying training job.

**Required Labels:**
- `trainer.kubeflow.org/trainjob-ancestor-step`: Specifies the role or step of the replicated job in the training workflow (e.g., `dataset-initializer`, `model-initializer`, or `trainer`).
**Example:**

```YAML
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  template:
    spec:
      replicatedJobs:
        - name: dataset-initializer
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
            spec:
              template:
                spec:
                  containers:
                    - name: dataset-initializer
                      image: ghcr.io/kubeflow/trainer/dataset-initializer
        - name: model-initializer
          dependsOn:
            - name: dataset-initializer
              status: Complete
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
            spec:
              template:
                spec:
                  containers:
                    - name: model-initializer
                      image: ghcr.io/kubeflow/trainer/model-initializer
        - name: node
          dependsOn:
            - name: model-initializer
              status: Complete
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
```

---

Kubeflow Trainer documentation is organized by these user personas:

- [AI Practitioners](/docs/components/trainer/user-guides): ML engineers and data scientists who
  develop AI models using the Kubeflow Python SDK and TrainJob.
- [Platform Administrators](/docs/components/trainer/operator-guides): administrators and DevOps
  engineers responsible for managing Kubernetes clusters and Kubeflow Training Runtimes.
- [Contributors](/docs/components/trainer/contributor-guides): open source contributors working on
  the [Kubeflow Trainer project](https://github.com/kubeflow/trainer).
## Why use Kubeflow Trainer

The Kubeflow Trainer supports key phases of the [AI lifecycle](/docs/started/architecture/#kubeflow-components-in-the-ml-lifecycle),
including model training and LLM fine-tuning, as shown in the diagram below:

<img src="/docs/components/trainer/images/ai-lifecycle-trainer.drawio.svg"
     alt="AI Lifecycle Trainer"
     class="mt-3 mb-3 border rounded p-3 bg-white">

### Key Benefits
- **Reduce GPU Cost**

  Kubeflow Trainer implements custom dataset and model initializers to reduce GPU cost by
  offloading I/O tasks to CPU workloads and by streamlining asset initialization across
  distributed training nodes.

- **Seamless Kubernetes Integration**