trainer: Add documentation for the MultiKueue and spec.managedBy API (#3956)

* resolves kubeflow/training/#2279

Signed-off-by: Garvit-77 <garvitname@gmail.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Create managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Delete content/en/docs/components/training/user-guides/managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Comments-updated

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update tensorflow.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Updated weight managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* updated

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

---------

Signed-off-by: Garvit-77 <garvitname@gmail.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
Garvit Khandelwal 2025-04-10 23:43:14 +05:30 committed by GitHub
parent e3163e3812
commit 0d462bd03e
GPG Key ID: B5690EEEBB952194
2 changed files with 70 additions and 1 deletions

@@ -0,0 +1,69 @@
+++
title = "How to manage Jobs in a multi-cluster environment"
description = "Using the managedBy field for MultiKueue"
weight = 10
+++
## Overview
This page describes how to use the `MultiKueue` feature of the Kueue project with Kubeflow Training Operator jobs, including MPI Jobs. `MultiKueue` lets a manager cluster dispatch jobs to worker clusters, so that queueing and resource allocation are handled across clusters rather than one cluster at a time.
The `spec.runPolicy.managedBy` field was introduced in the Kubeflow Training Operator to support MultiKueue. It names the controller that is responsible for managing the job, which makes multi-cluster job dispatching more robust.
## Prerequisites
1. Ensure that you have the Kubeflow Training Operator installed at a version up to 1.9, and Kueue at version 0.11 or later.
2. Make sure Kueue is compiled against a Training Operator version that includes the `spec.runPolicy.managedBy` field.
## Usage
To use the `spec.runPolicy.managedBy` field in your training jobs, include it in the job specification as shown below:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    # Hand management of this job over to MultiKueue.
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    ...
```
## Example
Here is a complete example of a TensorFlow job that uses the `spec.runPolicy.managedBy` field:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    # Hand management of this job over to MultiKueue.
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
```
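Before submitting a manifest, it can help to confirm that the field sits at the expected path. The following is only a minimal sketch in Python: it assumes the manifest has already been loaded into a plain dictionary, and the helper names `get_managed_by` and `is_managed_by_multikueue` are hypothetical, not part of Kueue or the Training Operator.

```python
from typing import Any, Mapping, Optional

# Value copied from the TFJob example above.
MULTIKUEUE_CONTROLLER = "kueue.x-k8s.io/multikueue"


def get_managed_by(manifest: Mapping[str, Any]) -> Optional[str]:
    """Return spec.runPolicy.managedBy, or None when the field is not set."""
    spec = manifest.get("spec") or {}
    run_policy = spec.get("runPolicy") or {}
    return run_policy.get("managedBy")


def is_managed_by_multikueue(manifest: Mapping[str, Any]) -> bool:
    """Return True when the manifest delegates management to MultiKueue."""
    return get_managed_by(manifest) == MULTIKUEUE_CONTROLLER


# The example TFJob written as a dictionary (replica specs left empty here).
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "example-tfjob"},
    "spec": {
        "runPolicy": {"managedBy": MULTIKUEUE_CONTROLLER},
        "tfReplicaSpecs": {},
    },
}

if __name__ == "__main__":
    print(is_managed_by_multikueue(tfjob))
```

A manifest that omits `runPolicy` or `managedBy` yields `None` from `get_managed_by`, so the check reports `False` for it rather than raising an error.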
## More Details
For more details on setting up and using MultiKueue with the Kubeflow Training Operator, refer to the following documentation pages:
- [Kueue/Kubeflow](https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/kubeflow/)
- [Kueue docs](https://kueue.sigs.k8s.io/docs/concepts/multikueue/)

@@ -1,7 +1,7 @@
+++
title = "TensorFlow Training (TFJob)"
description = "Using TFJob to train a model with TensorFlow"
-weight = 10
+weight = 20
+++
{{% alert title="Old Version" color="warning" %}}