mirror of https://github.com/kubeflow/website.git
trainer: Add documentation for the MultiKueue and spec.managedBy API (#3956)
* resolves kubeflow/training/#2279
  Signed-off-by: Garvit-77 <garvitname@gmail.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update content/en/docs/components/training/user-guides/managedby.md
  Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Create managedby.md
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Delete content/en/docs/components/training/user-guides/managedby.md
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Comments-updated
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Update tensorflow.md
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* Updated weight managedby.md
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
* updated
  Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

---------

Signed-off-by: Garvit-77 <garvitname@gmail.com>
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
parent e3163e3812
commit 0d462bd03e

@@ -0,0 +1,69 @@
+++
title = "How to manage Jobs in a multi-cluster environment"
description = "Using the managedBy field for MultiKueue"
weight = 10
+++

## Overview

This page describes how to use the `MultiKueue` feature of the Kueue project with Kubeflow training jobs, including MPI Jobs. `MultiKueue` dispatches jobs from a manager cluster to one or more worker clusters, so that a single set of queues can schedule work across clusters and make better use of their combined resources.

The `spec.runPolicy.managedBy` field was introduced in the Kubeflow Training Operator to support MultiKueue. It names the controller that is responsible for managing the job, which lets the Training Operator hand a job over to MultiKueue for multi-cluster dispatching instead of reconciling it locally.

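As a rough sketch of what the field can hold, the fragments below contrast the two values. This assumes the Training Operator v1 API accepts `kubeflow.org/training-operator` (the operator's own name, treated the same as leaving the field unset) and `kueue.x-k8s.io/multikueue`; check the API reference of your installed version before relying on it.

```yaml
# Sketch only: two alternative runPolicy fragments, not a single manifest.

# Assumed default: the Training Operator reconciles the job itself.
# Leaving managedBy unset is assumed to behave the same way.
runPolicy:
  managedBy: "kubeflow.org/training-operator"
---
# Delegated: the Training Operator in the manager cluster leaves the job
# alone and MultiKueue dispatches it to a worker cluster.
runPolicy:
  managedBy: "kueue.x-k8s.io/multikueue"
```
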
## Prerequisites

1. Install version 1.9 of the Kubeflow Training Operator and version 0.11 or later of Kueue.
2. Make sure your Kueue build is compiled against that Training Operator version, so that it can use the `spec.runPolicy.managedBy` field.

## Usage

To use the `spec.runPolicy.managedBy` field in your training jobs, include it in the job specification as shown below:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    ...
```

## Example

Here is a complete example of a TensorFlow job that uses the `spec.runPolicy.managedBy` field:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    # Delegate management of this job to MultiKueue.
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
```

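For Kueue to pick up a job at all, the job normally also has to be submitted to a queue. The fragment below is a minimal sketch of how that could look, using Kueue's `kueue.x-k8s.io/queue-name` label. The queue name `user-queue` is a hypothetical LocalQueue and is not part of the example above; replace it with a queue that exists in your namespace.

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
  labels:
    # Hypothetical LocalQueue name, used here only for illustration.
    kueue.x-k8s.io/queue-name: "user-queue"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  # tfReplicaSpecs omitted here; it is unchanged from the complete example.
```
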
## More Details

For more details on setting up and using MultiKueue with the Kubeflow Training Operator, refer to the following documentation pages:

- [Kueue/Kubeflow](https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/kubeflow/)
- [Kueue docs](https://kueue.sigs.k8s.io/docs/concepts/multikueue/)

@@ -1,7 +1,7 @@
+++
title = "TensorFlow Training (TFJob)"
description = "Using TFJob to train a model with TensorFlow"
-weight = 10
+weight = 20
+++

{{% alert title="Old Version" color="warning" %}}