From 0d462bd03e1e4abea3187c4944e23f5f9301e5d9 Mon Sep 17 00:00:00 2001
From: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
Date: Thu, 10 Apr 2025 23:43:14 +0530
Subject: [PATCH] trainer: Add documentation for the MultiKueue and spec.managedBy API (#3956)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* resolves kubeflow/training/#2279

Signed-off-by: Garvit-77

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update content/en/docs/components/training/user-guides/managedby.md

Co-authored-by: Michał Woźniak
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Create managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Delete content/en/docs/components/training/user-guides/managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Comments-updated

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Update tensorflow.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* Updated weight managedby.md

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

* updated

Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>

---------

Signed-off-by: Garvit-77
Signed-off-by: Garvit Khandelwal <70192868+Garvit-77@users.noreply.github.com>
Co-authored-by: Michał Woźniak
---
 .../legacy-v1/user-guides/managedby.md  | 69 +++++++++++++++++++
 .../legacy-v1/user-guides/tensorflow.md |  2 +-
 2 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 content/en/docs/components/trainer/legacy-v1/user-guides/managedby.md

diff --git a/content/en/docs/components/trainer/legacy-v1/user-guides/managedby.md b/content/en/docs/components/trainer/legacy-v1/user-guides/managedby.md
new file mode 100644
index 000000000..69fe9d489
--- /dev/null
+++ b/content/en/docs/components/trainer/legacy-v1/user-guides/managedby.md
@@ -0,0 +1,69 @@
++++
+title = "How to manage Jobs in a multi-cluster environment"
+description = "Using the managedBy field for MultiKueue"
+weight = 10
++++
+
+## Overview
+
+This page describes how to use the `MultiKueue` feature of the Kueue project, specifically for Kubeflow MPI Jobs. `MultiKueue` manages and schedules jobs across multiple queues, which improves resource allocation and the overall efficiency of MPI Jobs.
+The `spec.runPolicy.managedBy` field was introduced in the Kubeflow Training Operator to support MultiKueue. It specifies the entity that manages the job, which makes multi-cluster job dispatching more robust.
+
+## Prerequisites
+
+1. Ensure that you have version 1.9 of the Kubeflow Training Operator installed and version 0.11 or later of Kueue.
+2. Make sure Kueue is compiled against the new operator version so that it can use the `spec.runPolicy.managedBy` field.
+
+## Usage
+
+To use the `spec.runPolicy.managedBy` field in your training jobs, include it in the job specification as shown below:
+
+```yaml
+apiVersion: "kubeflow.org/v1"
+kind: "TFJob"
+metadata:
+  name: "example-tfjob"
+spec:
+  runPolicy:
+    managedBy: "kueue.x-k8s.io/multikueue"
+  tfReplicaSpecs:
+    ...
+```
+
+### Example
+
+Here is a complete example of a TensorFlow job that uses the `spec.runPolicy.managedBy` field:
+
+```yaml
+apiVersion: "kubeflow.org/v1"
+kind: "TFJob"
+metadata:
+  name: "example-tfjob"
+spec:
+  runPolicy:
+    managedBy: "kueue.x-k8s.io/multikueue"
+  tfReplicaSpecs:
+    Chief:
+      replicas: 1
+      template:
+        spec:
+          containers:
+            - name: tensorflow
+              image: tensorflow/tensorflow:latest
+              args: ["python", "model.py"]
+    Worker:
+      replicas: 2
+      template:
+        spec:
+          containers:
+            - name: tensorflow
+              image: tensorflow/tensorflow:latest
+              args: ["python", "model.py"]
+```
+
+## More Details
+
+For more details on setting up and using MultiKueue with the Kubeflow Training Operator, refer to the following documentation pages:
+
+- [Kueue/Kubeflow](https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/kubeflow/)
+- [Kueue Docs](https://kueue.sigs.k8s.io/docs/concepts/multikueue/)
diff --git a/content/en/docs/components/trainer/legacy-v1/user-guides/tensorflow.md b/content/en/docs/components/trainer/legacy-v1/user-guides/tensorflow.md
index 0d2fb5266..6cefb5856 100644
--- a/content/en/docs/components/trainer/legacy-v1/user-guides/tensorflow.md
+++ b/content/en/docs/components/trainer/legacy-v1/user-guides/tensorflow.md
@@ -1,7 +1,7 @@
 +++
 title = "TensorFlow Training (TFJob)"
 description = "Using TFJob to train a model with TensorFlow"
-weight = 10
+weight = 20
 +++
 
 {{% alert title="Old Version" color="warning" %}}