Add Key Features item and following doc:

1. Hierarchical Queue
2. Queue Resource Management
3. Unified Scheduling
Also reconstructed queue.md, moved the usage scenarios to queue resource management, and updated some fields

Signed-off-by: JesseStutler <chenzicong4@huawei.com>
JesseStutler 2025-01-14 12:20:53 +08:00
parent 12169bc5e0
commit 005621aa21
11 changed files with 1242 additions and 159 deletions


@@ -97,23 +97,28 @@
identifier = "concepts"
[[zh.menu.docs]]
name = "生态"
name = "关键特性"
weight = 4
identifier = "features"
[[zh.menu.docs]]
name = "生态"
weight = 5
identifier = "zoology"
[[zh.menu.docs]]
name = "Scheduler"
weight = 5
weight = 6
identifier = "scheduler"
[[zh.menu.docs]]
name = "CLI"
weight = 6
weight = 7
identifier = "cli"
[[zh.menu.docs]]
name = "贡献"
weight = 7
weight = 8
identifier = "contribution"


@@ -73,23 +73,28 @@
identifier = "concepts"
[[docs]]
name = "Ecosystem"
name = "Key Features"
weight = 4
identifier = "features"
[[docs]]
name = "Ecosystem"
weight = 5
identifier = "ecosystem"
[[docs]]
name = "Scheduler"
weight = 5
weight = 6
identifier = "scheduler"
[[docs]]
name = "CLI"
weight = 6
weight = 7
identifier = "cli"
[[docs]]
name = "Contribution"
weight = 7
weight = 8
identifier = "contribution"


@@ -0,0 +1,149 @@
+++
title = "Hierarchical Queue"
date = 2024-12-28
lastmod = 2024-12-28
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.
# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 2
+++
## Background
In multi-tenant scenarios, queues are a core mechanism for achieving fair scheduling, resource isolation, and task priority control. However, in the current version of Volcano, queues only support a flat structure and lack any notion of hierarchy. In practice, different queues often belong to different departments, with hierarchical relationships between departments, which leads to more refined requirements for resource allocation and preemption. To address this, the latest version of Volcano introduces the hierarchical queue feature, significantly enhancing queue capabilities. With this feature, users can achieve finer-grained resource quota management and preemption strategies based on hierarchical queues, building a more efficient unified scheduling platform.
For users of YARN, this feature allows big data workloads to be migrated seamlessly to Kubernetes clusters running Volcano. YARN's Capacity Scheduler already supports hierarchical queues, enabling cross-level resource allocation and preemption, and the latest version of Volcano adopts a similar hierarchical queue design, providing more flexible resource management and scheduling strategies.
## Features Support
- Supports configuring hierarchical relationships between queues.
- Supports resource sharing and reclamation between tasks in cross-level queues.
- Supports setting, for each resource dimension, a resource capacity upper limit `capability`, a resource entitlement `deserved` (if the allocated resources of a queue exceed its `deserved` value, the excess can be reclaimed), and a resource reservation `guarantee` (resources reserved for the queue that cannot be shared with other queues).
## User Guide
### Scheduler Configuration
In the new version, the hierarchical queue capability is built on the `capacity` plugin. The scheduler configuration needs to enable the `capacity` plugin, set `enableHierarchy` to `true`, and enable the `reclaim` action to support resource reclamation between queues. The scheduler configuration example is as follows:
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "allocate, preempt, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # capacity plugin must be enabled
        enableHierarchy: true # enable hierarchical queue
      - name: nodeorder
```
### Building Hierarchical Queues
A new `parent` field has been added to the Queue spec to specify the parent queue:
```go
type QueueSpec struct {
    ...
    // Parent defines the parent of the queue
    // +optional
    Parent string `json:"parent,omitempty" protobuf:"bytes,8,opt,name=parent"`
    ...
}
```
Volcano Scheduler will automatically create a root queue as the root of all queues upon startup. Users can build a hierarchical queue tree based on the root queue, such as the following tree structure:
{{<figure library="1" src="hierarchical-queue-example.png" title="Figure 1: Hierarchical Queue Example" width="50%">}}
```yaml
# The parent of child-queue-a is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-a
spec:
  reclaimable: true
  parent: root
  deserved:
    cpu: 64
    memory: 128Gi
---
# The parent of child-queue-b is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-b
spec:
  reclaimable: true
  parent: root
  deserved:
    cpu: 64
    memory: 128Gi
---
# The parent of subchild-queue-a1 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a1
spec:
  reclaimable: true
  parent: child-queue-a
  # You can set deserved values as needed. If the allocated resources of the queue exceed the deserved value, tasks in the queue can be reclaimed.
  deserved:
    cpu: 32
    memory: 64Gi
---
# The parent of subchild-queue-a2 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a2
spec:
  reclaimable: true
  parent: child-queue-a
  # You can set deserved values as needed. If the allocated resources of the queue exceed the deserved value, tasks in the queue can be reclaimed.
  deserved:
    cpu: 32
    memory: 64Gi
---
# Submit a sample vc-job to the leaf queue subchild-queue-a1
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-a
spec:
  queue: subchild-queue-a1
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - replicas: 1
    name: test
    template:
      spec:
        containers:
        - image: alpine
          command: ["/bin/sh", "-c", "sleep 1000"]
          imagePullPolicy: IfNotPresent
          name: alpine
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```
When cluster resources cannot satisfy a pod's request, resources can be reclaimed from other pods. Reclamation first targets pods in sibling queues (provided a sibling queue's allocated resources exceed its `deserved` value). If the sibling queues still cannot satisfy the request, the queue hierarchy is traversed upward through ancestor queues to find sufficient resources. For example, if job-a and job-c are submitted first and cluster resources are insufficient for job-b, job-b will first reclaim from job-a; if reclaiming job-a does not meet its resource requirements, job-c will then be considered for reclaiming.
Note that in the current version, users can only submit jobs to **leaf queues**. If tasks have already been submitted to a parent queue, child queues cannot be created under that queue. This ensures effective management of resources and tasks across different levels in the queue hierarchy. Additionally, the sum of the `deserved`/`guarantee` values of child queues cannot exceed the `deserved`/`guarantee` values configured for the parent queue. Each child queue's `capability` values cannot exceed the `capability` limits of the parent queue. If a queue does not specify the `capability` value for a certain resource dimension, it will inherit the `capability` from its parent queue. If the parent queue and all ancestor queues do not specify it, the value will finally inherit from the root queue. By default, the root queue's `capability` is set to the total available resources of that dimension in the cluster.
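The parent/child constraints above (children's `deserved` sums bounded by the parent, and `capability` inherited from ancestors) can be illustrated with a small validation sketch. This is hypothetical Python for illustration only, not Volcano code; the queue names mirror the example tree:

```python
# Illustrative sketch (not Volcano code): validate the hierarchical queue
# constraints described above over a simple queue tree.

def validate(queues):
    """queues: name -> {'parent': str|None, 'deserved': int, 'capability': int|None}"""
    errors = []
    children = {}
    for name, q in queues.items():
        if q["parent"]:
            children.setdefault(q["parent"], []).append(name)

    def capability(name):
        # A queue without an explicit capability inherits it from its parent,
        # ultimately falling back to the root queue's capability.
        q = queues[name]
        if q.get("capability") is not None:
            return q["capability"]
        return capability(q["parent"]) if q["parent"] else None

    for parent, kids in children.items():
        # The sum of children's deserved must not exceed the parent's deserved.
        if sum(queues[k]["deserved"] for k in kids) > queues[parent]["deserved"]:
            errors.append(f"deserved overflow under {parent}")
        for k in kids:
            # A child's capability must not exceed its parent's capability.
            if capability(k) > capability(parent):
                errors.append(f"capability overflow at {k}")
    return errors

tree = {
    "root": {"parent": None, "deserved": 128, "capability": 128},
    "child-queue-a": {"parent": "root", "deserved": 64, "capability": None},
    "subchild-queue-a1": {"parent": "child-queue-a", "deserved": 32, "capability": None},
    "subchild-queue-a2": {"parent": "child-queue-a", "deserved": 32, "capability": None},
}
print(validate(tree))  # an empty list means all constraints hold
```

Raising the deserved value of either leaf queue above 32 would violate the parent's 64-CPU entitlement and produce a validation error.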


@@ -3,7 +3,7 @@ title = "Queue"
date = 2019-01-28
lastmod = 2020-08-29
lastmod = 2024-12-30
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
@@ -23,28 +23,73 @@ Queue is a collection of PodGroups, which adopts FIFO. It is also used as the ba
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  creationTimestamp: "2024-12-30T09:31:12Z"
  generation: 1
  name: test
  resourceVersion: "987630"
  uid: 88babd01-c83f-4010-9701-c2471c1dd040
spec:
  capability:
    cpu: "8"
    memory: 16Gi
  # deserved field is only used by capacity plugin
  deserved:
    cpu: "4"
    memory: 8Gi
  guarantee:
    resource:
      cpu: "2"
      memory: 4Gi
  priority: 100
  reclaimable: true
  # weight field is only used by proportion plugin
  weight: 1
status:
  allocated:
    cpu: "0"
    memory: "0"
  state: Open
```
## Key Fields
* guarantee, *optional*
`guarantee` indicates the resources reserved for all PodGroups in this queue. Other queues cannot use these reserved resources.
> **Note**: If a guarantee value needs to be configured, it must be less than or equal to the deserved value.
* deserved, *optional*
`deserved` indicates the expected resource amount for all PodGroups in this queue. If the allocated resources of this queue exceed the configured deserved value, the excess can be reclaimed by other queues.
> **Note**:
>
> 1. This field can only be configured when the capacity plugin is enabled, and must be less than or equal to the capability value. The proportion plugin uses weight to automatically calculate the queue's deserved value. For more information on using the capacity plugin, see: [capacity plugin user guide](https://github.com/volcano-sh/volcano/blob/5b817b1cdf3a5638ba38e934b44af051c9fb419e/docs/user-guide/how_to_use_capacity_plugin.md)
> 2. If the allocated resources of a queue exceed its configured deserved value, the queue cannot reclaim resources from other queues.
* weight, *optional*
`weight` indicates the **relative** weight of a queue in cluster resource division. The deserved resource amount is calculated as **(weight/total-weight) * total-resource**, where `total-weight` is the total weight of all queues and `total-resource` is the total amount of cluster resources. `weight` is a soft constraint.
> **Note**:
>
> 1. This field can only be configured when the proportion plugin is enabled. If weight is not set, it defaults to 1. The capacity plugin does not use this field.
>
> 2. This field is a soft constraint. The deserved value is calculated from weight. When other queues' resource usage is below their deserved values, this queue can exceed its deserved value by borrowing resources from them. However, when cluster resources become scarce and other queues need the borrowed resources for their own tasks, this queue must return them until its usage matches its deserved value. This design ensures maximum utilization of cluster resources.
* capability, *optional*
`capability` indicates the upper limit of resources the queue can use. It is a hard constraint. If this field is not set, the queue's capability defaults to realCapability (total cluster resources minus the total guarantee values of other queues).
* reclaimable, *optional*
`reclaimable` specifies whether other queues may reclaim the extra resources a queue occupies when it uses more resources than it is allocated. The default value is `true`.
* priority, *optional*
`priority` indicates the priority of this queue. During resource allocation and resource preemption/reclamation, higher-priority queues take precedence.
* parent, *optional*
`parent` is used to configure [hierarchical queues](hierarchical_queue.md) and specifies the parent queue. If parent is not specified, the queue is set as a child of the root queue by default.
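The realCapability default described under `capability` can be illustrated with a tiny sketch (hypothetical Python, not Volcano code): a queue with no explicit `capability` is limited to the total cluster resources minus the guarantees of all other queues.

```python
# Illustrative sketch (not Volcano code): the default capability of a queue
# that sets none is realCapability = total cluster resources minus the total
# guarantee values of the OTHER queues.

def real_capability(total, guarantees, queue):
    others = sum(g for name, g in guarantees.items() if name != queue)
    return total - others

guarantees = {"default": 2, "test": 4}  # CPU guarantee per queue
print(real_capability(16, guarantees, "default"))  # 16 - 4 = 12
```

Note how each queue's own guarantee is excluded from the subtraction: a queue can always use its own reserved resources.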
## Status
### Open
`Open` indicates that the queue is available and can accept new PodGroups.
@@ -54,70 +99,11 @@ status:
`Closing` indicates that the queue is becoming unavailable. It is a transient state. A `Closing` queue cannot accept any new PodGroups.
### Unknown
`Unknown` indicates that the queue status is unknown because of unexpected situations such as network jitter.
## Usage
### Weight for Cluster Resource Division - 1
#### Preparations
* A total of 4 CPUs in a cluster are available.
* A queue with `name` set to `default` and `weight` set to `1` has been created by Volcano.
* No running tasks are in the cluster.
#### Operation
1. If no other queues are created, queue `default` can use all CPUs.
2. Create queue `test` whose weight is `3`. The CPU resource allocated to queue `default` changes to 1C and that allocated to queue `test` is 3C because weight(default):weight(test) equals 1:3.
3. Create PodGroups `p1` and `p2`, which belong to queues `default` and `test`, respectively.
4. Create job `j1` that has a CPU request of 1C in `p1`.
5. Create job `j2` that has a CPU request of 3C in `p2`.
6. Check the status of `j1` and `j2`. Both the jobs are running normally.
### Weight for Cluster Resource Division - 2
#### Preparations
* A total of 4 CPUs in a cluster are available.
* A queue with name set to default and weight set to 1 has been created by Volcano.
* No running tasks are in the cluster.
#### Operation
1. If no other queues are created, queue `default` can use all CPUs.
2. Create PodGroup `p1` that belongs to queue `default`.
3. Create job `j1` with a CPU request of 1C and job `j2` with a CPU request of 3C in `p1`. Both the jobs are running normally.
4. Create queue `test` whose weight is `3`. The CPU resource allocated to queue `default` changes to 1C and that allocated to queue `test` is 3C because weight(default):weight(test) equals 1:3. As no tasks in queue `test`, jobs in queue `default` can still run normally.
5. Create PodGroup `p2` that belongs to queue `test`.
6. Create job `j3` that has a CPU request of 3C in `p2`. `j2` will be evicted to return the resources to queue `test`.
### Capability for Overuse of Resources
#### Preparations
* A total of 4 CPUs in a cluster are available.
* A queue with name set to default and weight set to 1 has been created by Volcano.
* No running tasks are in the cluster.
#### Operation
1. Create queue `test` whose `capability` is 2C.
2. Create PodGroup `p1` that belongs to queue `test`.
3. Create job `j1` that has a CPU request of 1C in `p1`. `j1` runs normally.
4. Create job `j2` that has a CPU request of 3C in `p1`. `j2` becomes `pending` because of the limit of `capability`.
### Reclaimable for Resource Return
#### Preparations
* A total of 4 CPUs in a cluster are available.
* A queue with name set to default and weight set to 1 has been created by Volcano.
* No running tasks are in the cluster.
#### Operation
1. Create queue `test` whose `reclaimable` is `false` and `weight` is `1`. The CPU resources allocated to queues `default` and
`test` are both 2C.
2. Create PodGroups `p1` and `p2`, which belong to queues `test` and `default`, respectively.
3. Create job `j1` that has a CPU request of 3C in `p1`. `j1` runs normally because there are no tasks in queue `default`.
4. Create job `j2` that has a CPU request of 2C in `p2`. The status of `j2` is `pending` because `reclaimable` is set to `false` for queue `test`. Queue `test` will NOT return resources to other queues until some tasks in it are completed.
## Note
#### default Queue
When Volcano starts, it automatically creates queue `default` whose `weight` is `1`. Subsequent jobs that are not assigned to a queue will be assigned to queue `default`.
#### Soft Constraint About weight
`weight` determines the resources allocated to a queue, but it is not an upper limit. As the preceding examples show, a queue can use more resources than allocated when there are idle resources in other queues. This is a good characteristic of Volcano and delivers better cluster resource utilization.
#### root queue
When Volcano starts, it also creates a queue named root by default. This queue is used when the [hierarchical queue](/en/docs/hierarchical_queue) feature is enabled, serving as the root queue for all queues, with the default queue being a child queue of the root queue.
> For more information on queue usage scenarios, please refer to [Queue Resource Management](/en/docs/queue_resource_management)


@@ -0,0 +1,241 @@
+++
title = "Queue Resource Management"
date = 2024-12-30
lastmod = 2024-12-30
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
toc_depth = 5
type = "docs" # Do not modify.
# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 1
+++
## Overview
[Queue](/en/docs/queue) is one of the core concepts in Volcano, designed to support resource allocation and task scheduling in multi-tenant scenarios. Through queues, users can implement multi-tenant resource allocation, task priority control, resource preemption and reclamation, significantly improving cluster resource utilization and task scheduling efficiency.
## Core Features
### 1. Flexible Resource Configuration
* Supports multi-dimensional resource quota control (CPU, Memory, GPU, NPU, etc.)
* Provides a three-level resource configuration mechanism:
* capability: upper limit of queue resource usage
* deserved: deserved resource amount (when no other queue submits jobs, jobs in this queue may exceed the deserved value; when multiple queues submit jobs and cluster resources are insufficient, resources exceeding the deserved value can be reclaimed by other queues)
* guarantee: reserved resource amount (reserved resources can only be used by this queue; other queues cannot use them)
> Recommendations and Notes:
>
> 1. When configuring the three levels, follow: guarantee <= deserved <= capability.
> 2. guarantee/capability can be configured as needed; a deserved value must be configured when the capacity plugin is enabled.
> 3. deserved configuration recommendations: for peer queues, the sum of all queues' deserved values should equal the total cluster resources; for hierarchical queues, the sum of the child queues' deserved values should equal, and cannot exceed, the parent queue's deserved value.
> 4. capability configuration notes: in hierarchical queue scenarios, a child queue's capability cannot exceed its parent queue's capability. If a child queue's capability is not set, it inherits the parent queue's capability value.
* Supports dynamic resource quota adjustment
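The ordering rule in note 1 above can be sketched as a quick validation check. This is illustrative Python, not part of Volcano; `check_queue_config` is a hypothetical helper name:

```python
# Illustrative sketch: check the recommended ordering
# guarantee <= deserved <= capability for each resource dimension of a queue.

def check_queue_config(cfg):
    ok = True
    for res in cfg.get("deserved", {}):
        g = cfg.get("guarantee", {}).get(res, 0)
        d = cfg["deserved"][res]
        c = cfg.get("capability", {}).get(res, float("inf"))  # unset -> unbounded here
        ok = ok and (g <= d <= c)
    return ok

cfg = {"guarantee": {"cpu": 2}, "deserved": {"cpu": 4}, "capability": {"cpu": 8}}
print(check_queue_config(cfg))  # True
```

A config with guarantee greater than deserved (e.g. guarantee cpu 6, deserved cpu 4) would fail this check.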
### 2. Hierarchical Queue Management
* Supports [hierarchical queue](/en/docs/hierarchical_queue) structure
* Provides resource inheritance and isolation between parent and child queues
* Compatible with YARN-style resource management, facilitating big data workload migration
* Supports cross-level queue resource sharing and reclamation
### 3. Intelligent Resource Scheduling
* Resource borrowing: Allows queues to use idle resources from other queues
* Resource reclamation: Prioritizes reclaiming excess resources when resources are tight
* Resource preemption: Ensures resource requirements for high-priority tasks
### 4. Multi-tenant Isolation
* Strict resource quota control
* Priority-based resource allocation
* Prevents single tenant from over-consuming resources
## Queue Scheduling Implementation
### Queue-related Actions
Queue scheduling in Volcano involves the following core actions:
1. `enqueue`: Controls job admission into queues, decides whether to allow new jobs based on queue resource quotas and current usage.
2. `allocate`: Handles resource allocation process, ensures allocations comply with queue quota limits while supporting resource borrowing between queues to improve utilization.
3. `preempt`: Supports resource preemption **within queues**. High-priority jobs can preempt resources from lower-priority jobs in the same queue, ensuring timely execution of critical tasks.
4. `reclaim`: Supports resource reclamation **between queues**. Triggers when queue resources are tight. Prioritizes reclaiming resources exceeding queue's deserved value, considering queue/job priorities when selecting victims.
> **Note**
> The enqueue action conflicts with the reclaim/preempt actions. If the enqueue action determines that a PodGroup is not allowed to enter the queue, the vc-controller will not create pods in the Pending state, so the reclaim and preempt actions will not execute.
### Queue Scheduling Plugins
Volcano provides two core queue scheduling plugins:
#### capacity plugin
[capacity plugin](https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_capacity_plugin.md) supports setting queue's deserved resource amount through explicit configuration, as shown in this example:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: capacity-queue
spec:
  deserved:
    cpu: "10"
    memory: "20Gi"
  capability:
    cpu: "20"
    memory: "40Gi"
```
The capacity plugin enables quota control through precise resource configuration. Combined with [hierarchical queues](/en/docs/hierarchical_queue), it can achieve more fine-grained multi-tenant resource allocation and facilitates big data workload migration to Kubernetes clusters.
> **Note**: When using cluster autoscaling components such as Cluster Autoscaler or Karpenter, total cluster resources change dynamically. In this case, using the capacity plugin requires manual adjustment of queues' deserved values to adapt to resource changes.
#### proportion plugin
Unlike the capacity plugin, the proportion plugin automatically calculates each queue's deserved resource amount from its configured weight value, without explicitly configuring deserved values:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: proportion-queue
spec:
  weight: 1
  capability:
    cpu: "20"
    memory: "40Gi"
```
When the total cluster resource is `total_resource`, each queue's deserved value is calculated as:
```
queue_deserved = (queue_weight / total_weight) * total_resource
```
Here `queue_weight` is the current queue's weight, `total_weight` is the sum of all queue weights, and `total_resource` is the total amount of cluster resources.
While the capacity plugin requires directly configuring each queue's deserved value, the proportion plugin derives it automatically from weight ratios. When cluster resources change (e.g., through Cluster Autoscaler or Karpenter scaling), the proportion plugin automatically recalculates each queue's deserved value based on weight ratios, with no manual intervention required.
> **Important Note**: The actual deserved value is adjusted dynamically. If the calculated `queue_deserved` is greater than the total resource requests of the PodGroups waiting to be scheduled in the queue, the final deserved value is capped at that total request amount, avoiding over-reservation of resources and improving overall utilization.
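Putting the weight formula and the dynamic cap together, the calculation can be sketched as follows (illustrative Python, not the scheduler's actual implementation):

```python
# Illustrative sketch: proportion-plugin deserved calculation, capped by the
# queue's pending resource requests as described in the note above.

def proportion_deserved(weight, total_weight, total_resource, pending_requests):
    share = weight / total_weight * total_resource  # weight-based share
    return min(share, pending_requests)             # dynamic cap

# default:test = 1:3 over a 4-CPU cluster; the test queue only asks for 2 CPUs,
# so its deserved value is capped at 2 rather than its 3-CPU weight share.
print(proportion_deserved(3, 4, 4, pending_requests=2))
```

With larger pending demand (say 10 CPUs), the same queue's deserved value would be its full weight share of 3 CPUs.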
> **Note**:
> 1. The capacity plugin and the proportion plugin are mutually exclusive and cannot be enabled at the same time.
> 2. The choice between them depends on whether you want to set resource amounts directly (capacity) or have them calculated automatically from weights (proportion).
> 3. Since Volcano v1.9.0, the capacity plugin is recommended, as it provides more intuitive resource configuration.
#### Usage Example
The following example demonstrates a typical queue resource management scenario through 4 steps to illustrate the resource reclamation mechanism:
**Step 1: Initial State**
In the initial cluster state, default queue can use all resources (4C).
**Step 2: Create Initial Jobs**
Create two jobs in default queue requesting 1C and 3C resources:
```yaml
# job1.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job1
spec:
  queue: default
  tasks:
  - replicas: 1
    template:
      spec:
        containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "1"
---
# job2.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job2
spec:
  queue: default
  tasks:
  - replicas: 1
    template:
      spec:
        containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "3"
```
At this point, both jobs can run normally as they can temporarily use resources exceeding deserved amount.
**Step 3: Create New Queue**
Create test queue and set resource ratio. You can choose either capacity plugin or proportion plugin:
```yaml
# Using capacity plugin
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: true
  deserved:
    cpu: 3
```
or
```yaml
# Using proportion plugin
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: true
  weight: 3 # Resource allocation ratio default:test = 1:3
```
**Step 4: Trigger Resource Reclamation**
Create job3 in test queue requesting 3C resources (configuration similar to job2, just change queue to test):
```yaml
# job3.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job3
spec:
  queue: test # Change queue to test
  tasks:
  - replicas: 1
    template:
      spec:
        containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "3"
```
After submitting job3, system starts resource reclamation:
* System reclaims resources exceeding deserved amount from default queue
* job2 (3C) is evicted
* job1 (1C) continues running
* job3 (3C) starts running
This scenario works with both capacity plugin and proportion plugin:
* capacity plugin: Directly configure deserved values (default=1C, test=3C)
* proportion plugin: Configure weight values (default=1, test=3) resulting in the same deserved values
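The eviction outcome above (default queue capped at 1C deserved, running jobs of 1C and 3C) can be sketched as simple arithmetic. This is illustrative Python; the largest-first victim order is a simplifying assumption for the sketch, not necessarily the scheduler's exact victim-selection policy:

```python
# Illustrative sketch of the reclaim outcome above: after the test queue is
# created, default's deserved CPU drops to 1, so jobs exceeding it are evicted.

def reclaim_victims(jobs, allocated, deserved):
    """Evict jobs (largest request first) until the queue fits its deserved value."""
    victims = []
    for name, cpu in sorted(jobs.items(), key=lambda kv: -kv[1]):
        if allocated <= deserved:
            break
        victims.append(name)
        allocated -= cpu
    return victims

default_jobs = {"job1": 1, "job2": 3}  # CPU requests in the default queue
print(reclaim_victims(default_jobs, allocated=4, deserved=1))  # ['job2']
```

Evicting job2 (3C) alone brings the default queue down to its 1C deserved amount, so job1 keeps running, matching the scenario's outcome.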


@@ -0,0 +1,146 @@
+++
title = "Unified Scheduling"
date = 2024-12-30
lastmod = 2024-12-30
draft = false
toc = true
type = "docs"
[menu.docs]
parent = "features"
weight = 3
+++
## Overview
As the industry's leading cloud-native batch processing system scheduler, Volcano achieves support for all types of workloads through a unified scheduling system:
- Powerful batch scheduling capabilities: full support for mainstream AI and big data frameworks such as Ray, TensorFlow, PyTorch, MindSpore, Spark, and Flink through VcJob
- Complete Kubernetes workload support: Direct scheduling of native workloads like Deployment, StatefulSet, Job, DaemonSet
This unified scheduling capability allows users to manage all types of workloads using a single scheduler, greatly simplifying cluster management complexity.
## Compatible with Kubernetes Scheduling Capabilities
Volcano achieves full compatibility with Kubernetes scheduling mechanisms through the implementation of two core scheduling plugins: predicates and nodeorder. These plugins correspond to the "PreFilter/Filter" and "Score" stages in the Kubernetes scheduling framework.
### 1. predicates plugin
Volcano fully implements the PreFilter-Filter stages from Kube-Scheduler, including:
- Basic resource filtering: node schedulability, Pod count limits, etc.
- Affinity/Anti-affinity: node affinity, inter-Pod affinity, etc.
- Resource constraints: node ports, volume limits, etc.
- Topology distribution: Pod topology distribution constraints, etc.
In addition to the Kubernetes-compatible filters, Volcano provides the following enhanced features:
#### Node Filtering Result Cache (PredicateWithCache)
When the scheduler selects nodes for Pods, it needs to perform a series of checks (such as resource availability and affinity requirements). These check results can be cached. If a Pod with an identical configuration needs to be scheduled shortly after, the previous check results can be reused, avoiding repeated node filtering calculations and significantly improving scheduling performance when creating Pods in batches.
##### Configuration
Enable caching in volcano-scheduler-configmap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: predicates
        arguments:
          predicate.CacheEnable: true # Enable node filtering result cache
```
##### Use Cases
1. Creating multiple Pods with identical configuration
- Example: Creating multiple identical TensorFlow training tasks
- After the first Pod completes node filtering, subsequent Pods can use cached results
2. Large-scale cluster scheduling optimization
> **Note**:
>
> - Only static check results are cached (like node labels, taints)
> - Dynamic resource-related checks (like CPU, memory usage) are recalculated each time
> - Related cache is automatically invalidated when node status changes
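The caching idea can be sketched as a simple memo table (illustrative Python, not Volcano's implementation): static check results are keyed by the Pod's configuration signature and the node, while dynamic resource checks always run.

```python
# Illustrative sketch of predicate caching: static results (label matching)
# are memoized per (pod signature, node), dynamic checks are recomputed.

static_cache = {}
static_calls = 0  # counts how many times the static check actually runs

def static_fit(selector, node):
    global static_calls
    key = (selector, node["name"])  # selector: frozenset of required label pairs
    if key not in static_cache:
        static_calls += 1
        static_cache[key] = selector <= frozenset(node["labels"].items())
    return static_cache[key]

def dynamic_fit(cpu_request, node):
    return cpu_request <= node["free_cpu"]  # recomputed on every call

node = {"name": "n1", "labels": {"gpu": "true"}, "free_cpu": 4}
selector = frozenset({"gpu": "true"}.items())

# Schedule three identically configured pods: the static check runs only once.
results = [static_fit(selector, node) and dynamic_fit(1, node) for _ in range(3)]
print(results, static_calls)
```

This mirrors the note above: only static facts (labels, taints) are safe to cache, and the cache entry must be dropped whenever the node's status changes.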
### 2. nodeorder plugin
Volcano is fully compatible with Kubernetes default scoring mechanism and implements a configurable weight system for more flexible node selection strategies. Additionally, Volcano implements parallel scoring processing, significantly improving scheduling efficiency in large-scale clusters, particularly suitable for AI training and other batch processing scenarios.
#### Supported Scoring Dimensions
1. **Resource Dimension**
- `leastrequested`: Prefer nodes with fewer resource requests, suitable for resource spreading
- `mostrequested`: Prefer nodes with more resource requests, suitable for resource packing
- `balancedresource`: Seek balance between CPU, memory and other resources, avoid single resource bottlenecks
2. **Affinity Dimension**
- `nodeaffinity`: Score based on node affinity rules
- `podaffinity`: Score based on inter-Pod affinity rules
- `tainttoleration`: Score based on node taints and Pod tolerations
3. **Other Dimensions**
- `imagelocality`: Prefer nodes that already have required container images
- `podtopologyspread`: Ensure Pods are evenly distributed across different topology domains
#### Configuration Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: nodeorder
        arguments:
          # Resource dimension weights
          leastrequested.weight: 1    # Default weight is 1
          mostrequested.weight: 0     # Default weight is 0 (disabled by default)
          balancedresource.weight: 1  # Default weight is 1
          # Affinity dimension weights
          nodeaffinity.weight: 2      # Default weight is 2
          podaffinity.weight: 2       # Default weight is 2
          tainttoleration.weight: 3   # Default weight is 3
          # Other dimension weights
          imagelocality.weight: 1     # Default weight is 1
          podtopologyspread.weight: 2 # Default weight is 2
```
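With these weights, a node's final score is the weighted sum of the individual plugin scores. A minimal sketch follows (illustrative Python; the per-dimension scores in the example are made-up values, not real scheduler output):

```python
# Illustrative sketch: combine per-dimension node scores using the weights
# from the nodeorder configuration above.

weights = {
    "leastrequested": 1, "balancedresource": 1,
    "nodeaffinity": 2, "podaffinity": 2, "tainttoleration": 3,
    "imagelocality": 1, "podtopologyspread": 2,
}

def node_score(scores):
    # Dimensions absent from the config contribute nothing.
    return sum(weights.get(dim, 0) * s for dim, s in scores.items())

# Hypothetical per-dimension scores for one node:
print(node_score({"leastrequested": 8, "tainttoleration": 10, "imagelocality": 5}))
# 1*8 + 3*10 + 1*5 = 43
```

Raising `tainttoleration.weight` shifts the ranking toward nodes whose taints the Pod tolerates best, which is how the weight system steers node selection.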
### Advantages of Unified Scheduling
As a general-purpose batch computing system, Volcano extends Kubernetes native scheduling capabilities with the following key advantages:
#### 1. Rich Ecosystem Support
* **Complete Framework Support**
- Supports mainstream AI training frameworks including Ray, TensorFlow, PyTorch, MindSpore
- Supports big data processing frameworks like Spark, Flink
- Supports high-performance computing frameworks like MPI
* **Heterogeneous Device Support**
- Supports GPU (CUDA/MIG) scheduling
- Supports NPU scheduling
#### 2. Enhanced Scheduling Capabilities
* **Gang Scheduling**
- Supports job-level scheduling
- Prevents resource fragmentation
- Suitable for distributed training scenarios
* **Queue Resource Management**
- Supports multi-tenant resource isolation
- Supports resource borrowing and reclamation between queues
- Supports resource quota management
#### 3. Unified Resource Management
* **Unified Resource View**
- Unified management of CPU, memory, GPU/NPU and other heterogeneous resources
- Implements resource sharing and isolation
- Improves overall resource utilization


@@ -0,0 +1,145 @@
+++
title = "Hierarchical Queue"
date = 2024-12-28
lastmod = 2024-12-28
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.
# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 2
+++
## Background
In multi-tenant scenarios, queues are a core mechanism for achieving fair scheduling, resource isolation, and task priority control. However, in the current version of Volcano, queues only support a flat structure and lack hierarchical concepts. In practical applications, different queues often belong to different departments, with hierarchical relationships between departments, leading to more refined requirements for resource allocation and preemption. To address this, Volcano latest version introduces the hierarchical queue feature, significantly enhancing queue capabilities. With this feature, users can achieve finer-grained resource quota management and preemption strategies based on hierarchical queues, building a more efficient unified scheduling platform.
For users of YARN, Volcano enables a seamless migration of big data workloads onto Kubernetes clusters. YARN's Capacity Scheduler already provides hierarchical queues, supporting cross-level resource allocation and preemption; the Volcano latest version adopts a similar hierarchical queue design while offering more flexible resource management and scheduling strategies.
## Supported Features
- Configuring hierarchical relationships between queues
- Resource sharing and reclamation between tasks in queues across levels
- Setting, for each resource dimension, a queue capacity upper bound `capability`, a deserved amount `deserved` (if the resources already allocated to a queue exceed the configured `deserved` value, resources in the queue can be reclaimed), and a reserved amount `guarantee` (resources reserved for this queue that cannot be shared with other queues)
## User Guide
### Scheduler Configuration
In the new version, the hierarchical queue capability is built on the `capacity` plugin. The scheduler configuration must enable the `capacity` plugin and set `enableHierarchy` to true, and the `reclaim` action must also be enabled to support resource reclamation between queues. An example scheduler configuration is shown below:
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
name: volcano-scheduler-configmap
namespace: volcano-system
data:
volcano-scheduler.conf: |
actions: "allocate, preempt, reclaim"
tiers:
- plugins:
- name: priority
- name: gang
enablePreemptable: false
- plugins:
- name: drf
enablePreemptable: false
- name: predicates
- name: capacity # The capacity plugin must be enabled
enableHierarchy: true # Enable hierarchical queues
- name: nodeorder
```
### Building Hierarchical Queues
A `parent` field has been added to the Queue spec to specify the queue's parent queue:
```go
type QueueSpec struct {
...
// Parent define the parent of queue
// +optional
Parent string `json:"parent,omitempty" protobuf:"bytes,8,opt,name=parent"`
...
}
```
After startup, the Volcano scheduler creates a root queue by default as the root of all queues. Users can build a hierarchical queue tree on top of the root queue, for example the following tree structure:
{{<figure library="1" src="hierarchical-queue-example.png" title="Figure 1: Hierarchical queue example" width="50%">}}
```yaml
#The parent of child-queue-a is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: child-queue-a
spec:
reclaimable: true
parent: root
deserved:
cpu: 64
memory: 128Gi
---
#The parent of child-queue-b is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: child-queue-b
spec:
reclaimable: true
parent: root
deserved:
cpu: 64
memory: 128Gi
---
#The parent of subchild-queue-a1 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: subchild-queue-a1
spec:
reclaimable: true
parent: child-queue-a
#Set deserved as needed; if the resources already allocated to the queue exceed the deserved value, tasks in the queue can be preempted
deserved:
cpu: 32
memory: 64Gi
---
#The parent of subchild-queue-a2 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: subchild-queue-a2
spec:
reclaimable: true
parent: child-queue-a
#Set deserved as needed; if the resources already allocated to the queue exceed the deserved value, tasks in the queue can be preempted
deserved:
cpu: 32
memory: 64Gi
---
# Submit a sample vc-job to the leaf queue subchild-queue-a1
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: job-a
spec:
queue: subchild-queue-a1
schedulerName: volcano
minAvailable: 1
tasks:
- replicas: 1
name: test
template:
spec:
containers:
- image: alpine
command: ["/bin/sh", "-c", "sleep 1000"]
imagePullPolicy: IfNotPresent
name: alpine
resources:
requests:
cpu: "1"
memory: 2Gi
```
When cluster resources are insufficient for a pod to be deployed, resources occupied by pods can be preempted. For pods in different queues, pods in sibling queues are preempted first (if the resources already allocated to a sibling queue exceed its `deserved` value). If the resources in sibling queues are not enough to satisfy the pod's demand, the search moves up the queue hierarchy, level by level through the ancestor queues, until enough resources are found. In the figure, job-a and job-c are submitted first; when cluster resources cannot satisfy job-b's demand, job-b first preempts job-a, and if resources are still insufficient after preempting job-a, preempting job-c is considered next.
Note that in the current version, users can only submit jobs to **leaf queues**, and if tasks have already been submitted to a parent queue, child queues cannot be created under it; this ensures effective management of resources and tasks at different levels of the queue hierarchy. Also note that the sum of the child queues' `deserved`/`guarantee` values must not exceed the parent queue's configured `deserved`/`guarantee` values, and each child queue's `capability` must not exceed the parent queue's `capability` limit. If a queue does not set the `capability` for some resource dimension, that dimension's `capability` is inherited from its parent queue; if neither the parent nor any ancestor sets it, it is ultimately inherited from the root queue, whose `capability` defaults to the total available amount of that resource dimension in the cluster.
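The sibling-first, then level-by-level ancestor search described above can be sketched in a few lines. This is an illustrative Python model of the victim search, not Volcano's implementation; the queue names follow the example tree and the allocation numbers are made up:

```python
# Illustrative model of hierarchical reclaim victim search (not Volcano source).
# A queue is a reclaim candidate when its allocation exceeds its deserved share.

class Queue:
    def __init__(self, name, deserved, allocated=0, parent=None):
        self.name, self.deserved, self.allocated = name, deserved, allocated
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

def reclaim_candidates(queue, needed):
    """Walk up from `queue`: at each level, collect sibling subtrees that are
    over their deserved share, climbing to ancestors until `needed` is covered."""
    victims, reclaimed = [], 0
    node = queue
    while node.parent and reclaimed < needed:
        for sibling in node.parent.children:
            if sibling is node:
                continue
            over = sibling.allocated - sibling.deserved
            if over > 0:                      # sibling exceeds deserved: reclaimable
                victims.append(sibling.name)
                reclaimed += over
        node = node.parent                    # not enough at this level: go up
    return victims, reclaimed

root = Queue("root", deserved=128)
a = Queue("child-queue-a", deserved=64, parent=root)
b = Queue("child-queue-b", deserved=64, allocated=80, parent=root)
a1 = Queue("subchild-queue-a1", deserved=32, parent=a)
a2 = Queue("subchild-queue-a2", deserved=32, allocated=40, parent=a)

# a1 needs 12 CPUs: sibling a2 is only 8 over deserved, so the search climbs
# one level and also selects child-queue-b (16 over deserved).
print(reclaim_candidates(a1, needed=12))  # → (['subchild-queue-a2', 'child-queue-b'], 24)
```

The model deliberately ignores job priorities and per-pod granularity; in the real scheduler, victim selection also considers queue/job priority and the `reclaimable` flag.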
@ -3,7 +3,7 @@ title = "Queue"
date = 2019-01-28
lastmod = 2020-09-03
lastmod = 2024-12-30
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
@ -24,36 +24,77 @@ A queue is a queue holding a group of **podgroups**, and is also the basis on which that group of podgroups obtains cluster res
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
creationTimestamp: "2020-08-10T11:54:36Z"
creationTimestamp: "2024-12-30T09:31:12Z"
generation: 1
name: default
resourceVersion: "559"
selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/default
uid: 14082e4c-bef6-4248-a414-1e06d8352bf0
name: test
resourceVersion: "987630"
uid: 88babd01-c83f-4010-9701-c2471c1dd040
spec:
reclaimable: true
weight: 1
capability:
cpu: "8"
memory: 16Gi
# The deserved field is only used by the capacity plugin
deserved:
cpu: "4"
memory: "4096Mi"
memory: 8Gi
guarantee:
resource:
cpu: "2"
memory: 4Gi
priority: 100
reclaimable: true
# The weight field is only used by the proportion plugin
weight: 1
status:
allocated:
cpu: "0"
memory: "0"
state: Open
```
### Key Fields
* weight
* guarantee, *optional*
weight indicates the **relative** share of this queue when cluster resources are divided; the total amount of resources the queue deserves is **(weight/total-weight) * total-resource**, where
guarantee indicates the resources this queue reserves for all of the podgroups in it; other queues cannot use this part of the resources.
> **Note**: If a guarantee value is configured, it must be less than or equal to the configured deserved value.
* deserved, *optional*
deserved indicates the amount of resources that all podgroups in this queue deserve; if the resources already allocated to this queue exceed the configured deserved value, the allocated resources in the queue can be reclaimed by other queues.
> **Note**:
>
> 1. This field can be configured as needed only when the capacity plugin is enabled, and must be less than or equal to the capability value; the proportion plugin uses weight to compute the queue's deserved value automatically. For details on using the capacity plugin, see the [capacity plugin User Guide](https://github.com/volcano-sh/volcano/blob/5b817b1cdf3a5638ba38e934b44af051c9fb419e/docs/user-guide/how_to_use_capacity_plugin.md)
> 2. If the resources already allocated to a queue exceed its configured deserved value, the queue can no longer reclaim resources from other queues.
<!--The capacity plugin user guide currently links to GitHub; if a capacity plugin guide is later added to the official website docs, replace this with the official website link-->
* weight, *optional*
weight indicates the **relative** share of this queue when cluster resources are divided; the queue's deserved amount of resources is computed as **(weight/total-weight) * total-resource**, where
total-weight is the sum of the weights of all queues and total-resource is the total amount of cluster resources. weight is a **soft constraint**, with a value range of [1, 2^31-1].
* capability
> **Note**:
>
> 1. This field can be configured as needed only when the proportion plugin is enabled; if weight is not set, it defaults to 1. The capacity plugin does not use this field.
> 2. This field is a soft constraint. The deserved value is computed from weight; while resource usage in other queues has not reached their deserved values, this queue's usage may exceed its deserved value (that is, it may borrow resources from other queues). But when cluster resources run short and tasks in other queues need the borrowed resources back, this queue must return them until its usage falls back to the deserved value. This design maximizes cluster resource utilization.
capability indicates the upper limit on the sum of resources used by all podgroups in this queue; it is a **hard constraint**.
* capability, *optional*
* reclaimable
capability indicates the upper limit on the sum of resources used by all podgroups in this queue; it is a **hard constraint**. If this field is not set, the queue's capability is set to realCapability (the total cluster resources minus the total guarantee of all other queues).
* reclaimable, *optional*
reclaimable indicates whether this queue allows other queues to reclaim the extra resources it is using when its resource usage exceeds its deserved share; the default is **true**.
* priority, *optional*
priority indicates the priority of this queue; during resource allocation and resource preemption/reclamation, higher-priority queues are given precedence in allocating/preempting/reclaiming resources.
* parent, *optional*
This field is used to configure [hierarchical queues](/zh/docs/hierarchical_queue). parent specifies the queue's parent queue; if parent is not specified, the queue becomes a child of the root queue by default.
### Resource States
* Open
@ -71,65 +112,13 @@ reclaimable indicates whether this queue, when its resource usage exceeds its deserved share
The queue is currently in an unknown state; its state may be temporarily unobservable due to network or other issues.
## Usage Scenarios
### Resource partitioning by weight - 1
#### Background:
* The total cluster CPU is 4C
* A queue named default with weight 1 has been created by default
* No tasks are running in the cluster
#### Steps:
1. In this situation, the default queue can use all cluster resources, i.e. 4C.
2. Create a queue named test with weight 3. Now default weight:test weight = 1:3, i.e. the default queue may use 1C and the test queue may use 3C.
3. Create podgroups named p1 and p2, belonging to the default queue and the test queue respectively.
4. Submit job1 and job2 to p1 and p2 respectively, requesting 1C and 3C; both jobs run normally.
### Resource partitioning by weight - 2
#### Background:
* The total cluster CPU is 4C
* A queue named default with weight 1 has been created by default
* No tasks are running in the cluster
#### Steps:
1. In this situation, the default queue can use all cluster resources, i.e. 4C.
2. Create a podgroup named p1 belonging to the default queue.
3. Create jobs named job1 and job2 in p1, requesting 1C and 3C respectively; both run normally.
4. Create a queue named test with weight 3. Now default weight:test weight = 1:3, i.e. the default queue may use 1C and the test queue may use 3C. But since the test
queue has no tasks yet, job1 and job2 keep running normally.
5. Create a podgroup named p2 belonging to the test queue.
6. Create a job named job3 in p2, requesting 3C. Now job2 is evicted and its resources are returned to job3, i.e. the default queue returns 3C to the test queue.
### Using capability
#### Background:
* The total cluster CPU is 4C
* A queue named default with weight 1 has been created by default
* No tasks are running in the cluster
#### Steps:
1. Create a queue named test with capability cpu set to 2C, i.e. the test queue may use at most 2C.
2. Create a podgroup named p1 belonging to the test queue.
3. Create jobs named job1 and job2 in p1, requesting 1C and 3C, submitted in order. Due to the capability limit, job1 runs normally while job2 stays pending.
### Using reclaimable
#### Background:
* The total cluster CPU is 4C
* A queue named default with weight 1 has been created by default
* No tasks are running in the cluster
#### Steps:
1. Create a queue named test with reclaimable set to false and weight 1. Now default weight:test weight = 1:1, i.e. both the default queue and the test queue may use 2C.
2. Create podgroups named p1 and p2, belonging to the test queue and the default queue respectively.
3. Create a job named job1 in p1, requesting 3C; job1 runs normally. Since the default queue has no tasks yet, the test queue occupies 1C extra.
4. Create a job named job2 in p2, requesting 2C. After submission the task stays pending, because the test queue's reclaimable being false keeps it from returning the extra resources it occupies.
### Notes
#### default queue
After Volcano starts, a queue named default with weight 1 is created by default. Jobs submitted later without a specified queue belong to the default queue.
#### Soft constraint of weight
The soft constraint of weight means that the share of resources a queue deserves, as determined by weight, is not a hard cap on usage. When other queues' resources are not fully utilized, a queue that needs more resources may temporarily over-occupy them. But if
other queues later submit tasks that need those resources, tasks over-occupying resources in this queue are evicted to bring it back to the share determined by weight (provided the queue's reclaimable is true). This design
maximizes cluster resource utilization.
* default queue
After Volcano starts, a queue named default is created by default. Jobs submitted later without a specified queue belong to the default queue.
* root queue
After Volcano starts, a queue named root is also created by default. It is used when the [hierarchical queue](/zh/docs/hierarchical_queue) feature is enabled, serving as the root of all queues; the default queue is a child of the root queue.
> For detailed usage scenarios of queues, see [Queue Resource Management](/zh/docs/queue_resource_management)
@ -0,0 +1,241 @@
+++
title = "Queue Resource Management"
date = 2024-12-30
lastmod = 2024-12-30
draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
toc_depth = 5
type = "docs" # Do not modify.
# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 1
+++
## Overview
[Queues](/zh/docs/queue) are one of Volcano's core concepts, designed to support resource allocation and task scheduling in multi-tenant scenarios. Through queues, users can implement multi-tenant resource allocation, task priority control, and resource preemption and reclamation, significantly improving cluster resource utilization and task scheduling efficiency.
## Core Capabilities
### 1. Flexible Resource Configuration
* Supports resource quota control across multiple dimensions (CPU, memory, GPU, NPU, etc.)
* Provides a three-level resource configuration mechanism:
* capability: the upper limit of the queue's resource usage
* deserved: the deserved amount of resources (when no other queue submits jobs, jobs in this queue may use more than the deserved value; when multiple queues submit jobs and cluster resources run short, the resources in excess of the deserved value can be reclaimed by other queues)
* guarantee: the reserved amount of resources (reserved resources can be used only by this queue and are unavailable to other queues)
> Recommendations and notes:
>
> 1. When configuring the three levels, follow: guarantee <= deserved <= capability
> 2. guarantee/capability can be configured as needed; the deserved value must be configured when the capacity plugin is enabled
> 3. Recommendations for configuring deserved: in flat-queue scenarios, the sum of the deserved values of all queues should equal the total cluster resources; in hierarchical-queue scenarios, the sum of the child queues' deserved values should equal the parent queue's deserved value, and must not exceed it.
> 4. Notes for configuring capability: in hierarchical-queue scenarios, a child queue's capability must not exceed the parent queue's capability; if a child queue's capability is not set, it inherits the parent queue's capability value.
* Supports dynamic resource quota adjustment
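The guarantee <= deserved <= capability ordering above can be checked mechanically, per resource dimension. The following is a minimal sketch with a hypothetical helper, not part of Volcano:

```python
# Illustrative check of the three-level queue quota ordering
# guarantee <= deserved <= capability, applied per resource dimension.

def valid_quota(guarantee, deserved, capability):
    """Each argument maps resource name -> amount; a missing key defaults
    to 0 for guarantee/deserved and to 'unlimited' for capability."""
    dims = set(guarantee) | set(deserved) | set(capability)
    for d in dims:
        g = guarantee.get(d, 0)
        des = deserved.get(d, 0)
        cap = capability.get(d, float("inf"))
        if not (g <= des <= cap):
            return False
    return True

print(valid_quota({"cpu": 2}, {"cpu": 4}, {"cpu": 8}))  # → True
print(valid_quota({"cpu": 5}, {"cpu": 4}, {"cpu": 8}))  # → False (guarantee > deserved)
```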
### 2. Hierarchical Queue Management
* Supports a multi-level [hierarchical queue](/zh/docs/hierarchical_queue) structure
* Provides resource inheritance and isolation between parent and child queues
* Compatible with YARN-style resource management, easing migration of big data workloads
* Supports resource sharing and reclamation across queue levels
### 3. Intelligent Resource Scheduling
* Resource borrowing: allows a queue to use the idle resources of other queues
* Resource reclamation: when resources run short, resources used in excess are reclaimed first
* Resource preemption: ensures the resource demands of high-priority tasks
### 4. Multi-Tenant Isolation
* Strict resource quota control
* Priority-based resource allocation
* Prevents any single tenant from over-consuming resources
## How Queue Scheduling Works
### Queue-Related Actions
Queue scheduling in Volcano involves the following core actions:
1. `enqueue`: controls the admission of jobs into a queue, deciding whether a new job may enter based on the queue's resource quota and current usage.
2. `allocate`: handles resource allocation, ensuring allocations respect queue quota limits while supporting resource borrowing between queues to improve resource utilization.
3. `preempt`: supports resource preemption **within a queue**. High-priority jobs can preempt the resources of low-priority jobs in the same queue, ensuring timely execution of critical tasks.
4. `reclaim`: supports resource reclamation **between queues**. Reclamation is triggered when a queue runs short of resources; resources in excess of a queue's deserved value are reclaimed first, and victims are selected according to queue/job priorities.
> **Note**:
> The enqueue action conflicts with the reclaim/preempt actions: if the enqueue action decides that a podgroup may not enter the queue, the vc-controller will not create pods in the pending state, and the reclaim/preempt actions will not execute either.
### Queue Scheduling Plugins
Volcano provides two core queue scheduling plugins:
#### capacity plugin
The [capacity plugin](https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_capacity_plugin.md) allows the queue's deserved amount of resources to be set by explicitly configuring the deserved value, as in the following queue configuration example:
<!--The capacity plugin introduction currently links to the user guide in the main Volcano repository; it should later be updated to the guide on the official website-->
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: capacity-queue
spec:
deserved:
cpu: "10"
memory: "20Gi"
capability:
cpu: "20"
memory: "40Gi"
```
The capacity plugin performs quota control through exact resource configuration. Combined with [hierarchical queues](/zh/docs/hierarchical_queue), it enables finer-grained multi-tenant resource allocation and also eases migration of big data workloads onto Kubernetes clusters.
> **Note**: When using cluster autoscaling components such as Cluster Autoscaler or Karpenter, the total cluster resources change dynamically. In that case, using the capacity plugin requires manually adjusting the queues' deserved values to track the resource changes.
#### proportion plugin
Unlike the capacity plugin, the proportion plugin computes each queue's deserved amount of resources automatically from the queue's weight value, with no need to configure the deserved value explicitly, as in the following queue configuration example:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: proportion-queue
spec:
weight: 1
capability:
cpu: "20"
memory: "40Gi"
```
When the total cluster resource is `total_resource`, each queue's deserved value is computed as:
```
queue_deserved = (queue_weight / total_weight) * total_resource
```
where `queue_weight` is the weight of the queue, `total_weight` is the sum of the weights of all queues, and `total_resource` is the total amount of cluster resources.
Compared with the capacity plugin, which configures queues' deserved values directly, the proportion plugin computes them automatically from weight ratios. When cluster resources change (for example, through scaling by Cluster Autoscaler or Karpenter), the proportion plugin automatically recomputes each queue's deserved value according to the weight ratios, without manual intervention.
> **Important**: The actual deserved value is adjusted dynamically: if the computed `queue_deserved` is greater than the total resource request of the pending PodGroups in the queue, the final deserved value is set to that total request. This avoids over-reserving resources and improves overall utilization.
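The weight formula together with the request clamp can be worked through in a few lines (illustrative numbers; the helper name is made up):

```python
# Illustrative computation of proportion-plugin deserved values:
# queue_deserved = (queue_weight / total_weight) * total_resource,
# then clamped to the queue's pending request total to avoid over-reservation.

def deserved_cpus(weights, total_cpu, requests):
    total_weight = sum(weights.values())
    out = {}
    for q, w in weights.items():
        share = w / total_weight * total_cpu       # weight-proportional share
        out[q] = min(share, requests.get(q, 0))    # clamp to pending requests
    return out

# default:test = 1:3 over a 4-CPU cluster; default only requests 1 CPU.
print(deserved_cpus({"default": 1, "test": 3}, total_cpu=4,
                    requests={"default": 1, "test": 6}))
# → {'default': 1.0, 'test': 3.0}
```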
#### Example
The following example shows a typical queue resource management scenario, illustrating the resource reclamation mechanism in four steps:
**Step 1: Initial state**
In the initial cluster state, the default queue can use all resources (4C).
**Step 2: Create the initial jobs**
Create two jobs in the default queue, requesting 1C and 3C respectively:
```yaml
# job1.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: job1
spec:
queue: default
tasks:
- replicas: 1
template:
spec:
containers:
- name: nginx
image: nginx
resources:
requests:
cpu: "1"
---
# job2.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: job2
spec:
queue: default
tasks:
- replicas: 1
template:
spec:
containers:
- name: nginx
image: nginx
resources:
requests:
cpu: "3"
```
At this point both jobs run normally, because resources beyond deserved can be used temporarily.
**Step 3: Create a new queue**
Create the test queue and set the resource ratio. Either the capacity plugin or the proportion plugin can be used:
```yaml
# Queue configuration when using the capacity plugin
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: test
spec:
reclaimable: true
deserved:
cpu: 3
```
```yaml
# Queue configuration when using the proportion plugin
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: test
spec:
reclaimable: true
weight: 3 # resource ratio default:test = 1:3
```
**Step 4: Trigger resource reclamation**
Create job3 in the test queue, requesting 3C (the configuration is similar to job2; just change the queue to test):
```yaml
# job3.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: job3
spec:
queue: test # change the queue to test
tasks:
- replicas: 1
template:
spec:
containers:
- name: nginx
image: nginx
resources:
requests:
cpu: "3"
```
After job3 is submitted, the system starts reclaiming resources:
1. The system reclaims the default queue's resources in excess of its deserved value
2. job2 (3C) is evicted
3. job1 (1C) keeps running
4. job3 (3C) starts running
This scenario applies to both the capacity plugin and the proportion plugin:
* capacity plugin: configure the deserved values directly (default=1C, test=3C)
* proportion plugin: configure the weight values (default=1, test=3), which computes the same deserved values
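The arithmetic behind the reclamation above can be checked directly (a sketch of the numbers, not scheduler code):

```python
# Reclaim arithmetic for the example: the default queue has job1 (1C) and
# job2 (3C) allocated but deserves only 1C; the test queue needs 3C for job3.

default_deserved, default_allocated = 1, 4   # deserved 1C, 4C currently in use
job3_request = 3

over_use = default_allocated - default_deserved   # 3C reclaimable from default
assert over_use >= job3_request                   # enough to admit job3

# Evicting job2 (3C) brings default back to its deserved share
# and frees exactly the CPUs job3 requested.
after_eviction = default_allocated - 3
print(after_eviction == default_deserved, over_use)  # → True 3
```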
> **Note**:
> The capacity plugin and the proportion plugin are mutually exclusive and cannot be used at the same time. Which one to choose mainly depends on whether you want to set resource amounts directly (capacity) or compute them automatically from weights (proportion). Since Volcano v1.9.0, the capacity plugin is recommended because it provides a more intuitive way to configure resources.

+++
title = "Unified Scheduling"
date = 2024-12-30
lastmod = 2024-12-30
draft = false
toc = true
type = "docs"
[menu.docs]
parent = "features"
weight = 3
+++
## Overview
As the industry-leading cloud-native batch system scheduler, Volcano supports all workload types through a unified scheduling system:
- Powerful batch scheduling capability: through VcJob, fully supports mainstream AI and big data frameworks such as Ray, TensorFlow, PyTorch, MindSpore, Spark, and Flink
- Complete Kubernetes workload support: directly schedules native workloads such as Deployment, StatefulSet, Job, and DaemonSet
This unified scheduling capability lets users manage all workload types with a single scheduler, greatly reducing cluster management complexity.
## Compatibility with Kubernetes Scheduling
Volcano is fully compatible with the Kubernetes scheduling mechanism by implementing the two core scheduling plugins predicates and nodeorder, which correspond to the "PreFilter/Filter" and "Score" phases of the Kubernetes scheduling framework, respectively.
### 1. predicates plugin
Volcano fully implements the PreFilter-Filter phases of kube-scheduler, including:
- Basic resource filtering: node schedulability, pod count limits, etc.
- Affinity/anti-affinity: node affinity, inter-pod affinity, etc.
- Resource constraints: node ports, volume limits, etc.
- Topology distribution: pod topology spread constraints, etc.
In addition to compatibility with the Kubernetes filters, Volcano provides the following enhancements:
#### Node Filter Result Cache (PredicateWithCache)
When the scheduler selects a node for a pod, it must run a series of checks against each node (whether resources are sufficient, whether affinity requirements are met, etc.). These node filter results can be cached: if more pods with exactly the same configuration are scheduled later, the earlier check results can be reused directly, avoiding repeated identical node filtering computations and significantly improving scheduling performance when creating pods in batches.
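The idea can be modeled as a memo keyed by (node, pod-spec signature). The following is an illustrative sketch of that caching pattern, not Volcano's PredicateWithCache code:

```python
# Illustrative predicate-result cache: pods sharing an identical spec
# signature reuse prior static filter results per node.

check_runs = 0

def static_predicates(node, pod_sig):
    """Stand-in for static checks (labels, taints); counts real executions."""
    global check_runs
    check_runs += 1
    return pod_sig in node["tolerated_sigs"]

cache = {}

def filter_node(node, pod_sig):
    key = (node["name"], pod_sig)
    if key not in cache:              # first pod of this shape: run the checks
        cache[key] = static_predicates(node, pod_sig)
    return cache[key]                 # later identical pods: cache hit

node = {"name": "node-1", "tolerated_sigs": {"tf-worker"}}
results = [filter_node(node, "tf-worker") for _ in range(100)]  # 100 identical pods
print(all(results), check_runs)  # → True 1  (the checks ran only once)
```

As the doc notes, only static results are safe to memoize this way; dynamic quantities such as free CPU must be recomputed on every cycle, and cache entries must be invalidated when node state changes.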
##### Configuration
Enable the cache in volcano-scheduler-configmap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: volcano-scheduler-configmap
namespace: volcano-system
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: predicates
arguments:
predicate.CacheEnable: true # enable node filter result caching
```
##### Use Cases
1. Creating many pods with identical configuration in batches
- For example: creating multiple identical TensorFlow training tasks
- After the first pod completes node filtering, subsequent pods can reuse the cached results directly
2. Scheduling optimization for large clusters
> **Note**:
>
> - Only static check results are cached (such as node labels and taints)
> - Checks involving dynamic resources (such as CPU and memory usage) are recomputed every time
> - When a node's state changes, the related cache entries are invalidated automatically
### 2. nodeorder plugin
While remaining fully compatible with the default Kubernetes scoring mechanism, Volcano enables more flexible node selection strategies through a configurable weight system. Volcano also scores nodes in parallel, significantly improving scheduling efficiency in large clusters, which is especially suitable for batch scenarios such as AI training.
#### Supported Scoring Dimensions
1. **Resource dimensions**
- `leastrequested`: prefers nodes with low resource usage, suitable for spreading workloads
- `mostrequested`: prefers nodes with high resource usage, suitable for packing workloads
- `balancedresource`: seeks a balance across CPU, memory, and other resources, avoiding a single-resource bottleneck
2. **Affinity dimensions**
- `nodeaffinity`: scores nodes according to node affinity rules
- `podaffinity`: scores nodes according to inter-pod affinity rules
- `tainttoleration`: scores nodes according to node taints and pod tolerations
3. **Other dimensions**
- `imagelocality`: prefers nodes that already have the required container images
- `podtopologyspread`: ensures pods are evenly spread across topology domains (such as availability zones)
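Under the weight system described above, a node's final score comes down to a weighted sum of the per-dimension scores. A minimal sketch with made-up scores:

```python
# Illustrative weighted node scoring: each dimension yields a score for a
# node; the node total is the weight-scaled sum, and the highest total wins.

def total_score(dim_scores, weights):
    return sum(weights.get(dim, 0) * s for dim, s in dim_scores.items())

weights = {"leastrequested": 1, "nodeaffinity": 2, "tainttoleration": 3}

nodes = {
    "node-1": {"leastrequested": 80, "nodeaffinity": 50, "tainttoleration": 100},
    "node-2": {"leastrequested": 90, "nodeaffinity": 100, "tainttoleration": 50},
}

scores = {n: total_score(s, weights) for n, s in nodes.items()}
best = max(scores, key=scores.get)
print(scores, best)  # → {'node-1': 480, 'node-2': 440} node-1
```

Raising `tainttoleration.weight` relative to the others, as in this example, is what makes node-1 win despite node-2's stronger affinity score.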
#### Configuration Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: volcano-scheduler-configmap
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: nodeorder
arguments:
# Resource dimension weights
leastrequested.weight: 1 # Default weight is 1
mostrequested.weight: 0 # Default weight is 0 (disabled by default)
balancedresource.weight: 1 # Default weight is 1
# Affinity dimension weights
nodeaffinity.weight: 2 # Default weight is 2
podaffinity.weight: 2 # Default weight is 2
tainttoleration.weight: 3 # Default weight is 3
# Other dimension weights
imagelocality.weight: 1 # Default weight is 1
podtopologyspread.weight: 2 # Default weight is 2
```
## Unified Scheduling Configuration
By setting `schedulerName: volcano`, Volcano can schedule Kubernetes native workloads and Volcano workloads in a unified way.
### Kubernetes Native Workloads
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: test
spec:
replicas: 1
template:
spec:
schedulerName: volcano # Specify the Volcano scheduler
...
```
### Volcano Workloads
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: test
spec:
minAvailable: 1
schedulerName: volcano # Volcano workloads use the volcano scheduler by default
...
```
### Advantages of Unified Scheduling
As a general-purpose batch computing system, Volcano extends Kubernetes native scheduling capabilities with the following key advantages:
#### 1. Rich Ecosystem Support
* **Complete Framework Support**
- Supports mainstream AI training frameworks including Ray, TensorFlow, PyTorch, MindSpore
- Supports big data processing frameworks like Spark, Flink
- Supports high-performance computing frameworks like MPI
* **Heterogeneous Device Support**
- Supports GPU (CUDA/MIG) scheduling
- Supports NPU scheduling
#### 2. Enhanced Scheduling Capabilities
* **Gang Scheduling**
- Supports job-level scheduling
- Prevents resource fragmentation
- Suitable for distributed training scenarios
* **Queue Resource Management**
- Supports multi-tenant resource isolation
- Supports resource borrowing and reclamation between queues
- Supports resource quota management
#### 3. Unified Resource Management
* **Unified Resource View**
- Unified management of CPU, memory, GPU/NPU and other heterogeneous resources
- Implements resource sharing and isolation
- Improves overall resource utilization

Binary file not shown.