3.1 KiB
Background
Volcano scheduler handles jobs requiring different types of resources, such as GPU, CPU, memory. Under particular circumstances, we may specify a 'primary' resource(e.g., GPU in deep learning), and preserve the amount of associated 'secondary' resources by a pre-set proportion. This plugin works in the phase of predicates, dedicates to ensure the node's idle resource is enough for the proportion after jobs requiring secondary resources are scheduled.
Scenario of default scheduler
Considering we have a node with 74CPUs, 8GPUs, 128G memory. As no job is submitted, resource NodeAllocatable is equal to NodeIdle.
| Node | NodeAllocatable | NodeIdle |
|---|---|---|
| nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 74, memory 128G, nvidia.com/gpu 8 |
Then two jobs requiring 8CPUs, 8G memory are submitted, and scheduled to the node; the resource status is as below:
| Job | Pod | Resource | Node | NodeAllocatable | NodeIdle |
|---|---|---|---|---|---|
| default/single-1000-0 | single-1000-0 | cpu 8, memory 8G, nvidia.com/gpu 0 | nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 66, memory 120G, nvidia.com/gpu 8 |
| default/single-1000-1 | single-1000-1 | cpu 8, memory 8G, nvidia.com/gpu 0 | nodeC0-0 | cpu 66, memory 120G, nvidia.com/gpu 8 | cpu 58, memory 112G, nvidia.com/gpu 8 |
If we take GPU as primary resource and want to use 1GPU 'binded' with 8CPUs, the left 58 CPUs are insufficent for 8 GPUs; the proportion plugin is designed to solve this problem.
with proportion plugin
Firstly set the proportion binding in volcano-scheduler.conf:
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: predicates
arguments:
predicate.ProportionalEnable: true
predicate.resources: nvidia.com/gpu,nvidia.com/v100-sxm2-16gb
predicate.resources.nvidia.com/gpu.cpu: 8
predicate.resources.nvidia.com/gpu.memory: 8
predicate.resources.nvidia.com/v100-sxm2-16gb.cpu: 16
predicate.resources.nvidia.com/v100-sxm2-16gb.memory: 16
The proportion is GPU:CPU:MEMORY=1:8:8, and let the test scenario just as above:
| Node | NodeAllocatable | NodeIdle |
|---|---|---|
| nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 74, memory 128G, nvidia.com/gpu 8 |
| Job | Pod | Resource | Node | NodeAllocatable | NodeIdle |
|---|---|---|---|---|---|
| default/single-1000-0 | single-1000-0 | cpu 8, memory 8G, nvidia.com/gpu 0 | nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 66, memory 120G, nvidia.com/gpu 8 |
| default/single-1000-1 | single-1000-1 | cpu 8, memory 8G, nvidia.com/gpu 0 | - | - | - |
After job single-1000-0 is scheduled, the Idel resouce is 8GPUs, 66CPUs, 120G memory. During the predicate phase, this plugin caculates the resource left if job single-1000-1 is scheduled(node.Idel.CPU - task.Resreq.CPU < node.Idel.GPU * cpuRatio || node.Idel.Memory - task.Resreq.Memory < node.Idel.GPU * memoryRatio); the result is 8GPUs, 58CPUs, 112G memory, that unsatisfies the 1:8:8 proportion. Therefore nodeC0-0 is removed from the predicateNodes, and NodeIdle remains 8GPUs, 66CPUs, 120G memory.
