From 12ae2a2d7ac7c1a470534ac1e1948e550fcf2d7d Mon Sep 17 00:00:00 2001
From: Zhuohan Li
Date: Thu, 5 Sep 2024 09:54:01 -0700
Subject: [PATCH] small fix

---
 _posts/2024-09-05-perf-update.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_posts/2024-09-05-perf-update.md b/_posts/2024-09-05-perf-update.md
index 69d19ac..721180e 100644
--- a/_posts/2024-09-05-perf-update.md
+++ b/_posts/2024-09-05-perf-update.md
@@ -62,10 +62,10 @@ Even after splitting these two processes, we find there’s still much room for
 </p>
-Illustration of the multistep scheduling method in vLLM. By batching multiple scheduling steps at one, we keep the GPU busier than before, therefore reducing latency and improve throughput.
+Illustration of the multistep scheduling method in vLLM. By batching multiple scheduling steps at once, we keep the GPU busier than before, thereby reducing latency and improving throughput.
 </p>

-We identified that the CPU overhead from vLLM’s scheduler and input preparation was leading to GPU underutilization, resulting in suboptimal throughput. To tackle this, we introduced multi-step scheduling, which performs scheduling and input preparation once and runs the model for `n` consecutive steps. By ensuring that the GPU can continue processing between the `n` steps without waiting for the CPU, this approach spreads the CPU overhead across multiple steps, significantly reducing GPU idle time and boosting overall performance.
+We identified that the CPU overhead from vLLM’s scheduler and input preparation was leading to GPU underutilization, resulting in suboptimal throughput. To tackle this, we introduced *multi-step scheduling*, which performs scheduling and input preparation once and runs the model for `n` consecutive steps. By ensuring that the GPU can continue processing between the `n` steps without waiting for the CPU, this approach spreads the CPU overhead across multiple steps, significantly reducing GPU idle time and boosting overall performance.
 This improves the throughput of running Llama 70B models on 4xH100 by 28%.
@@ -84,7 +84,7 @@ Continuing our efforts to maximize GPU utilization, we also revamped how the mod
 Previously, after generating each token, vLLM moved the model output from GPU to CPU, checked the stopping criteria to determine if the request had finished, and then executed the next step. This output processing was often slow, involving de-tokenizing the generated token IDs and performing string matching, with the overhead increasing as batch sizes grew.
-To address this inefficiency, we introduced asynchronous output processing, which overlaps the output processing with model execution. Instead of processing the output immediately, vLLM now delays it, performing the processing of the `n`-th step output while executing the `n+1`-th step. This approach assumes that no request from the `n`-th step has met the stopping criteria, incurring a slight overhead of executing one additional step per request. However, the significant boost in GPU utilization more than offsets this cost, leading to improved overall performance.
+To address this inefficiency, we introduced *asynchronous output processing*, which overlaps the output processing with model execution. Instead of processing the output immediately, vLLM now delays it, performing the processing of the `n`-th step output while executing the `n+1`-th step. This approach assumes that no request from the `n`-th step has met the stopping criteria, incurring a slight overhead of executing one additional step per request. However, the significant boost in GPU utilization more than offsets this cost, leading to improved overall performance.
 This improves the time-per-output-token of running Llama 70B models on 4xH100 by 8.7%.
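
To make the amortization behind multi-step scheduling concrete, here is a minimal, runnable Python sketch of the idea described in the patched paragraph. The names (`schedule_and_prepare`, `run_model_step`, `generate_multi_step`) and the sleep-based costs are illustrative stand-ins, not vLLM's actual scheduler or worker APIs.

```python
# Toy comparison: scheduling/input prep once per step vs. once per n steps.
# All functions are hypothetical stand-ins; the sleeps model relative CPU/GPU cost.
import time


def schedule_and_prepare(batch):
    """Stand-in for the CPU-side scheduler + input preparation (the overhead to amortize)."""
    time.sleep(0.002)  # pretend this costs ~2 ms of CPU time
    return {"token_ids": batch}


def run_model_step(inputs):
    """Stand-in for one forward pass producing one new token per sequence."""
    time.sleep(0.001)  # pretend the forward pass takes ~1 ms
    return [t + 1 for t in inputs["token_ids"]]


def generate_single_step(batch, num_tokens):
    """Baseline: schedule + prepare inputs before every single step."""
    for _ in range(num_tokens):
        inputs = schedule_and_prepare(batch)
        batch = run_model_step(inputs)
    return batch


def generate_multi_step(batch, num_tokens, n=8):
    """Multi-step: schedule + prepare once, then run `n` consecutive model steps."""
    steps_done = 0
    while steps_done < num_tokens:
        inputs = schedule_and_prepare(batch)
        for _ in range(min(n, num_tokens - steps_done)):
            batch = run_model_step(inputs)
            inputs = {"token_ids": batch}  # updates stay on the "device" side
            steps_done += 1
    return batch


if __name__ == "__main__":
    batch = list(range(4))
    for fn in (generate_single_step, generate_multi_step):
        start = time.perf_counter()
        fn(list(batch), num_tokens=64)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

With these made-up costs, the multi-step variant pays the scheduling overhead once per 8 steps instead of once per step, which is the same mechanism the paragraph credits for the throughput gain.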
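Similarly, the overlap described for asynchronous output processing can be sketched with a single background worker: the output of step `n` is processed while step `n+1` runs. Again, `run_model_step` and `process_outputs` are hypothetical stand-ins, not vLLM's real worker or detokenizer code, and the sketch omits stop-criteria handling.

```python
# Toy sketch of overlapping output processing with model execution.
import time
from concurrent.futures import ThreadPoolExecutor


def run_model_step(step):
    time.sleep(0.001)   # pretend forward pass for one decoding step
    return [step] * 4   # fake token ids for a batch of 4 sequences


def process_outputs(token_ids):
    time.sleep(0.0005)  # pretend detokenization + stop-string checks
    return [str(t) for t in token_ids]


def generate(num_steps):
    results = []
    pending = None  # future holding the previous step's output processing
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            # This step's forward pass runs while the previous step's output
            # is still being processed on the worker thread.
            token_ids = run_model_step(step)
            if pending is not None:
                results.append(pending.result())  # collect the previous step
            # Defer this step's output processing so it overlaps with the
            # next forward pass instead of blocking it.
            pending = pool.submit(process_outputs, token_ids)
        if pending is not None:
            results.append(pending.result())      # drain the final step
    return results


if __name__ == "__main__":
    print(len(generate(16)), "steps processed")
```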