Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
WoosukKwon 2025-01-24 14:00:27 -08:00
parent f0a7b016bb
commit ce983a38d5
1 changed file with 1 addition and 1 deletion

@@ -46,7 +46,7 @@ In the [v0.6.0 release](https://blog.vllm.ai/2024/09/05/perf-update.html), vLLM
</picture>
</p>
-vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., `{request_id: num_tokens}`, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding.
+vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., `{request_id: num_tokens}`, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).
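
To make the representation concrete, here is a minimal sketch of how a fixed per-step token budget can be turned into a `{request_id: num_tokens}` decision. This is an illustration, not vLLM's actual scheduler: the `Request` fields, the `TOKEN_BUDGET` value, and the `schedule` function are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int        # user-given prompt tokens
    num_computed_tokens: int = 0  # prompt/output tokens already processed

TOKEN_BUDGET = 8192  # hypothetical per-step token budget

def schedule(requests: list[Request]) -> dict[str, int]:
    """Return a {request_id: num_tokens} scheduling decision for one step."""
    budget = TOKEN_BUDGET
    decisions: dict[str, int] = {}
    for req in requests:
        if budget == 0:
            break
        # Tokens still needed: the rest of the prompt, or one output token
        # once the prompt has been fully processed (i.e., decoding).
        remaining_prompt = req.num_prompt_tokens - req.num_computed_tokens
        num_new_tokens = remaining_prompt if remaining_prompt > 0 else 1
        # Chunked prefill falls out naturally: a long prompt is simply capped
        # at the remaining budget and continues in the next step.
        num_new_tokens = min(num_new_tokens, budget)
        decisions[req.request_id] = num_new_tokens
        budget -= num_new_tokens
    return decisions

# One decoding request and one long prompt share the same step:
reqs = [
    Request("B", num_prompt_tokens=5, num_computed_tokens=5),  # decoding
    Request("A", num_prompt_tokens=10_000),                    # long prefill
]
print(schedule(reqs))  # {'B': 1, 'A': 8191}
```

Because prompt and output tokens are treated uniformly, the same loop covers both prefill chunks and single-token decode steps.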
## 3. Fast Prefix Caching