Add Qwen2 fig

Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
WoosukKwon 2025-01-26 19:29:27 -08:00
parent 30d4e438ca
commit 1aef324e23
4 changed files with 15 additions and 6 deletions


@@ -58,14 +58,13 @@ vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d
 vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to degraded performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1's prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%.
+Here are some benchmark results. In our experiments, we observed that V1's prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
 <p align="center">
 <picture>
-<img src="/assets/figures/v1/v1_prefix_caching.png" width="100%">
+<img src="/assets/figures/v1/v1_prefix_caching.png" width="90%">
 </picture>
 </p>
-Here are some benchmark results. In our experiments, we observed that V1's prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
 ## 4. Clean Architecture for Tensor-Parallel Inference
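The constant-time eviction claim in the hunk above can be pictured with a small, hypothetical sketch; this is illustrative only, not vLLM's actual implementation. The idea: fixed-size blocks of token IDs are keyed by a hash that chains in the entire preceding prefix, and an `OrderedDict` gives O(1) lookup, LRU touch, and eviction. Reference counting and KV-block memory management, which a real engine needs, are omitted, and the `PrefixCache` name and `block_size` default are made up for the example.

```python
# Illustrative sketch (not vLLM's code): hash-based prefix caching with
# constant-time LRU eviction. Each fixed-size block of token IDs is keyed by
# a hash chained over all prior blocks, so requests sharing a prefix can
# reuse cached KV blocks. OrderedDict provides O(1) lookup, move-to-end,
# and least-recently-used eviction.
from collections import OrderedDict
from typing import Iterator, List, Optional


class PrefixCache:
    def __init__(self, capacity: int, block_size: int = 16):
        self.capacity = capacity
        self.block_size = block_size
        self._blocks: "OrderedDict[int, int]" = OrderedDict()  # block hash -> KV block id

    def _block_hashes(self, token_ids: List[int]) -> Iterator[int]:
        # Chain the hash so each block's key also encodes its full prefix.
        prefix_hash: Optional[int] = None
        for start in range(0, len(token_ids) - self.block_size + 1, self.block_size):
            block = tuple(token_ids[start:start + self.block_size])
            prefix_hash = hash((prefix_hash, block))
            yield prefix_hash

    def lookup(self, token_ids: List[int]) -> List[int]:
        """Return KV block ids for the longest cached prefix of token_ids."""
        hits: List[int] = []
        for h in self._block_hashes(token_ids):
            if h not in self._blocks:
                break
            self._blocks.move_to_end(h)  # O(1) LRU touch
            hits.append(self._blocks[h])
        return hits

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        """Record newly computed KV blocks, evicting the LRU entry in O(1) when full."""
        for h, block_id in zip(self._block_hashes(token_ids), block_ids):
            if h in self._blocks:
                self._blocks.move_to_end(h)
                continue
            if len(self._blocks) >= self.capacity:
                self._blocks.popitem(last=False)  # drop least recently used
            self._blocks[h] = block_id
```

Even at a 0% hit rate, a scheme like this costs only a handful of hash lookups per request, which is consistent with the near-zero-overhead claim made in the post.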
@@ -113,16 +112,26 @@ The final piece of the puzzle for vLLM V1 was integrating [FlashAttention 3](htt
 # Performance
-Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **x** higher throughput compared to V0 (*without multi-step scheduling*).
+Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x** higher throughput compared to V0 (*without multi-step scheduling*).
 These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
 The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.
-- **Llama 3.1 8B, 1xH100**
-- **Llama 3.3 70B, 4xH100**
+- **Llama 3.1 8B & Llama 3.3 70B**
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_llama.png" width="100%">
+</picture>
+</p>
 - **Qwen2-VL (VLM), 1xH100**
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_qwen2vl.png" width="50%">
+</picture>
+</p>
 # Limitations & Future Work
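For readers who want a rough, back-of-the-envelope throughput comparison rather than the full benchmark suite behind the figures above, a script like the following can be run once per engine. Everything here is a placeholder assumption: the model name, prompt set, and sampling parameters are not the benchmark configuration behind the reported 1.7x number, and selecting the engine via the `VLLM_USE_V1` environment variable is assumed to match the opt-in mechanism for the V1 alpha.

```python
# Rough, illustrative throughput measurement; not the benchmark behind the
# figures in this post. Assumes the V1 engine is selected with VLLM_USE_V1,
# which must be set before vllm is imported.
import os
import time

os.environ.setdefault("VLLM_USE_V1", "1")  # set to "0" to measure the V0 engine

from vllm import LLM, SamplingParams

# Hypothetical workload: model and prompts are placeholders.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Summarize the history of the Roman Empire."] * 256
params = SamplingParams(temperature=0.8, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```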

Binary file not shown. (image added: 275 KiB)

Binary file not shown. (image changed: 122 KiB before, 124 KiB after)

Binary file not shown. (image added: 170 KiB)