Add Qwen2 fig
Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
parent 30d4e438ca
commit 1aef324e23

@@ -58,14 +58,13 @@ vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d

vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes incurs significant CPU overhead, which can actually reduce performance when the cache hit rate is low, so it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. As a result, V1's prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%.

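To make the constant-time eviction idea concrete, here is a minimal, illustrative sketch (not vLLM's actual implementation; all names are made up) of a free-block pool that keys cached KV blocks on their prefix hash and supports O(1) hit lookup and O(1) LRU eviction:

```python
from collections import OrderedDict
from typing import Optional


class FreeKVBlockPool:
    """Illustrative only: an LRU pool of free KV-cache blocks keyed by prefix hash.

    An OrderedDict gives O(1) lookup, O(1) insertion, and O(1) eviction of the
    least-recently-freed block, so eviction cost does not grow with cache size.
    """

    def __init__(self) -> None:
        self._free = OrderedDict()  # prefix_hash -> block_id

    def release(self, prefix_hash: int, block_id: int) -> None:
        # A block freed by a finished request stays cached until it is evicted,
        # so a later request with the same prefix can reuse it.
        self._free[prefix_hash] = block_id
        self._free.move_to_end(prefix_hash)

    def lookup(self, prefix_hash: int) -> Optional[int]:
        # Cache hit: take the block back without recomputing its KV entries.
        return self._free.pop(prefix_hash, None)

    def evict_lru(self) -> Optional[int]:
        # No cache hit and no fresh blocks left: evict the oldest free block in O(1).
        if not self._free:
            return None
        _, block_id = self._free.popitem(last=False)
        return block_id
```

The real scheduler additionally has to track reference counts for blocks shared by in-flight requests, which this sketch omits.
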
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_prefix_caching.png" width="90%">
</picture>
</p>

The benchmark results above illustrate this: V1's prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**

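Because prefix caching is now on by default, no configuration is needed in the common case. If you want to toggle it explicitly, something like the following should work (the `enable_prefix_caching` engine argument and the model name are examples; check the flags of your installed vLLM version):

```python
from vllm import LLM, SamplingParams

# Prefix caching is on by default in V1; pass the flag only to opt out
# (enable_prefix_caching=False) or to opt in explicitly on older versions.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

# Requests that share a long prefix (e.g., the same system prompt) reuse its
# cached KV blocks instead of recomputing them.
shared_prefix = "You are a meticulous financial analyst. " * 50
prompts = [shared_prefix + "Summarize Q3.", shared_prefix + "Summarize Q4."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```
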

## 4. Clean Architecture for Tensor-Parallel Inference

@@ -113,16 +112,26 @@ The final piece of the puzzle for vLLM V1 was integrating [FlashAttention 3](htt

# Performance

Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x** higher throughput compared to V0 (*without multi-step scheduling*).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced multimodal support.

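For a rough point of comparison on your own hardware, a sketch like the one below can be used (the published figures come from vLLM's own benchmark scripts, not from this snippet; the `VLLM_USE_V1` environment variable is the opt-in switch during the V1 alpha, and the model and workload are placeholders). The configurations we measured follow after the sketch.

```python
import os
import time

# Select the engine before importing vllm; set to "0" to run the V0 engine.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # example model
prompts = ["Explain how KV-cache paging works."] * 256    # example workload
params = SamplingParams(max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```
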
- **Llama 3.1 8B & Llama 3.3 70B**

<p align="center">
<picture>
<img src="/assets/figures/v1/v1_llama.png" width="100%">
</picture>
</p>

- **Qwen2-VL (VLM), 1xH100**

<p align="center">
<picture>
<img src="/assets/figures/v1/v1_qwen2vl.png" width="50%">
</picture>
</p>

# Limitations & Future Work

Binary file added (image): 275 KiB
Binary file changed (image): 122 KiB → 124 KiB
Binary file added (image): 170 KiB