diff --git a/_posts/2025-01-26-v1.md b/_posts/2025-01-26-v1.md
index dc2f305..ef87f42 100644
--- a/_posts/2025-01-26-v1.md
+++ b/_posts/2025-01-26-v1.md
@@ -58,14 +58,13 @@ vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d
 
 vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to rather decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%.
 
-Here are some benchmark results. In our experiments, we observed that V1's perfix caching causes less than 1% decrease in throughput even when the cache hit rate is 0%, while it improves the performance several times when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
-
 [figure: /assets/figures/v1/v1_prefix_caching.png]
 
+Here are some benchmark results. In our experiments, we observed that V1's prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
+
 ## 4. Clean Architecture for Tensor-Parallel Inference
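To make the constant-time eviction described in the hunk above concrete, here is a minimal sketch, not vLLM's actual implementation: each full block of tokens is keyed by a hash chained from its parent block, so requests sharing a prompt prefix map to the same entries, and blocks that are no longer referenced sit in a Python `OrderedDict` that serves as the LRU structure, giving O(1) lookup and O(1) eviction. The `PrefixCache` class, `BLOCK_SIZE`, and the method names are illustrative assumptions, and reference counting is omitted.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)


class PrefixCache:
    """Toy hash-based prefix cache with constant-time LRU eviction."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))               # unused physical block ids
        self.cached: Dict[int, int] = {}                         # block hash -> block id
        self.evictable: "OrderedDict[int, int]" = OrderedDict()  # LRU: hash -> block id

    @staticmethod
    def block_hash(parent_hash: Optional[int], tokens: Tuple[int, ...]) -> int:
        # Only full blocks are cached; chaining the parent hash makes the key
        # depend on the entire prefix, not just this block's tokens.
        assert len(tokens) == BLOCK_SIZE
        return hash((parent_hash, tokens))

    def lookup(self, h: int) -> Optional[int]:
        """Return the cached block for this prefix hash, if any (O(1))."""
        block_id = self.cached.get(h)
        if block_id is not None:
            # A hit means the block is in use again, so it is no longer evictable.
            self.evictable.pop(h, None)
        return block_id

    def allocate(self, h: int) -> int:
        """Allocate a block for a new prefix hash, evicting the LRU block if needed."""
        if not self.free_blocks:
            # popitem(last=False) removes the least recently released block in O(1).
            evicted_hash, evicted_id = self.evictable.popitem(last=False)
            del self.cached[evicted_hash]
            self.free_blocks.append(evicted_id)
        block_id = self.free_blocks.pop()
        self.cached[h] = block_id
        return block_id

    def release(self, h: int, block_id: int) -> None:
        """Mark a cached block as evictable once no request references it."""
        self.evictable[h] = block_id
```

Because a hit only removes one entry from the evictable structure and a miss pops at most one item from it, the bookkeeping cost stays flat regardless of hit rate, which is what makes leaving the feature on by default (it can still be toggled with the `enable_prefix_caching` engine argument) essentially free even at a 0% hit rate.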

@@ -113,16 +112,26 @@ The final piece of the puzzle for vLLM V1 was integrating [FlashAttention 3](htt
 
 # Performance
 
-Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **x** higher throughput compared to V0 (*without multi-step scheduling*).
+Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x** higher throughput compared to V0 (*without multi-step scheduling*). These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
 The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.
 
-- **Llama 3.1 8B, 1xH100**
+- **Llama 3.1 8B & Llama 3.3 70B**
+
+[figure: /assets/figures/v1/v1_llama.png]
 
-- **Llama 3.3 70B, 4xH100**
 
 - **Qwen2-VL (VLM), 1xH100**
 
+[figure: /assets/figures/v1/v1_qwen2vl.png]
+
 # Limitations & Future Work
diff --git a/assets/figures/v1/v1_llama.png b/assets/figures/v1/v1_llama.png
new file mode 100644
index 0000000..e228e60
Binary files /dev/null and b/assets/figures/v1/v1_llama.png differ
diff --git a/assets/figures/v1/v1_prefix_caching.png b/assets/figures/v1/v1_prefix_caching.png
index 0265dd0..1dcd341 100644
Binary files a/assets/figures/v1/v1_prefix_caching.png and b/assets/figures/v1/v1_prefix_caching.png differ
diff --git a/assets/figures/v1/v1_qwen2vl.png b/assets/figures/v1/v1_qwen2vl.png
new file mode 100644
index 0000000..fdec227
Binary files /dev/null and b/assets/figures/v1/v1_qwen2vl.png differ
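For readers who want to sanity-check the V0-vs-V1 throughput comparison above on their own hardware, here is a rough offline sketch using vLLM's Python API. The model name, prompt set, request count, and sampling settings are illustrative placeholders rather than the configuration behind the figures, and the `VLLM_USE_V1` environment variable is assumed to be the switch between the two engines.

```python
# Rough, hypothetical throughput comparison between the V0 and V1 engines.
# Assumptions: a recent vLLM install, a single H100-class GPU, and that the
# VLLM_USE_V1 environment variable selects the engine.
import os
import time

os.environ["VLLM_USE_V1"] = "1"  # set to "0" to run the same script on V0

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")          # placeholder model
prompts = ["Explain how KV-cache paging works."] * 256        # placeholder workload
params = SamplingParams(temperature=0.8, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

num_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generated {num_tokens} tokens in {elapsed:.1f}s "
      f"({num_tokens / elapsed:.1f} output tokens/s)")
```

This offline measurement is only a rough proxy for the serving benchmarks behind the figures, but running the same script with the environment variable flipped is enough to observe the relative gap between the two engines on the same hardware.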