Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
This commit is contained in:
WoosukKwon 2025-01-26 20:34:12 -08:00
parent f90c18079b
commit e83a5be9ff
1 changed files with 19 additions and 19 deletions

View File

@ -112,7 +112,7 @@ The final piece of the puzzle for vLLM V1 was integrating [FlashAttention 3](htt
# Performance # Performance
Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x** higher throughput compared to V0 (*without multi-step scheduling*). Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x higher throughput** compared to V0 (*without multi-step scheduling*).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack. These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs. The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.
@ -185,7 +185,7 @@ The V1 re-architecture is a continued joint effort across the entire vLLM team a
- [Tyler Michael Smith](https://github.com/tlrmchlsmth) implemented the tensor parallelism support with Python multiprocessing. - [Tyler Michael Smith](https://github.com/tlrmchlsmth) implemented the tensor parallelism support with Python multiprocessing.
- [Rui Qiao](https://github.com/ruisearch42) implemented the tensor parallelism support with Ray and is implementing pipeline parallelism support. - [Rui Qiao](https://github.com/ruisearch42) implemented the tensor parallelism support with Ray and is implementing pipeline parallelism support.
- [Lucas Wilkinson](https://github.com/LucasWilkinson) added support for FlashAttention 3. - [Lucas Wilkinson](https://github.com/LucasWilkinson) added support for FlashAttention 3.
- [Alexander Matveev](https://github.com/alexm-redhat) implemented the optimized preprocessor for multimodal inputs. - [Alexander Matveev](https://github.com/alexm-redhat) implemented the optimized preprocessor for multimodal inputs and is implementing TPU support.
- [Sourashis Roy](https://github.com/sroy745) implemented the logit penalties in the sampler. - [Sourashis Roy](https://github.com/sroy745) implemented the logit penalties in the sampler.
- [Cyrus Leung](https://github.com/DarkLight1337) led the MLLM input processing refactoring effort and helped its integration to V1. - [Cyrus Leung](https://github.com/DarkLight1337) led the MLLM input processing refactoring effort and helped its integration to V1.
- [Russell Bryant](https://github.com/russellb) addressed several multiprocess-related issues. - [Russell Bryant](https://github.com/russellb) addressed several multiprocess-related issues.