commit 4cf76f3c75
parent ce983a38d5
Author: WoosukKwon <woosuk.kwon@berkeley.edu>
Date:   2025-01-24 14:07:23 -08:00

Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>

2 changed files with 5 additions and 1 deletion


@@ -32,7 +32,11 @@ vLLM V1 introduces a comprehensive re-architecture of its core components, inclu
## 1. Optimized Execution Loop & API Server
-![][image2]
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_server_architecture.png" width="80%">
+</picture>
+</p>
As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM's core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs become faster and model execution times shrink, the CPU overhead of tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models such as Llama-8B running on NVIDIA H100 GPUs, where GPU execution time is as low as ~5 ms.
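To make the scale of this overhead concrete, here is a minimal, self-contained sketch (not vLLM's actual implementation; the ~4 ms CPU figure is an assumed value for illustration, as is the `serial_engine_loop` helper): when per-step CPU work and the GPU forward pass run back to back, the GPU sits idle for a large fraction of each iteration.

```python
# Toy sketch (not vLLM code) of why per-step CPU work dominates once the
# GPU forward pass drops to ~5 ms. Both timings are illustrative assumptions.
import time

GPU_STEP_S = 0.005  # assumed forward-pass time (e.g., Llama-8B on an H100)
CPU_STEP_S = 0.004  # assumed per-step CPU work: scheduling, input prep, detokenization

def serial_engine_loop(steps: int) -> float:
    """Run CPU work and the GPU step back to back, as a naive loop would."""
    start = time.perf_counter()
    for _ in range(steps):
        time.sleep(CPU_STEP_S)  # stand-in for scheduling/detokenizing/streaming
        time.sleep(GPU_STEP_S)  # stand-in for the model forward pass
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = serial_engine_loop(100)
    gpu_busy = 100 * GPU_STEP_S / elapsed
    print(f"100 steps in {elapsed:.2f}s; GPU busy ~{gpu_busy:.0%} of the time")
```

In this toy setup the GPU is busy only a bit over half the time, which is the idle fraction the V1 re-architecture aims to recover by overlapping CPU work with GPU execution.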

assets/figures/v1/v1_server_architecture.png: new binary file (300 KiB), not shown.