## 1. Optimized Execution Loop & API Server
<p align="center">
<picture>
<img src="/assets/figures/v1/v1_server_architecture.png" width="80%">
</picture>
</p>
As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM relies on CPU operations in its core execution loop to manage request states between model forward passes. As GPUs get faster and model execution times shrink, the CPU overhead of tasks like running the API server, scheduling work, preparing inputs, detokenizing outputs, and streaming responses to users becomes increasingly pronounced. The issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where a single forward pass on the GPU can take as little as ~5ms.
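
To make the problem and the remedy concrete, below is a minimal sketch of a process-isolated execution loop in the spirit of V1's design: the model-stepping loop runs in a child process while the parent plays the API-server role, so tokenization, detokenization, and streaming overlap with GPU work instead of sitting between forward passes. This is not vLLM's actual code; the transport (plain `multiprocessing` queues) and all names (`run_engine_core`, the `None` shutdown sentinel, the fake outputs) are hypothetical.

```python
import multiprocessing as mp


def run_engine_core(request_q: mp.Queue, output_q: mp.Queue) -> None:
    """Hypothetical engine-core loop: batch queued requests, step the model."""
    while True:
        req = request_q.get()            # block until work arrives
        if req is None:                  # shutdown sentinel
            return
        batch = [req]
        while not request_q.empty():     # opportunistically batch waiting requests
            nxt = request_q.get()
            if nxt is None:
                return
            batch.append(nxt)
        # Placeholder for scheduling plus a ~5ms GPU forward pass.
        output_q.put([f"token_for({r})" for r in batch])


if __name__ == "__main__":
    request_q: mp.Queue = mp.Queue()
    output_q: mp.Queue = mp.Queue()
    core = mp.Process(target=run_engine_core, args=(request_q, output_q))
    core.start()

    # The parent acts as the API server: it submits requests, then
    # detokenizes and streams outputs while the core keeps the GPU busy.
    request_q.put("prompt-1")
    print(output_q.get())

    request_q.put(None)                  # stop the engine core
    core.join()
```

The point of the split is that once the two loops live in separate processes, per-request CPU work is hidden behind GPU execution rather than added to every iteration of the loop.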