commit 4cf76f3c75
parent ce983a38d5
Author: WoosukKwon <woosuk.kwon@berkeley.edu>
Date:   2025-01-24 14:07:23 -08:00

Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>

2 changed files with 5 additions and 1 deletion


@@ -32,7 +32,11 @@ vLLM V1 introduces a comprehensive re-architecture of its core components, inclu
## 1. Optimized Execution Loop & API Server
-![][image2]
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_server_architecture.png" width="80%">
+</picture>
+</p>
As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM's core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs become faster and model execution times shrink, the CPU overhead of tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models such as Llama-8B running on NVIDIA H100 GPUs, where GPU execution time is as low as ~5 ms.
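To make the scale of this overhead concrete, here is a minimal, self-contained sketch (not vLLM's actual implementation; the ~4 ms CPU figure is an assumed value for illustration, as is the `serial_engine_loop` helper): when per-step CPU work and the GPU forward pass run back to back, the GPU sits idle for a large fraction of each iteration.

```python
# Toy sketch (not vLLM code) of why per-step CPU work dominates once the
# GPU forward pass drops to ~5 ms. Both timings are illustrative assumptions.
import time

GPU_STEP_S = 0.005  # assumed forward-pass time (e.g., Llama-8B on an H100)
CPU_STEP_S = 0.004  # assumed per-step CPU work: scheduling, input prep, detokenization

def serial_engine_loop(steps: int) -> float:
    """Run CPU work and the GPU step back to back, as a naive loop would."""
    start = time.perf_counter()
    for _ in range(steps):
        time.sleep(CPU_STEP_S)  # stand-in for scheduling/detokenizing/streaming
        time.sleep(GPU_STEP_S)  # stand-in for the model forward pass
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = serial_engine_loop(100)
    gpu_busy = 100 * GPU_STEP_S / elapsed
    print(f"100 steps in {elapsed:.2f}s; GPU busy ~{gpu_busy:.0%} of the time")
```

In this toy setup the GPU is busy only a bit over half the time, which is the idle fraction the V1 re-architecture aims to recover by overlapping CPU work with GPU execution.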

assets/figures/v1/v1_server_architecture.png: new binary file (300 KiB), not shown.