diff --git a/_posts/2025-01-24-v1.md b/_posts/2025-01-24-v1.md
index 2f3d669..ad6455d 100644
--- a/_posts/2025-01-24-v1.md
+++ b/_posts/2025-01-24-v1.md
@@ -32,7 +32,11 @@ vLLM V1 introduces a comprehensive re-architecture of its core components, inclu
 
 ## 1. Optimized Execution Loop & API Server
 
-![][image2]
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_server_architecture.png">
+</picture>
+</p>
 
 As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM’s core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs become faster and model execution times shrink, the CPU overhead of tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.
 
diff --git a/assets/figures/v1/v1_server_architecture.png b/assets/figures/v1/v1_server_architecture.png
new file mode 100644
index 0000000..7acae43
Binary files /dev/null and b/assets/figures/v1/v1_server_architecture.png differ
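
For intuition on the numbers in that paragraph, here is a small back-of-the-envelope sketch (not part of the diff). It uses the ~5 ms Llama-8B-on-H100 GPU step time quoted in the post; the 2 ms of serialized CPU work per step is an assumed illustrative value, not a measurement from vLLM.

```python
# Back-of-the-envelope: impact of serialized per-step CPU work when the GPU
# forward pass takes only ~5 ms (the Llama-8B-on-H100 figure from the post).
# The 2 ms CPU overhead below is an assumed placeholder, not a measured number.

gpu_step_ms = 5.0      # model forward pass on the GPU (from the post)
cpu_overhead_ms = 2.0  # assumed: scheduling, input prep, detokenization, streaming

ideal_steps_per_s = 1000 / gpu_step_ms                       # if CPU work were fully overlapped
actual_steps_per_s = 1000 / (gpu_step_ms + cpu_overhead_ms)  # CPU work on the critical path
lost = 1 - actual_steps_per_s / ideal_steps_per_s

print(f"GPU-only:   {ideal_steps_per_s:.0f} steps/s")
print(f"Serialized: {actual_steps_per_s:.0f} steps/s (~{lost:.0%} throughput lost to CPU overhead)")
```

Under these assumptions, roughly a quarter of the potential step throughput disappears into CPU work that sits on the critical path, which is the overhead the V1 execution loop and API server redesign targets.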