From f81c314751fe12bb9d9fbc9e4e516809bc4daa25 Mon Sep 17 00:00:00 2001
From: WoosukKwon
Date: Fri, 24 Jan 2025 13:54:01 -0800
Subject: [PATCH] Minor

Signed-off-by: WoosukKwon

---
 _posts/2025-01-24-v1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2025-01-24-v1.md b/_posts/2025-01-24-v1.md
index 47f3ca7..70c595f 100644
--- a/_posts/2025-01-24-v1.md
+++ b/_posts/2025-01-24-v1.md
@@ -70,7 +70,7 @@ vLLM V1 introduces a clean and efficient architecture for tensor-parallel infere
 
-In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively using Numpy operations instead of Python's native ones. +In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python's native ones. ## 6. torch.compile and Piecewise CUDA Graphs