diff --git a/_posts/2025-01-24-v1.md b/_posts/2025-01-24-v1.md
index 47f3ca7..70c595f 100644
--- a/_posts/2025-01-24-v1.md
+++ b/_posts/2025-01-24-v1.md
@@ -70,7 +70,7 @@ vLLM V1 introduces a clean and efficient architecture for tensor-parallel infere

-In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively using Numpy operations instead of Python's native ones.
+In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python's native ones.
 ## 6. torch.compile and Piecewise CUDA Graphs