Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
WoosukKwon 2025-01-24 13:54:01 -08:00
parent 41e379c103
commit f81c314751
1 changed file with 1 addition and 1 deletion


@@ -70,7 +70,7 @@ vLLM V1 introduces a clean and efficient architecture for tensor-parallel infere
</picture>
</p>
-In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively using Numpy operations instead of Python's native ones.
+In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python's native ones.
## 6. torch.compile and Piecewise CUDA Graphs
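
For illustration, here is a minimal sketch of the persistent-batch idea the changed paragraph describes: input buffers are allocated once, then updated in place with vectorized NumPy indexing instead of being rebuilt in Python each step. All names here (`PersistentBatch`, `apply_diff`, the buffer shapes) are hypothetical and do not reflect vLLM's actual implementation.

```python
# Hypothetical sketch of a persistent batch; not vLLM's actual API.
import numpy as np

MAX_BATCH = 8   # assumed fixed capacity for the persistent buffers
MAX_LEN = 16

class PersistentBatch:
    """Preallocated input tensors that are diff-updated at each step."""

    def __init__(self):
        # Buffers are allocated once and reused across all steps,
        # rather than recreated per step as in the V0 description.
        self.token_ids = np.zeros((MAX_BATCH, MAX_LEN), dtype=np.int64)
        self.seq_lens = np.zeros(MAX_BATCH, dtype=np.int32)

    def apply_diff(self, row_indices, new_tokens):
        """Update only the rows that changed since the last step.

        Vectorized NumPy fancy indexing keeps the per-step CPU work
        small compared to rebuilding the tensors with Python loops.
        """
        rows = np.asarray(row_indices)
        cols = self.seq_lens[rows]              # next free slot per row
        self.token_ids[rows, cols] = new_tokens
        self.seq_lens[rows] += 1

batch = PersistentBatch()
# Step 1: rows 0 and 2 each produced one new token.
batch.apply_diff([0, 2], np.array([101, 202]))
# Step 2: only row 0 changed; all other buffer rows are untouched.
batch.apply_diff([0], np.array([103]))
print(batch.seq_lens[:3])  # -> [2 0 1]
```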