commit f81c314751
parent 41e379c103

    Minor

    Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
@@ -70,7 +70,7 @@ vLLM V1 introduces a clean and efficient architecture for tensor-parallel infere
   </picture>
 </p>
 
-In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively using Numpy operations instead of Python's native ones.
+In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the [Persistent Batch](https://github.com/InternLM/lmdeploy) technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overheads in updating the tensors by extensively utilizing Numpy operations instead of Python's native ones.
 
 ## 6. torch.compile and Piecewise CUDA Graphs
 
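
The changed paragraph is the only substantive content in this hunk. Below is a minimal sketch of the caching-plus-diff idea it describes: keep preallocated NumPy buffers alive across steps and apply only per-step diffs, rather than rebuilding the batch from Python lists each step. The names here (`PersistentBatch`, `add_request`, `append_token`, `remove_request`) are illustrative assumptions, not vLLM's actual input-batch API.

```python
# Hedged sketch of the "Persistent Batch" idea: persistent NumPy buffers that
# are updated with diffs each step instead of being recreated from scratch.
import numpy as np


class PersistentBatch:
    def __init__(self, max_reqs: int, max_len: int):
        # Preallocated buffers that persist across scheduler steps.
        self.token_ids = np.zeros((max_reqs, max_len), dtype=np.int32)
        self.num_tokens = np.zeros(max_reqs, dtype=np.int32)
        self.num_reqs = 0

    def add_request(self, prompt_token_ids: list[int]) -> int:
        # Diff: write one new row instead of rebuilding the whole batch.
        row = self.num_reqs
        n = len(prompt_token_ids)
        self.token_ids[row, :n] = prompt_token_ids
        self.num_tokens[row] = n
        self.num_reqs += 1
        return row

    def append_token(self, row: int, token_id: int) -> None:
        # Diff: a single scalar write per decoded token.
        idx = self.num_tokens[row]
        self.token_ids[row, idx] = token_id
        self.num_tokens[row] += 1

    def remove_request(self, row: int) -> None:
        # Diff: swap the last request into the freed slot to keep rows dense.
        last = self.num_reqs - 1
        if row != last:
            self.token_ids[row] = self.token_ids[last]
            self.num_tokens[row] = self.num_tokens[last]
        self.num_reqs -= 1


if __name__ == "__main__":
    batch = PersistentBatch(max_reqs=8, max_len=16)
    r0 = batch.add_request([1, 2, 3])
    r1 = batch.add_request([4, 5])
    batch.append_token(r0, 42)   # decode step touches only one cell
    batch.remove_request(r1)     # finished request is swapped out
    print(batch.token_ids[: batch.num_reqs])
```

Because the buffers are NumPy arrays, the per-step updates are small vectorized writes rather than Python-level loops, which is the CPU-overhead reduction the paragraph refers to.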