Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu>
This commit is contained in:
WoosukKwon 2025-01-24 13:48:03 -08:00
parent 6eff449c37
commit abc8465d71
1 changed file with 6 additions and 6 deletions


@@ -34,7 +34,7 @@ vLLM V1 introduces a comprehensive re-architecture of its core components, inclu
![][image2]
As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM's core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs get faster and model execution times shrink, the CPU overhead of tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.
In the [v0.6.0 release](https://blog.vllm.ai/2024/09/05/perf-update.html), vLLM introduced a multiprocessing API server utilizing ZeroMQ for IPC, enabling overlap between the API server and AsyncLLM. vLLM V1 extends this by integrating the multiprocessing architecture deeper into the core of AsyncLLM, creating an isolated `EngineCore` execution loop that focuses exclusively on the scheduler and model executor. This design allows for greater overlap of CPU-intensive tasks—such as tokenization, multimodal input processing, de-tokenization, and request streaming—with the core execution loop, thereby maximizing model throughput.
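To make the separation concrete, the sketch below shows the general pattern: a dedicated engine-core process that only pulls pre-processed requests and runs the execution loop, connected to the front-end process over ZeroMQ sockets. This is illustrative only; the socket addresses, the `run_engine_core` helper, and the message format are hypothetical and are not vLLM's actual `EngineCore` implementation.

```python
# Illustrative sketch only -- not vLLM's actual EngineCore code.
# A front-end process pushes requests to an isolated "engine core" process
# over ZeroMQ IPC and pulls finished outputs back, so tokenization,
# de-tokenization, and streaming overlap with the core execution loop.
import multiprocessing as mp
import zmq

REQ_ADDR = "ipc:///tmp/engine_requests"   # hypothetical socket addresses
OUT_ADDR = "ipc:///tmp/engine_outputs"

def run_engine_core() -> None:
    """Hypothetical busy loop that only schedules and executes batches."""
    ctx = zmq.Context()
    req_sock = ctx.socket(zmq.PULL)
    req_sock.bind(REQ_ADDR)
    out_sock = ctx.socket(zmq.PUSH)
    out_sock.bind(OUT_ADDR)
    while True:
        request = req_sock.recv_pyobj()              # pre-processed input
        text = f"generated-for:{request['id']}"      # stand-in for a model step
        out_sock.send_pyobj({"id": request["id"], "text": text})

if __name__ == "__main__":
    mp.Process(target=run_engine_core, daemon=True).start()

    ctx = zmq.Context()
    req_sock = ctx.socket(zmq.PUSH)
    req_sock.connect(REQ_ADDR)
    out_sock = ctx.socket(zmq.PULL)
    out_sock.connect(OUT_ADDR)

    # Front-end work (tokenization, streaming, etc.) happens here,
    # overlapping with the engine-core loop in the other process.
    req_sock.send_pyobj({"id": 0, "prompt": "Hello"})
    print(out_sock.recv_pyobj())
```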
@@ -106,19 +106,19 @@ While vLLM V1 shows promising results, it is still in its alpha stage and lacks
V1 supports decoder-only Transformers like Llama, mixture-of-experts (MoE) models like Mixtral, and some Llava-like VLMs such as Pixtral. All quantization methods are supported. However, V1 currently does not support encoder-decoder architectures like multimodal Llama 3.2, Mamba-based models like Jamba, or embedding models. Please check out [our documentation]() for a more detailed list of the supported models.
**Feature Limitations:**
V1 currently lacks support for the log probs and prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, Prometheus metrics, and LoRA. We are actively working to close this feature gap and add new optimizations. Please stay tuned!
**Hardware Support:**
V1 currently supports only Ampere or later NVIDIA GPUs. We are working on support for other hardware backends.
Finally, please note that you can continue using V0 and maintain backward compatibility by not setting `VLLM_USE_V1=1`.
# How to Get Started
To use vLLM V1:
1. Install the latest version of vLLM with `pip install vllm --upgrade`.
2. **Set the environment variable `export VLLM_USE_V1=1`.**
3. Use vLLM's [Python interface](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic.py) or OpenAI-compatible server (`vllm serve <model-name>`). You don't need any change to the existing API; a minimal example is shown below.
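To make step 3 concrete, here is a minimal offline-inference sketch using vLLM's Python interface. The model name is borrowed from the linked basic example and is only illustrative, and setting `VLLM_USE_V1` from inside the script is shown purely for convenience; exporting it in your shell as in step 2 works just as well.

```python
import os

# Opt in to the V1 engine before vLLM is imported (equivalent to step 2).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Any V1-supported model works here; this small model is just an example.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["The future of AI is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```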
Please try it out and share your feedback!