vllm/docs
Xu Wenqing 02658c2dfe
Add DeepSeek-R1-0528 function call chat template (#18874)
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
2025-06-04 13:24:18 +00:00
..
api Migrate docs from Sphinx to MkDocs (#18145) 2025-05-23 02:09:53 -07:00
assets Migrate docs from Sphinx to MkDocs (#18145) 2025-05-23 02:09:53 -07:00
cli [Misc] Add packages for benchmark as extra dependency (#19089) 2025-06-04 04:18:48 -07:00
community [Doc] Add community links (#18657) 2025-05-24 06:06:38 -07:00
configuration [Doc] Move examples and further reorganize user guide (#18666) 2025-05-26 07:38:04 -07:00
contributing [Docs] Add developer doc about CI failures (#18782) 2025-06-04 01:09:13 +00:00
deployment [doc] update docker version (#19074) 2025-06-03 19:08:21 +00:00
design [v1][KVCacheManager] Rename BlockHashType to BlockHash (#19015) 2025-06-03 08:01:48 +00:00
features Add DeepSeek-R1-0528 function call chat template (#18874) 2025-06-04 13:24:18 +00:00
getting_started [doc] clarify windows support (#19088) 2025-06-03 21:42:17 +08:00
mkdocs [Misc] Add SPDX-FileCopyrightText (#19100) 2025-06-03 11:20:17 -07:00
models [Doc] Add InternVL LoRA support (#19055) 2025-06-03 09:08:03 +00:00
serving [doc] improve readability (#18675) 2025-05-25 01:40:31 -07:00
training [Doc] Move examples and further reorganize user guide (#18666) 2025-05-26 07:38:04 -07:00
usage [CPU] V1 support for the CPU backend (#16441) 2025-06-03 18:43:01 -07:00
.nav.yml [doc] add CLI doc (#18871) 2025-05-29 09:51:36 +00:00
README.md [doc] show the count for fork and watch (#18950) 2025-05-30 06:45:59 -07:00

README.md

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM" class="no-scaled-link" width="60%" }

Easy, fast, and cheap LLM serving for everyone

Star Watch Fork

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
  • Prefix caching support
  • Multi-lora support

For more information, check out the following: