Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM" class="no-scaled-link" width="60%" }

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.
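
As a quick taste of the Python API, the sketch below runs offline batched inference; `facebook/opt-125m` is just a small example model and the sampling settings are arbitrary:

```python
from vllm import LLM, SamplingParams

# A minimal offline-inference example; any HuggingFace causal LM that
# vLLM supports can be used in place of the small facebook/opt-125m model.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```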

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
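
Several of the performance features above are switched on through engine arguments. A minimal sketch, assuming an AWQ-quantized checkpoint (the model id below is only an example):

```python
from vllm import LLM

# Sketch: enable AWQ quantization kernels and chunked prefill via engine
# arguments; swap in any AWQ-quantized checkpoint you have access to.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # example AWQ checkpoint (assumption)
    quantization="awq",                    # select the AWQ quantization kernels
    enable_chunked_prefill=True,           # split long prompts into prefill chunks
)
print(llm.generate(["Chunked prefill and AWQ are enabled."])[0].outputs[0].text)
```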

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the example after this list)
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
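
One way to exercise the OpenAI-compatible server is shown below; it assumes the server was started with `vllm serve Qwen/Qwen2.5-1.5B-Instruct` (any supported chat model works) and is listening on the default port 8000:

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
# The API key is ignored unless the server was launched with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the model the server is serving
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(completion.choices[0].message.content)
```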

For more information, check out the following: