vllm/docs
| Name | Last commit | Date |
|------|-------------|------|
| api | Migrate docs from Sphinx to MkDocs (#18145) | 2025-05-23 |
| assets | Migrate docs from Sphinx to MkDocs (#18145) | 2025-05-23 |
| ci | [doc] Fold long code blocks to improve readability (#19926) | 2025-06-23 |
| cli | [doc] Fold long code blocks to improve readability (#19926) | 2025-06-23 |
| community | [doc] use snippets for contact us (#19944) | 2025-06-22 |
| configuration | [doc] Fold long code blocks to improve readability (#19926) | 2025-06-23 |
| contributing | [Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models (#20058) | 2025-06-30 |
| deployment | [Docs] Improve frameworks/helm.md (#20113) | 2025-06-26 |
| design | [Docs] Fix 1-2-3 list in v1/prefix_caching.md (#20243) | 2025-06-30 |
| features | [Frontend] Expand tools even if tool_choice="none" (#17177) | 2025-07-01 |
| getting_started | [CPU] Update custom ops for the CPU backend (#20255) | 2025-07-01 |
| mkdocs | [doc] fix the incorrect logo in dark mode (#20289) | 2025-07-01 |
| models | [Model] Add Tencent HunYuanMoEV1 Model Support (#20114) | 2025-07-01 |
| serving | [Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA (#15897) | 2025-06-30 |
| training | [Doc] Move examples and further reorganize user guide (#18666) | 2025-05-26 |
| usage | [Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA (#15897) | 2025-06-30 |
| .nav.yml | [Doc] Update docs for New Model Implementation (#20115) | 2025-06-26 |
| README.md | [doc] fix the incorrect logo in dark mode (#20289) | 2025-07-01 |

README.md

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.
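
A minimal offline-inference sketch, assuming a Python environment with vLLM installed (the model name below is only an example; any supported Hugging Face checkpoint works):

```python
from vllm import LLM, SamplingParams

# Example checkpoint; substitute any model vLLM supports.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The future of AI is"], sampling)
print(outputs[0].outputs[0].text)
```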

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
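
As a sketch of how some of these knobs surface in the Python API (the AWQ checkpoint name is illustrative, and the exact arguments may vary across vLLM versions):

```python
from vllm import LLM, SamplingParams

# Illustrative AWQ-quantized checkpoint; any compatible quantized model works similarly.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",           # match the quantization scheme baked into the checkpoint
    enable_chunked_prefill=True,  # split long prefills into chunks interleaved with decodes
    gpu_memory_utilization=0.90,  # fraction of GPU memory budgeted for weights + KV cache
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```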

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the client sketch after this list)
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
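
For the OpenAI-compatible server, a minimal client-side sketch using the official `openai` Python package (the model name is an assumption; the server listens on port 8000 by default):

```python
# Start the server first, e.g.:  vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key can be any placeholder
# unless the server was launched with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server is running
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(response.choices[0].message.content)
```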

For more information, check out the following: