# Welcome to vLLM
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests (see the offline-inference sketch after this list)
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
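To make the batched workflow these optimizations feed into concrete, here is a minimal offline-inference sketch using vLLM's Python `LLM` entry point. The model name is only an example, and exact arguments may differ across vLLM releases.

```python
# Minimal offline-inference sketch; the model name is just an example
# and argument names may vary between vLLM releases.
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "What does continuous batching do?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine schedules these prompts together (continuous batching),
# so throughput scales well as more requests are added.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```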
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server (see the client sketch after this list)
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
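Because the server speaks the OpenAI API, any OpenAI-compatible client can talk to it. The sketch below assumes a vLLM server is already running locally (commonly started with `vllm serve <model>`); the base URL, dummy API key, and model name are placeholders.

```python
# Sketch of querying a locally running vLLM OpenAI-compatible server.
# Assumes the server is already up (e.g., started with `vllm serve <model>`);
# base_url, api_key, and the model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of vLLM."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```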
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- [vLLM Meetups][meetups]