vLLM Production Stack: reference stack for production vLLM deployment
| Blog | Docs | Production-Stack Slack Channel | LMCache Slack | Interest Form | Official Email |
Latest News
- 📄 Official documentation released for production-stack!
- ✨ Cloud Deployment Tutorials for Lambda Labs, AWS EKS, and Google Cloud (GCP) are out!
- 🛤️ 2025 Q1 roadmap is released! Join the discussion now!
- 🔥 vLLM Production Stack is released! Check out our release blogs posted on January 22, 2025.
Community Events
We host weekly community meetings at the following timeslot:
- Tuesdays at 5:30 PM PT – Add to Calendar
All are welcome to join!
Introduction
The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, allowing you to:
- 🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code
- 💻 Monitor the metrics through a web dashboard
- 😄 Enjoy the performance benefits brought by request routing and KV cache offloading
Step-By-Step Tutorials
- How To Install Kubernetes (kubectl, helm, minikube, etc)?
- How to Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs, Azure)?
- How To Set up a Minimal vLLM Production Stack?
- How To Customize vLLM Configs (optional)?
- How to Load Your LLM Weights?
- How to Launch Different LLMs in vLLM Production Stack?
- How to Enable KV Cache Offloading with LMCache?
Architecture
The stack is set up using Helm, and contains the following key parts:
- Serving engine: The vLLM engines that run different LLMs.
- Request router: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.
- Observability stack: Monitors backend metrics through Prometheus + Grafana.
Roadmap
We are actively working on this project and will release the following features soon. Please stay tuned!
- Autoscaling based on vLLM-specific metrics
- Support for disaggregated prefill
- Router improvements (e.g., a more performant router implemented in a non-Python language, a KV-cache-aware routing algorithm, better fault tolerance)
Deploying the stack via Helm
Prerequisites
- A running Kubernetes (K8s) environment with GPUs
  - Run `cd utils && bash install-minikube-cluster.sh`
  - Or follow our tutorial
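Before deploying, it can help to confirm that the cluster is reachable and that GPUs are advertised to Kubernetes. A minimal check, assuming the NVIDIA device plugin or GPU Operator is installed (as done by the install script above):

```bash
# Confirm the cluster is reachable and nodes are Ready.
kubectl get nodes

# GPUs are exposed by the NVIDIA device plugin as the "nvidia.com/gpu" resource;
# a non-zero count here means the serving engines can be scheduled onto GPUs.
kubectl describe nodes | grep -i "nvidia.com/gpu"
```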
Deployment
vLLM Production Stack can be deployed via Helm charts. Clone the repository and run the following commands for a minimal deployment:
```bash
git clone https://github.com/vllm-project/production-stack.git
cd production-stack/
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
The deployed stack exposes the same OpenAI-compatible API as vLLM and can be accessed through a Kubernetes service.
To validate the installation and send a query to the stack, refer to this tutorial.
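As a quick sanity check, you can port-forward the router service and query the OpenAI-compatible endpoints directly. The sketch below assumes the minimal example values file and its default service name and model; if your release or values differ, check `kubectl get services` and `/v1/models` first.

```bash
# Forward the router service to a local port
# (the service name may differ; list services with `kubectl get services` if unsure).
kubectl port-forward svc/vllm-router-service 30080:80

# In another terminal: list the models currently served by the stack.
curl http://localhost:30080/v1/models

# Send a completion request (facebook/opt-125m is the model used in the minimal example).
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Once upon a time,", "max_tokens": 10}'
```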
For more information about customizing the Helm chart, please refer to values.yaml and our other tutorials.
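As an illustration of overriding the chart defaults, the sketch below writes a small values file and upgrades the release with it. The field names follow the tutorial values files (e.g., `servingEngineSpec.modelSpec`); treat them as an example and consult values.yaml for the authoritative schema.

```bash
# Hypothetical override: serve facebook/opt-125m with one replica and one GPU.
cat > my-values.yaml <<'EOF'
servingEngineSpec:
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
EOF

# Apply the override to the existing "vllm" release.
helm upgrade vllm vllm/vllm-stack -f my-values.yaml
```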
Uninstall
```bash
helm uninstall vllm
```
Grafana Dashboard
Features
The Grafana dashboard provides the following insights:
- Available vLLM Instances: Displays the number of healthy instances.
- Request Latency Distribution: Visualizes end-to-end request latency.
- Time-to-First-Token (TTFT) Distribution: Monitors response times for token generation.
- Number of Running Requests: Tracks the number of active requests per instance.
- Number of Pending Requests: Tracks requests waiting to be processed.
- GPU KV Usage Percent: Monitors GPU KV cache usage.
- GPU KV Cache Hit Rate: Displays the hit rate for the GPU KV cache.
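To view these panels locally, port-forwarding the Grafana service is usually enough. Service names, namespaces, and ports depend on how the observability stack was installed (see observability/README.md), so the names below are placeholders:

```bash
# List the services created by the observability stack (namespace is a placeholder).
kubectl get svc -n monitoring

# Port-forward Grafana and open http://localhost:3000 in a browser
# (replace <grafana-service> and adjust the ports to match your install).
kubectl -n monitoring port-forward svc/<grafana-service> 3000:80
```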
Configuration
See the details in observability/README.md
Router
The router ensures efficient request distribution among backends. It supports:
- Routing to endpoints that run different models
- Exporting observability metrics for each serving engine instance, including QPS, time-to-first-token (TTFT), number of pending/running/finished requests, and uptime
- Automatic service discovery and fault tolerance via the Kubernetes API
- Model aliases
- Multiple routing algorithms:
- Round-robin routing
- Session-ID based routing
- Prefix-aware routing (WIP)
Please refer to the router documentation for more details.
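For example, with session-ID based routing enabled, requests carrying the same session value are pinned to the same serving engine so its KV cache can be reused. The header name below is hypothetical; use whatever session key the router is configured with (see the router documentation):

```bash
# Requests sharing the same (hypothetical) session header land on the same backend,
# maximizing KV cache reuse; assumes the router is port-forwarded to localhost:30080.
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -H "x-user-id: alice" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 10}'
```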
Contributing
We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.
License
This project is licensed under Apache License 2.0. See the LICENSE file for details.
Sponsors
We are grateful to our sponsors who support our development and benchmarking efforts:
For any issues or questions, feel free to open an issue or contact us (@ApostaC, @YuhanLiu11, @Shaoting-Feng).