diff --git a/_posts/2025-01-21-stack-release.md b/_posts/2025-01-21-stack-release.md
new file mode 100644
index 0000000..4810ca2
--- /dev/null
+++ b/_posts/2025-01-21-stack-release.md
@@ -0,0 +1,106 @@
---
layout: post
title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
thumbnail-img: /assets/figures/stack/stack-thumbnail.png
share-img: /assets/figures/stack/stack-thumbnail.png
author: LMCache Team
image: /assets/figures/stack/stack-thumbnail.png
---
## TL;DR
- **vLLM** boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine into a premier LLM serving system?
- **Today, we release “vLLM production-stack”**, a vLLM-based full inference stack that introduces two major advantages:
  - **Up to 10x better performance** (3-10x lower response delay & 2-5x higher throughput) with prefix-aware request routing and KV-cache sharing.
  - **Easy cluster deployment** with built-in support for fault tolerance, autoscaling, and observability.
- And the best part? It’s **open-source**, so everyone can get started right away! [https://github.com/vllm-project/production-stack](https://github.com/vllm-project/production-stack)


# The Context

*In the AI arms race, it’s no longer just about who has the best model; it’s about **who has the best LLM serving system**.*

**vLLM** has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on **single-node** deployments.

How do we extend its power into a **full-stack** inference system that any organization can deploy at scale with *high reliability*, *high throughput*, and *low latency*? That’s precisely why the LMCache team and the vLLM team built **vLLM production-stack**.
# Introducing "*vLLM Production-Stack*"
**vLLM Production-Stack** is an open-source **reference implementation** of an **inference stack** built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths:
- **KV cache sharing & storage** to speed up inference when context is reused (powered by the [**LMCache**](https://github.com/LMCache/LMCache) project).
- **Prefix-aware routing** that sends queries to the vLLM instance already holding the relevant context KV cache.
- **Observability** of individual engine status and query-level metrics (TTFT, TBT, throughput).
- **Autoscaling** to handle dynamic workloads.

A short client sketch after the comparison below illustrates the request pattern that the first two features accelerate.

### Comparison with Alternatives

Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:
*(Figure: comparison of vLLM production-stack with alternative serving solutions)*
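To make the first two features concrete, here is a minimal client sketch of the workload they accelerate: a multi-round chat in which every request re-sends the growing conversation history, so consecutive requests share a long prefix. The router URL, port, model name, and API key below are placeholders, and the sketch assumes the stack’s router exposes vLLM’s OpenAI-compatible API; adjust all of them to your own deployment.

```python
# Hedged sketch: a multi-round chat loop whose requests share a growing prefix.
# Assumptions (not from the post): the router is reachable at ROUTER_URL and exposes
# vLLM's OpenAI-compatible API; MODEL is whatever model you actually deployed.
from openai import OpenAI

ROUTER_URL = "http://localhost:30080/v1"    # placeholder: your router's service URL
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name

client = OpenAI(base_url=ROUTER_URL, api_key="EMPTY")

history = [{"role": "system", "content": "You are a helpful assistant."}]
for question in ["What is vLLM?", "How does it manage KV cache?", "Why shard across nodes?"]:
    # Each round re-sends the full history, so all rounds share a long common prefix.
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model=MODEL, messages=history)
    answer = reply.choices[0].message.content or ""
    history.append({"role": "assistant", "content": answer})
    print(f"Q: {question}\nA: {answer[:80]}...")
```

With prefix-aware routing and KV-cache sharing, every round after the first can be served by the instance that already holds the KV cache for the shared history instead of recomputing it.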
### The Design
The vLLM production-stack architecture builds on top of vLLM’s powerful single-node engine to provide a cluster-wide solution.

At a high level:
- Applications send LLM inference requests.
- Prefix-aware routing checks whether the requested context is already cached in the memory pool of one instance, and forwards the request to the node that holds the pre-computed cache.
- Autoscaling and a cluster manager watch the overall load and spin up new vLLM nodes if needed.
- Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.

A toy routing sketch follows the architecture diagram below.
*(Figure: vLLM production-stack architecture overview)*
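To illustrate the routing step in the diagram above, here is a toy sketch, not the production-stack’s actual router: it remembers which instance recently served each block-aligned prompt prefix and routes a new request to the instance with the longest matching prefix, falling back to the least-loaded instance. The endpoint URLs and block size are arbitrary placeholders.

```python
# Toy illustration of prefix-aware routing (NOT the actual production-stack router).
# Idea: route a request to the vLLM instance most likely to already hold the KV cache
# for its longest shared prefix; otherwise pick the least-loaded instance.
from collections import defaultdict

class ToyPrefixRouter:
    def __init__(self, endpoints, block=16):
        self.endpoints = list(endpoints)
        self.block = block                 # prefixes are matched in fixed-size chunks
        self.prefix_owner = {}             # block-aligned prefix -> endpoint that served it
        self.load = defaultdict(int)       # endpoint -> in-flight request count

    def route(self, prompt: str) -> str:
        # Longest block-aligned prefix we have seen before wins.
        target = None
        for cut in range(len(prompt) // self.block * self.block, 0, -self.block):
            owner = self.prefix_owner.get(prompt[:cut])
            if owner is not None:
                target = owner
                break
        if target is None:                 # no cached prefix: fall back to least-loaded
            target = min(self.endpoints, key=lambda e: self.load[e])
        # Record this prompt's prefixes so future requests can follow it.
        for cut in range(self.block, len(prompt) + 1, self.block):
            self.prefix_owner.setdefault(prompt[:cut], target)
        self.load[target] += 1             # a real router would decrement on completion
        return target

router = ToyPrefixRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
print(router.route("You are a helpful assistant. What is vLLM?"))     # no prefix yet -> least loaded
print(router.route("You are a helpful assistant. What is LMCache?"))  # shared prefix -> same instance
```

The real router also has to handle cache eviction, instance health, and load limits; the point here is only the prefix-matching idea.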
# Advantage #1: Easy Deployment

Use the Helm chart to deploy the vLLM production-stack to your K8s cluster by running a single command:
```bash
sudo helm repo add llmstack-repo https://lmcache.github.io/helm/ &&\
  sudo helm install llmstack llmstack-repo/vllm-stack
```

For more details, please refer to the README in the [vLLM production-stack repo](https://github.com/vllm-project/production-stack). [Tutorials](https://github.com/LMCache/LMStack/tree/main/tutorials) on setting up a K8s cluster and customizing the Helm charts are also available.

# Advantage #2: Better Performance
We benchmarked a multi-round Q&A workload on the vLLM production-stack and other setups, including vLLM + KServe and a commercial endpoint service.
The results show that the vLLM production-stack outperforms the other setups across key metrics (time to first token and inter-token latency). A client-side sketch of how these metrics can be measured follows the benchmark figures below.
*(Figures: benchmark results for time to first token and inter-token latency)*
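For readers who want a rough reproduction of these metrics, TTFT and inter-token latency can be approximated from the client side with a single streaming request. This is a minimal sketch, not the benchmark harness behind the figures above; the endpoint URL and model name are placeholders, and it assumes an OpenAI-compatible streaming API.

```python
# Hedged sketch: client-side approximation of TTFT and inter-token latency (ITL)
# via one streaming request. Endpoint URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")  # placeholder URL

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the benefits of KV-cache sharing."}],
    stream=True,
)

ttft, gaps, last = None, [], start
for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = now - start       # time to first token
        else:
            gaps.append(now - last)  # gap between consecutive streamed chunks (~tokens)
        last = now

print(f"TTFT: {ttft:.3f}s")
if gaps:
    print(f"mean ITL: {sum(gaps) / len(gaps) * 1000:.1f}ms over {len(gaps)} gaps")
```

Each streamed chunk may carry more than one token, so the gap statistics are an approximation; a real benchmark would also sweep request rates and context lengths.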
# Advantage #3: Effortless Monitoring
Track your LLM inference cluster in real time with key metrics, including latency distributions, the number of requests over time, and the KV cache hit rate. A minimal metrics-scraping sketch follows the dashboard below.
*(Figure: monitoring dashboard showing cluster metrics)*
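Under the hood, the dashboard is driven by Prometheus-style metrics scraped from the serving engines. As a rough illustration, the sketch below pulls the raw metrics endpoint of a single vLLM engine and prints a few counters; the engine URL is a placeholder, and the exact metric names depend on your vLLM version, so treat the `vllm:`-prefixed names as assumptions to verify against your own output.

```python
# Hedged sketch: pull the Prometheus-style metrics a vLLM serving engine exposes at
# /metrics and print a few of them. The engine URL is a placeholder, and the metric
# names below are assumptions that may differ across vLLM versions.
import requests

ENGINE_URL = "http://localhost:8000"  # placeholder: one serving engine's address

text = requests.get(f"{ENGINE_URL}/metrics", timeout=5).text
interesting = (
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
    "vllm:time_to_first_token_seconds",
)
for line in text.splitlines():
    if line.startswith(interesting):
        print(line)
```

In the stack these metrics are typically scraped by Prometheus and visualized in Grafana rather than polled by hand.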
## Conclusion
We’re thrilled to unveil **vLLM Production Stack**: the next step in transforming vLLM from a best-in-class single-node engine into a full-scale LLM serving system.
We believe the vLLM production-stack will open new doors for organizations seeking to build, test, and deploy LLM applications at scale without sacrificing performance or simplicity.

If you’re as excited as we are, don’t wait!
- **Clone the repo: [https://github.com/vllm-project/production-stack](https://github.com/vllm-project/production-stack)**
- **Kick the tires**
- **Let us know what you think!**
- **[Interest Form](https://forms.gle/mQfQDUXbKfp2St1z7)**


Join us to build a future where every application can harness the power of LLM inference: reliably, at scale, and without breaking a sweat.
*Happy deploying!*

Contacts:
- **vLLM [slack](https://slack.vllm.ai/)**
- **LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ)**
diff --git a/assets/figures/stack/stack-itl.png b/assets/figures/stack/stack-itl.png
new file mode 100644
index 0000000..415ba85
Binary files /dev/null and b/assets/figures/stack/stack-itl.png differ
diff --git a/assets/figures/stack/stack-overview-2.png b/assets/figures/stack/stack-overview-2.png
new file mode 100644
index 0000000..38c462b
Binary files /dev/null and b/assets/figures/stack/stack-overview-2.png differ
diff --git a/assets/figures/stack/stack-panel.png b/assets/figures/stack/stack-panel.png
new file mode 100644
index 0000000..adc2ad3
Binary files /dev/null and b/assets/figures/stack/stack-panel.png differ
diff --git a/assets/figures/stack/stack-table.png b/assets/figures/stack/stack-table.png
new file mode 100644
index 0000000..5c8a2c8
Binary files /dev/null and b/assets/figures/stack/stack-table.png differ
diff --git a/assets/figures/stack/stack-thumbnail.png b/assets/figures/stack/stack-thumbnail.png
new file mode 100644
index 0000000..5f94c3a
Binary files /dev/null and b/assets/figures/stack/stack-thumbnail.png differ
diff --git a/assets/figures/stack/stack-ttft.png b/assets/figures/stack/stack-ttft.png
new file mode 100644
index 0000000..961969e
Binary files /dev/null and b/assets/figures/stack/stack-ttft.png differ
diff --git a/assets/figures/stack/temp b/assets/figures/stack/temp
new file mode 100644
index 0000000..ce01362
--- /dev/null
+++ b/assets/figures/stack/temp
@@ -0,0 +1 @@
+hello