From 41cd6ebf99f7efc4dbdb2c9164b7a8a9ac50c838 Mon Sep 17 00:00:00 2001
From: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Date: Fri, 24 Jan 2025 11:24:43 -0600
Subject: [PATCH 1/6] Add files via upload
---
_posts/2025-01-21-stack-release (1).md | 106 +++++++++++++++++++++++++
1 file changed, 106 insertions(+)
create mode 100644 _posts/2025-01-21-stack-release (1).md
diff --git a/_posts/2025-01-21-stack-release (1).md b/_posts/2025-01-21-stack-release (1).md
new file mode 100644
index 0000000..f6d1a51
--- /dev/null
+++ b/_posts/2025-01-21-stack-release (1).md
@@ -0,0 +1,106 @@
+---
+layout: post
+title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
+thumbnail-img: /assets/img/stack-thumbnail.png
+share-img: /assets/img/stack-thumbnail.png
+author: LMCache Team
+image: /assets/img/stack-thumbnail.png
+---
+
+
+
+## TL;DR
+- **vLLM** boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine to a premier LLM serving system?
+- **Today, we release “vLLM production-stack”**, a vLLM-based full inference stack that introduces two major advantages:
+ - **10x better performance** (3-10x lower response delay & 2-5x higher throughput) with prefix-aware request routing and KV-cache sharing.
+ - **Easy cluster deployment** with built-in support for fault tolerance, autoscaling, and observability.
+- And the best part? It's **open-source**, so everyone can get started right away! [**https://github.com/vllm-project/production-stack**](https://github.com/vllm-project/production-stack)
+
+
+# The Context
+
+
+*In the AI arms race, it’s no longer just about who has the best model—it’s about **who has the best LLM serving system**.*
+
+**vLLM** has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on **single-node** deployments.
+
+How do we extend its power into a **full-stack** inference system that any organization can deploy at scale with *high reliability*, *high throughput*, and *low latency*? That’s precisely why the LMCache team and the vLLM team built **vLLM production-stack**.
+
+
+

+
+
+# Introducing "*vLLM Production-Stack*"
+**vLLM Production-stack** is an open-source **reference implementation** of an **inference stack** built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths:
+- **KV cache sharing & storage** to speed up inference when context is reused (powered by the [**LMCache**](https://github.com/LMCache/LMCache) project).
+- **Prefix-aware routing** that sends queries to the vLLM instance already holding the relevant context KV cache.
+- **Observability** of individual engine status and query-level metrics (TTFT, TBT, throughput).
+- **Autoscaling** to handle workload dynamics.
+
+### Comparison with Alternatives
+
+Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:
+
+

+
+
+### The Design
+The vLLM production-stack architecture builds on top of vLLM’s powerful single-node engine to provide a cluster-wide solution.
+
+At a high level:
+- Applications send LLM inference requests.
+- Prefix-aware routing checks whether the requested context is already cached in the memory pool of one instance and forwards the request to the node holding that pre-computed cache (see the sketch after the architecture figure below).
+- Autoscaling and a cluster manager watch the overall load and spin up new vLLM nodes if needed.
+- Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.
+
+
+

+
+
+# Advantage #1: Easy Deployment
+
+Use the Helm chart to deploy the vLLM production-stack to your K8s cluster by running a single command:
+```
+sudo helm repo add llmstack-repo https://lmcache.github.io/helm/ &&\
+ sudo helm install llmstack llmstack-repo/vllm-stack
+```
+
+For more details, please refer to the README in the [vLLM production-stack repo](https://github.com/vllm-project/production-stack). [Tutorials](https://github.com/LMCache/LMStack/tree/main/tutorials) on setting up a K8s cluster and customizing Helm charts are also available.
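+
+Once the pods are up, a quick way to sanity-check the deployment is to port-forward the router service and send an OpenAI-compatible request. The sketch below is illustrative: the service name, local port, and model are placeholders, so check `kubectl get svc` and your Helm values for the actual ones.
+
+```python
+# Smoke test against the stack's OpenAI-compatible endpoint, assuming the
+# router service has been port-forwarded first, e.g.:
+#   kubectl port-forward svc/<router-service> 30080:80
+import requests
+
+BASE_URL = "http://localhost:30080/v1"
+
+# List the models served by the stack.
+models = requests.get(f"{BASE_URL}/models").json()["data"]
+print("Serving models:", [m["id"] for m in models])
+
+# Send a small completion request to the first model.
+resp = requests.post(
+    f"{BASE_URL}/completions",
+    json={"model": models[0]["id"], "prompt": "Hello, K8s!", "max_tokens": 32},
+)
+print(resp.json()["choices"][0]["text"])
+```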
+
+# Advantage #2: Better Performance
+We benchmark a multi-round Q&A workload on vLLM production-stack and other setups, including vLLM + KServe and a commercial endpoint service.
+The results show vLLM production-stack outperforms the other setups across key metrics (time-to-first-token and inter-token latency).
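+
+If you want to run a similar comparison on your own deployment, the sketch below shows one simple way to measure TTFT and inter-token latency against any OpenAI-compatible streaming endpoint. It is not the exact benchmark harness we used; the base URL and model name are placeholders.
+
+```python
+# Measure TTFT and inter-token latency from a streaming completion.
+import time
+import requests
+
+BASE_URL = "http://localhost:30080/v1"  # placeholder endpoint
+
+start = time.perf_counter()
+token_times = []
+with requests.post(
+    f"{BASE_URL}/completions",
+    json={"model": "<your-model>", "prompt": "Tell me about Kubernetes.",
+          "max_tokens": 128, "stream": True},
+    stream=True,
+) as resp:
+    for line in resp.iter_lines():
+        # Server-sent events: each token chunk arrives as a "data: {...}" line.
+        if line.startswith(b"data: ") and line != b"data: [DONE]":
+            token_times.append(time.perf_counter())
+
+ttft = token_times[0] - start
+itl = [b - a for a, b in zip(token_times, token_times[1:])]
+print(f"TTFT: {ttft * 1000:.1f} ms")
+print(f"Mean inter-token latency: {1000 * sum(itl) / len(itl):.1f} ms")
+```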
+
+
+

+
+
+
+

+
+
+# Advantage #3: Effortless Monitoring
+Track your LLM inference cluster in real time with key metrics, including latency distributions, number of requests over time, and KV cache hit rate.
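+
+For ad-hoc checks without opening Grafana, you can also scrape the Prometheus-style `/metrics` endpoint that each vLLM engine exposes. Below is a minimal sketch, assuming a locally reachable engine; exact metric names can vary across vLLM versions.
+
+```python
+# Print the vLLM-exported Prometheus metrics from one engine instance.
+import requests
+
+metrics = requests.get("http://localhost:8000/metrics").text  # placeholder URL
+for line in metrics.splitlines():
+    # vLLM metrics are conventionally prefixed with "vllm:".
+    if line.startswith("vllm:"):
+        print(line)
+```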
+
+
+

+
+
+
+## Conclusion
+We’re thrilled to unveil **vLLM Production Stack**—the next step in transforming vLLM from a best-in-class single-node engine into a full-scale LLM serving system.
+We believe vLLM production-stack will open new doors for organizations seeking to build, test, and deploy LLM applications at scale without sacrificing performance or simplicity.
+
+If you’re as excited as we are, don’t wait!
+- **Clone the repo: [https://github.com/vllm-project/production-stack](https://github.com/vllm-project/production-stack)**
+- **Kick the tires**
+- **Let us know what you think!**
+- **[Interest Form](https://forms.gle/mQfQDUXbKfp2St1z7)**
+
+
+Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat.
+*Happy deploying!*
+
+Contacts:
+- **vLLM [slack](https://slack.vllm.ai/)**
+- **LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ)**
From 0a0111f2ec745fad67536a3bdc235e2254de399f Mon Sep 17 00:00:00 2001
From: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Date: Fri, 24 Jan 2025 11:25:04 -0600
Subject: [PATCH 2/6] Rename 2025-01-21-stack-release (1).md to
2025-01-21-stack-release.md
---
...025-01-21-stack-release (1).md => 2025-01-21-stack-release.md} | 0
1 file changed, 0 insertions(+), 0 deletions(-)
rename _posts/{2025-01-21-stack-release (1).md => 2025-01-21-stack-release.md} (100%)
diff --git a/_posts/2025-01-21-stack-release (1).md b/_posts/2025-01-21-stack-release.md
similarity index 100%
rename from _posts/2025-01-21-stack-release (1).md
rename to _posts/2025-01-21-stack-release.md
From f16547db6d78eb33029ec759e3f4d39afdc8a6c7 Mon Sep 17 00:00:00 2001
From: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Date: Fri, 24 Jan 2025 11:25:51 -0600
Subject: [PATCH 3/6] Create temp
---
assets/figures/stack/temp | 1 +
1 file changed, 1 insertion(+)
create mode 100644 assets/figures/stack/temp
diff --git a/assets/figures/stack/temp b/assets/figures/stack/temp
new file mode 100644
index 0000000..ce01362
--- /dev/null
+++ b/assets/figures/stack/temp
@@ -0,0 +1 @@
+hello
From 47d9a477b797f6b1e0e04df7ccdc67eec4f125f5 Mon Sep 17 00:00:00 2001
From: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Date: Fri, 24 Jan 2025 11:27:27 -0600
Subject: [PATCH 4/6] Add files via upload
---
assets/figures/stack/stack-itl.png | Bin 0 -> 75165 bytes
assets/figures/stack/stack-overview-2.png | Bin 0 -> 445808 bytes
assets/figures/stack/stack-panel.png | Bin 0 -> 209281 bytes
assets/figures/stack/stack-table.png | Bin 0 -> 78527 bytes
assets/figures/stack/stack-thumbnail.png | Bin 0 -> 182919 bytes
assets/figures/stack/stack-ttft.png | Bin 0 -> 82183 bytes
6 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 assets/figures/stack/stack-itl.png
create mode 100644 assets/figures/stack/stack-overview-2.png
create mode 100644 assets/figures/stack/stack-panel.png
create mode 100644 assets/figures/stack/stack-table.png
create mode 100644 assets/figures/stack/stack-thumbnail.png
create mode 100644 assets/figures/stack/stack-ttft.png
diff --git a/assets/figures/stack/stack-itl.png b/assets/figures/stack/stack-itl.png
new file mode 100644
index 0000000000000000000000000000000000000000..415ba85a470ef8b6f24281d1eec854e40e5474bf
GIT binary patch
literal 75165
zcmeFZ1y@{4*DVaBlR$8SB|(F`OK?kqySo$I-6cSP;O_3Ojk`++cemi~b~nj6@AG}{
z^9$}6cd&b;X|i`!t(rC0Tx(VJ4{6D-h;QD%fr5fU6cHAZg@Ss~4Fv^*1p5m3B*LVu
z3MSK|(P1
zsWrWloQXS2s2vqsw%|>wbH5A6n=h}v