From 46d4516b5438c114946b6138e1ce9aa194611f45 Mon Sep 17 00:00:00 2001
From: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Date: Fri, 24 Jan 2025 11:29:01 -0600
Subject: [PATCH] Update 2025-01-21-stack-release.md

---
 _posts/2025-01-21-stack-release.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/_posts/2025-01-21-stack-release.md b/_posts/2025-01-21-stack-release.md
index f6d1a51..6fd6e09 100644
--- a/_posts/2025-01-21-stack-release.md
+++ b/_posts/2025-01-21-stack-release.md
@@ -1,10 +1,10 @@
 ---
 layout: post
 title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
-thumbnail-img: /assets/img/stack-thumbnail.png
-share-img: /assets/img/stack-thumbnail.png
+thumbnail-img: /assets/figure/stack/stack-thumbnail.png
+share-img: /assets/figure/stack/stack-thumbnail.png
 author: LMCache Team
-image: /assets/img/stack-thumbnail.png
+image: /assets/figure/stack/stack-thumbnail.png
 ---
@@ -27,7 +27,7 @@ image: /assets/img/stack-thumbnail.png
 How do we extend its power into a **full-stack** inference system that any organization can deploy at scale with *high reliability*, *high throughput*, and *low latency*?
 That’s precisely why the LMCache team and the vLLM team built **vLLM production-stack**.
-Icon
+Icon
 # Introducing "*vLLM Production-Stack*"
@@ -41,7 +41,7 @@ How do we extend its power into a **full-stack** inference system that any organ
 Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:
-Icon
+Icon
 ### The Design
@@ -54,7 +54,7 @@ At a high level:
 - Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.
-Icon
+Icon
 # Advantage #1: Easy Deployment
@@ -72,18 +72,18 @@ We conduct a benchmark of multi-round Q&A workload on vLLM production-stack and
 The results show vLLM stack outperforms other setups across key metrics (time to first token and inter token latency).
-Icon
+Icon
-Icon
+Icon
 # Advantage #3: Effortless Monitoring
 Keep real-time tracking of your LLM inference cluster with key metrics including latency distributions, number of requests over time, KV cache hit rate.
-Icon
+Icon
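
To make the latency metrics referenced in the post's context above (TTFT and inter-token latency) concrete, here is a minimal sketch of how they can be measured by streaming one request from an OpenAI-compatible endpoint, such as the router service of a production-stack deployment. This is not the project's official benchmark harness; the base URL, API key, and model name are placeholders, so substitute the values from your own cluster.

```python
# A minimal sketch (not the official benchmark): measure time-to-first-token
# (TTFT) and mean inter-token latency by streaming one chat completion from an
# OpenAI-compatible endpoint. base_url, api_key, and model are placeholders.
import time

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30080/v1",  # placeholder: your router/service URL
    api_key="EMPTY",                       # placeholder: many vLLM setups ignore the key
)

start = time.perf_counter()
token_times = []

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV-cache reuse in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

if token_times:
    ttft = token_times[0] - start
    itl = (
        (token_times[-1] - token_times[0]) / (len(token_times) - 1)
        if len(token_times) > 1
        else 0.0
    )
    print(f"TTFT: {ttft:.3f} s, mean inter-token latency: {itl * 1000:.1f} ms")
else:
    print("No tokens streamed back; check the endpoint and model name.")
```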
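Similarly, the monitoring metrics mentioned in the post (latency distributions, request counts over time, KV cache hit rate) can be inspected by scraping a Prometheus-style `/metrics` endpoint. The sketch below is an assumption-laden illustration: the endpoint URL and the keyword filter are placeholders rather than the stack's documented metric names, which depend on your vLLM version and observability configuration.

```python
# A rough sketch, not the documented production-stack API: list a few
# Prometheus-style metrics from a serving engine's /metrics endpoint.
# The URL and keyword filter are assumptions for illustration only.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # placeholder endpoint
KEYWORDS = ("time_to_first_token", "token", "request", "cache")

body = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(body):
    # Keep only metric families whose names look related to latency,
    # request volume, or cache behavior.
    if any(key in family.name for key in KEYWORDS):
        for sample in family.samples[:3]:  # print a few samples per family
            print(f"{sample.name}{sample.labels} = {sample.value}")
```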