Update 2025-01-21-stack-release.md

Hanchenli 2025-01-24 11:29:01 -06:00, committed by GitHub
parent 47d9a477b7
commit 46d4516b54
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in the database)
1 changed file with 9 additions and 9 deletions

2025-01-21-stack-release.md

@@ -1,10 +1,10 @@
 ---
 layout: post
 title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
-thumbnail-img: /assets/img/stack-thumbnail.png
-share-img: /assets/img/stack-thumbnail.png
+thumbnail-img: /assets/figure/stack/stack-thumbnail.png
+share-img: /assets/figure/stack/stack-thumbnail.png
 author: LMCache Team
-image: /assets/img/stack-thumbnail.png
+image: /assets/figure/stack/stack-thumbnail.png
 ---
 <br>
@@ -27,7 +27,7 @@ image: /assets/img/stack-thumbnail.png
 How do we extend its power into a **full-stack** inference system that any organization can deploy at scale with *high reliability*, *high throughput*, and *low latency*? That's precisely why the LMCache team and the vLLM team built **vLLM production-stack**.
 <div align="center">
-<img src="/assets/img/stack-thumbnail.png" alt="Icon" style="width: 60%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-thumbnail.png" alt="Icon" style="width: 60%; vertical-align:middle;">
 </div>
 # Introducing "*vLLM Production-Stack*"
@@ -41,7 +41,7 @@ How do we extend its power into a **full-stack** inference system that any organ
 Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:
 <div align="center">
-<img src="/assets/img/stack-table.png" alt="Icon" style="width: 90%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-table.png" alt="Icon" style="width: 90%; vertical-align:middle;">
 </div>
 ### The Design
@@ -54,7 +54,7 @@ At a high level:
 - Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system's health.
 <div align="center">
-<img src="/assets/img/stack-overview-2.png" alt="Icon" style="width: 90%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-overview-2.png" alt="Icon" style="width: 90%; vertical-align:middle;">
 </div>
 # Advantage #1: Easy Deployment
@@ -72,18 +72,18 @@ We conduct a benchmark of multi-round Q&A workload on vLLM production-stack and
 The results show vLLM production-stack outperforms other setups across key metrics (time-to-first-token and inter-token latency).
 <div align="center">
-<img src="/assets/img/stack-ttft.png" alt="Icon" style="width: 60%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-ttft.png" alt="Icon" style="width: 60%; vertical-align:middle;">
 </div>
 <div align="center">
-<img src="/assets/img/stack-itl.png" alt="Icon" style="width: 60%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-itl.png" alt="Icon" style="width: 60%; vertical-align:middle;">
 </div>
 # Advantage #3: Effortless Monitoring
 Keep real-time tracking of your LLM inference cluster with key metrics including latency distributions, number of requests over time, and KV cache hit rate.
 <div align="center">
-<img src="/assets/img/stack-panel.png" alt="Icon" style="width: 70%; vertical-align:middle;">
+<img src="/assets/figure/stack/stack-panel.png" alt="Icon" style="width: 70%; vertical-align:middle;">
 </div>
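
For context on the latency metrics the post cites (TTFT and inter-token latency), here is a minimal client-side measurement sketch against an OpenAI-compatible vLLM endpoint. The router URL, port, prompt, and model name are placeholders, not values taken from the post or from production-stack defaults.

```python
# Minimal sketch (assumptions: an OpenAI-compatible vLLM/production-stack
# endpoint reachable at BASE_URL; MODEL is a placeholder name).
import json
import time

import requests

BASE_URL = "http://localhost:30080/v1/completions"  # assumed router endpoint
MODEL = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder model name

payload = {
    "model": MODEL,
    "prompt": "Explain KV-cache reuse in one sentence.",
    "max_tokens": 64,
    "stream": True,
}

token_times = []
start = time.perf_counter()
with requests.post(BASE_URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-compatible streaming sends Server-Sent Events: "data: {...}"
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            token_times.append(time.perf_counter())

if token_times:
    ttft = token_times[0] - start  # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    print(f"TTFT: {ttft * 1000:.1f} ms")
    if itl:
        print(f"mean inter-token latency: {sum(itl) / len(itl) * 1000:.1f} ms")
```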
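
The monitoring panels referenced in the last hunk surface latency distributions, request counts, and KV cache usage. A minimal sketch of scraping an engine's Prometheus-format /metrics endpoint follows; the port and the exact metric names are assumptions and may differ across vLLM versions and the stack's own exporters.

```python
# Minimal sketch (assumptions: a vLLM engine exposing Prometheus-format
# metrics at METRICS_URL; metric names vary across vLLM versions).
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed per-engine endpoint
WATCHED_PREFIXES = (
    "vllm:num_requests_running",   # in-flight requests
    "vllm:gpu_cache_usage_perc",   # KV-cache utilization
    "vllm:time_to_first_token",    # TTFT histogram series
)

text = requests.get(METRICS_URL, timeout=10).text
for line in text.splitlines():
    if line.startswith(WATCHED_PREFIXES):
        print(line)
```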