diff --git a/_posts/2023-06-20-vllm.md b/_posts/2023-06-20-vllm.md
index e7e2211..330f982 100644
--- a/_posts/2023-06-20-vllm.md
+++ b/_posts/2023-06-20-vllm.md
@@ -3,6 +3,7 @@ layout: post
 title: "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention"
 author: "Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution)"
 extra: "


 "
+image: /assets/logos/vllm-logo-text-light.png
 ---

 GitHub | Documentation | Paper
diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index ff12fd7..b558812 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -2,6 +2,7 @@
 layout: post
 title: "Notes on vLLM v.s. DeepSpeed-FastGen"
 author: "vLLM Team"
+image: /assets/figures/notes-vllm-vs-deepspeed/s2.png
 ---

 ---
diff --git a/_posts/2024-07-23-llama31.md b/_posts/2024-07-23-llama31.md
index 364a078..99e5ba2 100644
--- a/_posts/2024-07-23-llama31.md
+++ b/_posts/2024-07-23-llama31.md
@@ -2,6 +2,7 @@
 layout: post
 title: "Announcing Llama 3.1 Support in vLLM"
 author: "vLLM Team"
+image: /assets/figures/llama31/perf_llama3.png
 ---

 Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K tokens), larger model size (up to 405B parameters), and more advanced model capabilities. The vLLM community has added many enhancements to make sure the longer, larger Llamas run smoothly on vLLM, which includes chunked prefill, FP8 quantization, and pipeline parallelism. We will introduce these new enhancements in this blogpost.
diff --git a/_posts/2024-07-25-lfai-perf.md b/_posts/2024-07-25-lfai-perf.md
index 26ec112..7108acd 100644
--- a/_posts/2024-07-25-lfai-perf.md
+++ b/_posts/2024-07-25-lfai-perf.md
@@ -2,6 +2,7 @@
 layout: post
 title: "vLLM’s Open Governance and Performance Roadmap"
 author: "vLLM Team"
+image: /assets/figures/lfai/vllm-lfai-light.png
 ---

diff --git a/_posts/2024-09-05-perf-update.md b/_posts/2024-09-05-perf-update.md
index 6506044..e26187a 100644
--- a/_posts/2024-09-05-perf-update.md
+++ b/_posts/2024-09-05-perf-update.md
@@ -2,6 +2,7 @@
 layout: post
 title: "vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction"
 author: "vLLM Team"
+image: /assets/figures/perf-v060/llama8B_comparison.png
 ---

 **TL;DR:** vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.
diff --git a/_posts/2024-10-17-spec-decode.md b/_posts/2024-10-17-spec-decode.md
index abcd6b4..02cec08 100644
--- a/_posts/2024-10-17-spec-decode.md
+++ b/_posts/2024-10-17-spec-decode.md
@@ -2,6 +2,7 @@
 layout: post
 title: "How Speculative Decoding Boosts vLLM Performance by up to 2.8x"
 author: "vLLM Team"
+image: /assets/figures/spec-decode/figure9.png
 ---

 Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.
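
Reviewer note: the `image:` front-matter key added to each post above is the field that Jekyll's jekyll-seo-tag plugin (and many themes) read to emit per-page `og:image` / social-preview metadata, which appears to be the intent of this change. As a quick sanity check that each new path resolves to a real file, a helper along these lines could be run from the repository root. This script is a hypothetical sketch, not part of this diff; it assumes PyYAML is installed and that image paths are site-absolute (e.g. `/assets/figures/...`).

```python
#!/usr/bin/env python3
"""Hypothetical reviewer helper (not part of this diff): verify that every
post's `image:` front-matter entry points to a file that exists in the repo.
Assumes it runs from the repository root and that image paths are
site-absolute, e.g. /assets/figures/...."""
from pathlib import Path

import yaml  # PyYAML

repo_root = Path(".")
missing = []

for post in sorted((repo_root / "_posts").glob("*.md")):
    text = post.read_text(encoding="utf-8")
    if not text.startswith("---"):
        continue  # no YAML front matter
    # Front matter sits between the first two `---` fences.
    try:
        _, front, _ = text.split("---", 2)
    except ValueError:
        continue  # unterminated front matter; skip
    image = (yaml.safe_load(front) or {}).get("image")
    if image and not (repo_root / image.lstrip("/")).is_file():
        missing.append(f"{post.name}: {image}")

print("\n".join(missing) if missing else "all image paths resolve")
```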