retro add images
parent 7588b8bbc2
commit 4577c6ac65
@@ -3,6 +3,7 @@ layout: post
 title: "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention"
 author: "Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution)"
 extra: "<br><p align=\"center\"><picture><img src=\"/assets/logos/vllm-logo-text-light.png\" width=\"65%\"></picture></p><br>"
+image: /assets/logos/vllm-logo-text-light.png
 ---
 <p align="center" style="margin-top:-15px">
 <a href="https://github.com/vllm-project/vllm"><b>GitHub</b></a> | <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://arxiv.org/pdf/2309.06180.pdf"><b>Paper</b></a>
@@ -2,6 +2,7 @@
 layout: post
 title: "Notes on vLLM v.s. DeepSpeed-FastGen"
 author: "vLLM Team"
+image: /assets/figures/notes-vllm-vs-deepspeed/s2.png
 ---
 
 ---
@@ -2,6 +2,7 @@
 layout: post
 title: "Announcing Llama 3.1 Support in vLLM"
 author: "vLLM Team"
+image: /assets/figures/llama31/perf_llama3.png
 ---
 
 Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K tokens), larger model size (up to 405B parameters), and more advanced model capabilities. The vLLM community has added many enhancements to make sure the longer, larger Llamas run smoothly on vLLM, which includes chunked prefill, FP8 quantization, and pipeline parallelism. We will introduce these new enhancements in this blogpost.
@@ -2,6 +2,7 @@
 layout: post
 title: "vLLM’s Open Governance and Performance Roadmap"
 author: "vLLM Team"
+image: /assets/figures/lfai/vllm-lfai-light.png
 ---
 
 
@@ -2,6 +2,7 @@
 layout: post
 title: "vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction"
 author: "vLLM Team"
+image: /assets/figures/perf-v060/llama8B_comparison.png
 ---
 
 **TL;DR:** vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.
@@ -2,6 +2,7 @@
 layout: post
 title: "How Speculative Decoding Boosts vLLM Performance by up to 2.8x"
 author: "vLLM Team"
+image: /assets/figures/spec-decode/figure9.png
 ---
 
 Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.
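Each hunk adds the same kind of `image:` key to a post's Jekyll front matter, retroactively giving every post its own preview image. For context only: a Jekyll layout typically reads this field through `page.image` when emitting social-preview metadata. The snippet below is a minimal sketch of such usage; the include name and the fallback value are assumptions for illustration, not taken from this repository's actual layout.

```liquid
{%- comment -%}
  Hypothetical _includes/head.html fragment (not from this repo): use the post's
  `image` front-matter value as the preview image, with a site-wide fallback.
{%- endcomment -%}
{% assign preview_image = page.image | default: "/assets/logos/vllm-logo-text-light.png" %}
<meta property="og:image" content="{{ preview_image | absolute_url }}">
<meta name="twitter:image" content="{{ preview_image | absolute_url }}">
```

With a consumer like this in place, the six posts touched here would each surface their own figure or logo when shared, rather than a generic default.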