Merge pull request #3 from vllm-project/fastgen

FastGen
2023-11-14 15:00:16 -08:00 · 2023-11-14 15:00:16 -08:00 · 98626b451c
parent 73685a63a9 1d9d5b235d
commit 98626b451c
1 changed file with 13 additions and 13 deletions
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -1,14 +1,14 @@
 ---
 layout: post
-title: "Notes on vLLM v.s. DeepSpeed"
+title: "Notes on vLLM v.s. DeepSpeed-FastGen"
 author: "vLLM Team"
 ---

 ---
 **TL;DR:**

- vLLM matches DeepSpeed's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
+- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
+- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
 - vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.

 ---
@@ -16,21 +16,21 @@ author: "vLLM Team"
 The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
 We are happy to see the technology advancements from the open-source community.
 In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
-For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed MII.
+For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.


 ### Performance Benchmark

-We've identified two key differences between vLLM and DeepSpeed in terms of performance optimization:
+We've identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:

-1. **DeepSpeed adopts a conservative/suboptimal memory allocation scheme**, which wastes memory when output lengths are large.
-2. DeepSpeed’s Dynamic SplitFuse scheduling gives **speedup only when prompt lengths are much greater than output lengths**.
+1. **DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme**, which wastes memory when output lengths are large.
+2. DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives **speedup only when prompt lengths are much greater than output lengths**.

-As a result, DeepSpeed outperforms when the workload is consistently long prompt and short output.
+As a result, DeepSpeed-FastGen outperforms when the workload is consistently long prompt and short output.
 In other scenarios, vLLM shows superior performance.

 #### Scenario 1: Long Prompt Length, Short Output
-Here, DeepSpeed's Dynamic SplitFuse scheduling is expected to shine.
+Here, DeepSpeed-FastGen's Dynamic SplitFuse scheduling is expected to shine.
 However, the performance gain we observe isn't as significant as 2x.

 <p align="center">
@@ -40,7 +40,7 @@ However, the performance gain we observe isn't as significant as 2x.
 </p>

 #### Scenario 2: Other cases
-In these cases, vLLM is up to **1.8x** faster than DeepSpeed.
+In these cases, vLLM is up to **1.8x** faster than DeepSpeed-FastGen.

 <p align="center">
 <picture>
@@ -58,10 +58,10 @@ Specifically for the Dynamic SplitFuse optimization, we are actively investigati

 ### Appendix: Feature Comparison

-DeepSpeed currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
-We do expect the DeepSpeed open source are eager to catch up and we welcome the creative innovation in the market!
+DeepSpeed-FastGen currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
+We do expect the DeepSpeed-FastGen team to be eager to catch up, and we welcome the creative innovation in the market!

-|                            |                   vLLM                  |                    DeepSpeed                    |
+|                            |                   vLLM                  |                DeepSpeed-FastGen                |
 |----------------------------|:---------------------------------------:|:-----------------------------------------------:|
 | Runtime                    | Python/PyTorch                          | Python/PyTorch                                  |
 | Model implementation       | HuggingFace Transformers                | Custom implementation + converter for HF models |