---
layout: post
title: "Notes on vLLM v.s. DeepSpeed-FastGen"
author: "vLLM Team"
---

---
**TL;DR:**

- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.

---

The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technology advancements from the open-source community.
In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.


### Performance Benchmark

We've identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:

1. **DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme**, which wastes memory when output lengths are large (see the rough sketch below).
2. DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives **speedup only when prompt lengths are much greater than output lengths** (see the scheduling sketch below).

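The first point can be made concrete with a small, hypothetical back-of-the-envelope sketch. The "reserve for the worst case" allocator below is an illustrative stand-in rather than DeepSpeed-FastGen's actual implementation, and the block size and lengths are made-up numbers; it only shows why on-demand block allocation (in the spirit of vLLM's PagedAttention) keeps more memory free for batching as outputs grow.

```python
# Hypothetical comparison of KV-cache slots held per sequence.
# Numbers are illustrative, not measurements from either system.

def reserve_to_max(prompt_len: int, max_output_len: int) -> int:
    """Conservative scheme: reserve room for the longest possible output up front."""
    return prompt_len + max_output_len

def paged_blocks(prompt_len: int, generated_so_far: int, block_size: int = 16) -> int:
    """Block-based scheme: allocate fixed-size blocks on demand; waste stays under one block."""
    total_tokens = prompt_len + generated_so_far
    return -(-total_tokens // block_size) * block_size  # ceil to whole blocks

prompt_len, generated_so_far, max_output_len = 512, 64, 2048

print(reserve_to_max(prompt_len, max_output_len))   # 2560 slots held from step one
print(paged_blocks(prompt_len, generated_so_far))   # 576 slots held at this point
# Fewer slots held per sequence leaves room for larger batches, hence higher throughput.
```
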
As a result, DeepSpeed-FastGen outperforms when the workload consistently has long prompts and short outputs.
In other scenarios, vLLM shows superior performance.

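To see why this style of scheduling only pays off for long prompts with short outputs, here is a deliberately simplified, hypothetical sketch of a Dynamic SplitFuse-style scheduler. The token budget, function names, and request IDs are our own illustrative assumptions, not code from either project.

```python
# Hypothetical sketch: keep every forward pass near a fixed token budget by
# splitting long prompts into chunks and fusing them with decode tokens.
from collections import deque

TOKEN_BUDGET = 512  # illustrative per-step budget, not a real config value

def schedule_step(pending_prompts: deque, decoding_seqs: list) -> list:
    """Build one batch: one token per decoding sequence, then prompt chunks."""
    batch = [(seq_id, 1) for seq_id in decoding_seqs]        # decode tokens
    budget = TOKEN_BUDGET - len(decoding_seqs)
    while pending_prompts and budget > 0:
        seq_id, remaining = pending_prompts[0]
        chunk = min(remaining, budget)                        # split long prompts
        batch.append((seq_id, chunk))
        budget -= chunk
        if chunk == remaining:
            pending_prompts.popleft()                         # prompt fully prefilled
        else:
            pending_prompts[0] = (seq_id, remaining - chunk)
    return batch

# With long prompts and few decodes, every step stays uniformly full (the favorable case).
# With long outputs, most steps are dominated by one-token decodes anyway, so the
# chunking adds little over continuous batching.
print(schedule_step(deque([("req-0", 2000)]), ["req-1", "req-2"]))
```
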
#### Scenario 1: Long Prompt Length, Short Output
Here, DeepSpeed-FastGen's Dynamic SplitFuse scheduling is expected to shine.
However, the performance gain we observe isn't as significant as 2x.

<p align="center">
</p>

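For reference, a long-prompt / short-output workload like this one can be driven through vLLM's offline API roughly as follows. The model name, prompt construction, and lengths are placeholder assumptions, not the exact benchmark configuration.

```python
# Rough sketch of a long-prompt / short-output run with vLLM's offline API.
# Model, prompts, and lengths are placeholders, not the benchmark settings above.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed model for illustration

long_prompt = "Summarize the following document. " + "lorem ipsum " * 400
prompts = [long_prompt] * 64  # a batch of long, similarly sized prompts

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=32,    # short outputs, as in this scenario
    ignore_eos=True,  # fix the output length for an apples-to-apples comparison
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```
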
#### Scenario 2: Other cases
In these cases, vLLM is up to **1.8x** faster than DeepSpeed-FastGen.

<p align="center">
<picture>

### Appendix: Feature Comparison

DeepSpeed-FastGen currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We expect the DeepSpeed-FastGen team to catch up quickly, and we welcome creative innovation in the market!

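For illustration, the stop strings and parallel sampling mentioned above are exposed through vLLM's `SamplingParams`; the values below are arbitrary examples, and the beam search flag reflects the API at the time of writing.

```python
# Examples of the features referenced above, via vLLM's SamplingParams.
# Concrete values are arbitrary illustrations.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed model for illustration

# Stop strings: cut generation when any of these substrings appears.
stop_params = SamplingParams(temperature=0.8, max_tokens=256, stop=["\nUser:", "###"])

# Parallel sampling: return several candidate completions per prompt.
parallel_params = SamplingParams(n=4, temperature=0.8, max_tokens=256)

# Beam search (API at the time of writing): deterministic search over candidate beams.
beam_params = SamplingParams(n=4, best_of=4, use_beam_search=True, temperature=0.0)

for params in (stop_params, parallel_params, beam_params):
    outputs = llm.generate(["Explain paged attention in one paragraph."], params)
    print([candidate.text[:60] for candidate in outputs[0].outputs])
```
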
| | vLLM | DeepSpeed-FastGen |
|----------------------------|:---------------------------------------:|:-----------------------------------------------:|
| Runtime | Python/PyTorch | Python/PyTorch |
| Model implementation | HuggingFace Transformers | Custom implementation + converter for HF models |