LiuXiaoxuanPKU 2024-10-22 11:39:37 -07:00
parent c02becf2bd
commit 5c940c665f
3 changed files with 3 additions and 3 deletions


@@ -84,7 +84,7 @@ However, selecting the right draft model can be challenging. For example, in mod
<picture>
<img src="/assets/figures/spec-decode/figure3.png" width="80%">
</picture><br>
- An example of prompt lookup decoding. Given the prompt, we build all 2-grams as the lookup key. The values are the three tokens following the lookup key. During generation, we will check if the current 2-gram matches any key in the lookup table. If so, we will propose the following tokens with the value.
+ An example of prompt lookup decoding. Given the prompt, we build all 2-grams as lookup keys; the value for each key is the three tokens that follow it. During generation, we check whether the current 2-gram matches any key. If so, we propose the tokens stored in the value.
</p>
Otherwise known as n-gram matching, this approach is effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer. Instead of using a small model to propose tokens, the system speculates based on the information already available in the prompt. This works particularly well when the large model repeats parts of the prompt in its answers.
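To make the lookup table concrete, here is a minimal, self-contained Python sketch of the proposal step (function names and the 2-gram/3-token parameters are illustrative, not vLLM's implementation):

```python
from typing import Dict, List, Tuple

def build_lookup(prompt_tokens: List[int], n: int = 2, k: int = 3) -> Dict[Tuple[int, ...], List[int]]:
    """Map every n-gram in the prompt to the k tokens that follow it."""
    table: Dict[Tuple[int, ...], List[int]] = {}
    for i in range(len(prompt_tokens) - n):
        key = tuple(prompt_tokens[i:i + n])
        # Keep the first occurrence; a real system might prefer other policies.
        table.setdefault(key, prompt_tokens[i + n:i + n + k])
    return table

def propose(generated: List[int], table: Dict[Tuple[int, ...], List[int]], n: int = 2) -> List[int]:
    """Propose draft tokens if the most recent n-gram appeared in the prompt."""
    key = tuple(generated[-n:])
    return table.get(key, [])  # empty list -> fall back to normal decoding

# Toy example: the prompt repeats itself, so the lookup hits.
prompt = [5, 8, 13, 21, 5, 8, 13, 34]
table = build_lookup(prompt)
print(propose([99, 5, 8], table))  # -> [13, 21, 5]
```

The proposed tokens are then verified by the large model in a single forward pass, exactly as with a draft model; on a miss, generation proceeds one token at a time.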
@@ -95,7 +95,7 @@ Otherwise known as n-gram matching, this approach is effective for use cases lik
<img src="/assets/figures/spec-decode/figure4.png" width="60%">
</picture><br>
<i>Picture from https://github.com/FasterDecoding/Medusa</i>.
- In the example above, three heads are used to propose tokens for the following three positions. Head 1 is proposing ["is", "\'", "the"] for the first position. Head 2 is proposing ["difficult", "is", "\'"] for the second position. Head 3 is proposing ["not", "difficult", "a"] for the third position. All heads take the output of the last transformer block as the input.
+ In the example, three heads propose tokens for the next three positions. Head 1 proposes ["is", "\'", "the"] for the first position, Head 2 proposes ["difficult", "is", "\'"] for the second, and Head 3 proposes ["not", "difficult", "a"] for the third. All heads take the output of the last transformer block as input.
</p>
In this method, additional layers (or heads) are added to the large model itself, allowing it to predict multiple tokens in a single forward pass. This reduces the need for a separate draft model, instead leveraging the large model's own capacity for parallel token generation. Though preliminary, this method shows promise for improving efficiency as more optimized kernels are developed.
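As a rough illustration, the sketch below implements Medusa-style heads as plain linear layers over the final hidden state (the actual Medusa heads are residual MLPs, and tree-based verification is omitted; all names here are hypothetical):

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """K extra heads; head i predicts the token (i+1) positions ahead,
    all reading the same last-hidden-state of the base model."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, last_hidden: torch.Tensor) -> list:
        # last_hidden: [batch, hidden_size] from the final transformer block.
        # Returns one logits tensor per future position.
        return [head(last_hidden) for head in self.heads]

heads = MedusaStyleHeads(hidden_size=4096, vocab_size=32000)
hidden = torch.randn(1, 4096)
# Top-3 candidates per future position, mirroring the figure above.
proposals = [logits.topk(3).indices for logits in heads(hidden)]
```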
@@ -116,7 +116,7 @@ Speculative decoding offers significant performance benefits in **low-QPS (queri
&nbsp; &nbsp;
<img src="/assets/figures/spec-decode/figure6.png" width="48%">
</picture><br>
- Performance comparison showing spec decode delivering up to 1.5x Speedup at QPS=1 Llama3-70B on ShareGPT with 4xH100 using draft model (turboderp/Qwama-0.5B-Instruct) and up to 2.8x Speedup at QPS=1 Llama3-70B on CNN Dailymail with 4xH100 using n-grams
+ Performance comparison showing spec decode delivering up to 1.5x speedup at QPS=1 for Llama3-70B on ShareGPT with 4xH100 using a draft model (turboderp/Qwama-0.5B-Instruct), and up to 2.8x speedup at QPS=1 for Llama3-70B on CNN/DailyMail with 4xH100 using n-grams.
</p>
However, in **high-QPS environments**, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound, as seen when the number of requests per second increases. In such cases, the overhead of speculative decoding can outweigh its benefits, leading to reduced performance.
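For reference, the two setups benchmarked above can be enabled in vLLM along these lines (a sketch based on vLLM's speculative decoding arguments around this release; `num_speculative_tokens=4` is an assumed value, so check your installed version's docs):

```python
from vllm import LLM

# Draft-model speculation: Llama3-70B verifies tokens from a small draft model.
llm_draft = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,                          # 4xH100, as in the benchmark
    speculative_model="turboderp/Qwama-0.5B-Instruct",
    num_speculative_tokens=4,                        # assumed draft length
)

# N-gram (prompt lookup) speculation: no draft model needed.
llm_ngram = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="[ngram]",
    num_speculative_tokens=4,
    ngram_prompt_lookup_max=4,                       # max n-gram size to match
)
```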

Binary file changed (not shown): 242 KiB → 199 KiB

Binary file changed (not shown): 186 KiB → 148 KiB