diff --git a/_posts/2024-10-17-spec-decode.md b/_posts/2024-10-17-spec-decode.md
index 04440f7..abcd6b4 100644
--- a/_posts/2024-10-17-spec-decode.md
+++ b/_posts/2024-10-17-spec-decode.md
@@ -84,7 +84,7 @@ However, selecting the right draft model can be challenging. For example, in mod
-An example of prompt lookup decoding. Given the prompt, we build all 2-grams as the lookup key. The values are the three tokens following the lookup key. During generation, we will check if the current 2-gram matches any key in the lookup table. If so, we will propose the following tokens with the value.
+An example of prompt lookup decoding. Given the prompt, we build all 2-grams as the lookup keys. The values are the three tokens following each key. During generation, we check whether the current 2-gram matches any key; if so, we propose the tokens stored as its value.
Otherwise known as n-gram matching, this approach is effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer. Instead of using a small model to propose tokens, the system speculates based on the information already available in the prompt. This works particularly well when the large model repeats parts of the prompt in its answers.
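A minimal sketch of the lookup-table mechanism described above, assuming 2-gram keys and 3-token values as in the figure; the helper names (`build_ngram_table`, `propose`) are illustrative and not vLLM's actual implementation.

```python
# Illustrative sketch of prompt lookup decoding (n-gram matching).
# Names and structure are hypothetical, not vLLM's actual implementation.

def build_ngram_table(prompt_tokens, n=2, k=3):
    """Map every n-gram in the prompt to the k tokens that follow it."""
    table = {}
    for i in range(len(prompt_tokens) - n):
        key = tuple(prompt_tokens[i:i + n])
        value = prompt_tokens[i + n:i + n + k]
        # Keep the first occurrence; a real system might keep the last or all.
        table.setdefault(key, value)
    return table

def propose(generated_tokens, table, n=2):
    """Propose draft tokens if the last n generated tokens match a key."""
    key = tuple(generated_tokens[-n:])
    return table.get(key, [])  # empty proposal if no match

# Usage: integer ids stand in for real tokenizer output.
prompt = [5, 8, 13, 8, 13, 21, 34, 55]
table = build_ngram_table(prompt)
print(propose([2, 8, 13], table))  # -> [8, 13, 21]
```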
@@ -95,7 +95,7 @@ Otherwise known as n-gram matching, this approach is effective for use cases lik
Picture from https://github.com/FasterDecoding/Medusa.
-In the example above, three heads are used to propose tokens for the following three positions. Head 1 is proposing ["is", "\'", "the"] for the first position. Head 2 is proposing ["difficult", "is", "\'"] for the second position. Head 3 is proposing ["not", "difficult", "a"] for the third position. All heads take the output of the last transformer block as the input.
+In the example, three heads are used to propose tokens for the next three positions. Head 1 proposes ["is", "\'", "the"] for the first position, Head 2 proposes ["difficult", "is", "\'"] for the second, and Head 3 proposes ["not", "difficult", "a"] for the third. All heads take the output of the last transformer block as input.
In this method, additional layers (or heads) are added to the large model itself, allowing it to predict multiple tokens in a single forward pass. This reduces the need for a separate draft model, instead leveraging the large model’s own capacity for parallel token generation. Though preliminary, this method shows promise for improving efficiency as more optimized kernels are developed.
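A rough structural sketch of the multi-head idea, assuming one linear projection per future position on top of the last hidden state; real Medusa heads add residual blocks and verify proposals with tree attention, so this only illustrates the shape of the approach.

```python
# Structural sketch of Medusa-style speculative heads (not the real Medusa code).
# Each extra head predicts the token at position t + 1 + head_index from the
# same last-layer hidden state that the normal LM head uses for position t + 1.
import torch
import torch.nn as nn

class SpeculativeHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        # One extra projection per future position; real Medusa heads place a
        # residual block before the projection.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
        )

    def forward(self, last_hidden_state: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden_state: [batch, hidden_size] for the current position.
        return [head(last_hidden_state) for head in self.heads]

# Usage: propose one greedy token per future position from a dummy hidden state.
hidden = torch.randn(1, 4096)
heads = SpeculativeHeads(hidden_size=4096, vocab_size=32000)
draft_token_ids = [logits.argmax(dim=-1) for logits in heads(hidden)]
```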
@@ -116,7 +116,7 @@ Speculative decoding offers significant performance benefits in **low-QPS (queri
-Performance comparison showing spec decode delivering up to 1.5x Speedup at QPS=1 Llama3-70B on ShareGPT with 4xH100 using draft model (turboderp/Qwama-0.5B-Instruct) and up to 2.8x Speedup at QPS=1 Llama3-70B on CNN Dailymail with 4xH100 using n-grams
+Performance comparison showing speculative decoding delivering up to a 1.5x speedup at QPS=1 for Llama3-70B on ShareGPT with 4xH100 using a draft model (turboderp/Qwama-0.5B-Instruct), and up to a 2.8x speedup at QPS=1 for Llama3-70B on CNN/DailyMail with 4xH100 using n-grams.
However, in **high-QPS environments**, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound, as seen when the number of requests per second increases. In such cases, the overhead of speculative decoding can outweigh its benefits, leading to reduced performance.
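For reference, the two benchmark configurations above (draft-model and n-gram speculation) map onto vLLM's speculative decoding options. The snippet below is a sketch assuming the vLLM API at the time of this post, where speculation is configured on the `LLM` constructor; `num_speculative_tokens=4` and the Llama 3 checkpoint name are illustrative choices, not the exact benchmark settings, and parameter names may differ in other vLLM versions.

```python
from vllm import LLM, SamplingParams

# Draft-model speculation: a small model proposes tokens, the 70B target verifies.
# (Only one engine would be loaded per process; both are shown to contrast the modes.)
draft_llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative checkpoint name
    tensor_parallel_size=4,
    speculative_model="turboderp/Qwama-0.5B-Instruct",
    num_speculative_tokens=4,                      # illustrative, not the benchmark value
)

# Prompt-lookup (n-gram) speculation: no draft model, proposals come from the prompt.
ngram_llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="[ngram]",
    num_speculative_tokens=4,
    ngram_prompt_lookup_max=4,
)

outputs = ngram_llm.generate(
    ["Summarize the following article: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
```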
diff --git a/assets/figures/spec-decode/figure2.png b/assets/figures/spec-decode/figure2.png
index d0973b8..912b5c1 100644
Binary files a/assets/figures/spec-decode/figure2.png and b/assets/figures/spec-decode/figure2.png differ
diff --git a/assets/figures/spec-decode/figure3.png b/assets/figures/spec-decode/figure3.png
index 8dda188..7243f88 100644
Binary files a/assets/figures/spec-decode/figure3.png and b/assets/figures/spec-decode/figure3.png differ