From cc0466fe0fea768d2438b385f26a3e1aecbd8d47 Mon Sep 17 00:00:00 2001
From: tunjiantan
Date: Wed, 30 Oct 2024 16:04:59 +0000
Subject: [PATCH 1/2] amend benchmark command

Signed-off-by: tunjiantan
---
 _posts/2024-10-23-vllm-serving-amd.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2024-10-23-vllm-serving-amd.md b/_posts/2024-10-23-vllm-serving-amd.md
index 7a7ad94..cc42986 100644
--- a/_posts/2024-10-23-vllm-serving-amd.md
+++ b/_posts/2024-10-23-vllm-serving-amd.md
@@ -335,5 +335,5 @@ We have built the ROCm compatible vLLM docker from Dockerfile.rocm found in the
 | ------------- | ------------- |
 | vLLM Default Configuration | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-num-seqs 1024 --max-num-batched-tokens 1024 ` |
 | TGI Default Configuration | `ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --model-id Llama-3.1-405B-Instruct` |
-| vLLM (This Guide) | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct-FP8 -tp 8 --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-step 15 --max-num-seqs 1024 ` |
+| vLLM (This Guide) | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-step 15 --max-num-seqs 1024 ` |
 | TGI (This Guide) | `ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --max-total-tokens 131072 --max-input-tokens 131000 --model-id Llama-3.1-405B-Instruct` |
\ No newline at end of file

From 9769c02a655944aa1c9374c121c5cfcebf5498fd Mon Sep 17 00:00:00 2001
From: tunjiantan
Date: Wed, 30 Oct 2024 16:09:43 +0000
Subject: [PATCH 2/2] amend data type

Signed-off-by: tunjiantan
---
 _posts/2024-10-23-vllm-serving-amd.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_posts/2024-10-23-vllm-serving-amd.md b/_posts/2024-10-23-vllm-serving-amd.md
index cc42986..eb77785 100644
--- a/_posts/2024-10-23-vllm-serving-amd.md
+++ b/_posts/2024-10-23-vllm-serving-amd.md
@@ -14,7 +14,7 @@ author: "Guest Post by Embedded LLM and Hot Aisle Inc."
-vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (FP16, 32 QPS).
+vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (BF16, 32 QPS).
@@ -24,7 +24,7 @@ vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (FP16, 32 Q
-vLLM vs. TGI performance comparison for Llama 3.1 70B on 8 x MI300X (FP16, 32 QPS).
+vLLM vs. TGI performance comparison for Llama 3.1 70B on 8 x MI300X (BF16, 32 QPS).
 ### Introduction
@@ -49,7 +49,7 @@ Even in the default configuration, vLLM shows superior performance compared to T
-vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (FP16, QPS 16, 32, 1000; see Appendix for commands).
+vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (BF16, QPS 16, 32, 1000; see Appendix for commands).
 ### How to run vLLM with Optimal Performance
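
For context, the amended "vLLM (This Guide)" command from the appendix table can be exercised end to end with vLLM's bundled serving benchmark client, `benchmarks/benchmark_serving.py`. The sketch below assumes the server is reachable on the default port; the dataset, prompt count, and request rate are illustrative assumptions chosen to mirror the 32 QPS setting in the figure captions, not necessarily the post's exact benchmark invocation.

```bash
# Start the server with the amended configuration (BF16 Llama 3.1 405B, 8-way tensor parallel).
VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 \
vllm serve Llama-3.1-405B-Instruct -tp 8 \
  --max-seq-len-to-capture 16384 \
  --enable-chunked-prefill=False \
  --num-scheduler-step 15 \
  --max-num-seqs 1024

# From a second shell, drive load against the running server at ~32 requests per second.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Llama-3.1-405B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --request-rate 32 \
  --num-prompts 1000
```

Swapping `--backend vllm` for `--backend tgi` and pointing the client at a server started with the "TGI (This Guide)" command from the same table would, under the same assumptions, produce the TGI side of the comparison at an identical request rate.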