remove attn_temperature_tuning in default user guide (#49)

Signed-off-by: Lu Fang <fanglu@fb.com>
Lucia Fang authored 2025-04-08 04:03:41 -07:00 (committed by GitHub)
parent 3eb4d4d737
commit e4a43dab00
1 changed file with 4 additions and 6 deletions

@@ -35,7 +35,7 @@ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruc
```
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8 \
- --max-model-len 430000 --override-generation-config='{"attn_temperature_tuning": true}'
+ --max-model-len 430000
```
On 8x H200 GPUs:
@@ -45,7 +45,7 @@ On 8x H200 GPUs:
```
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 8 \
- --max-model-len 3600000 --override-generation-config='{"attn_temperature_tuning": true}'
+ --max-model-len 3600000
```
* Maverick (up to 1M context):
@@ -53,11 +53,9 @@ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruc
```
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8
- --max-model-len 1000000 --override-generation-config='{"attn_temperature_tuning": true}'
+ --max-model-len 1000000
```
- Note: we highly recommend to turn on attn_temperature_tuning to improve accuracy for long contexts longer than 32K tokens, and VLLM_DISABLE_COMPILE_CACHE=1 is required.
**Multimodality:**
The Llama 4 models excel at image understanding up to 8-10 images. By default, vLLM server accepts 1 image per request. Please pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama-4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
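
With the server started with `--limit-mm-per-prompt image=10`, a multi-image request goes through the OpenAI-compatible chat completions endpoint. The following is a minimal sketch, assuming the server listens on the default `http://localhost:8000` and using placeholder image URLs:
```
# Minimal multi-image request sketch; the address and image URLs are placeholders,
# and the server is assumed to have been started with --limit-mm-per-prompt image=10.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is different between these two images?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
      ]
    }]
  }'
```
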
@@ -74,6 +72,7 @@ While more performance enhancements are on the way, we believe the Llama 4 model
* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
+ * **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
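
Taken together, the flags recommended in the bullets above combine into a single launch command. The sketch below reuses the 8x H200 Scout configuration from earlier in this guide purely as an illustration; adjust the model, parallel size, and context length to your own setup:
```
# Illustrative only: Scout on 8x H200 with the fp8 KV cache and
# attn_temperature_tuning recommendations from the bullets above.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 3600000 \
    --kv-cache-dtype fp8 \
    --override-generation-config='{"attn_temperature_tuning": true}'
```
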
**Other Hardware Support & Quantizations:**
@@ -108,4 +107,3 @@ We extend our sincere thanks to the Meta team for their implementation of the mo
We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.
The vLLM team's performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.