llama4 post (#47)

* llama4 post

Signed-off-by: simon-mo <simon.mo@hey.com>

* 8 images

Signed-off-by: simon-mo <simon.mo@hey.com>

* comments

Signed-off-by: simon-mo <simon.mo@hey.com>

* charlotte edits

Signed-off-by: simon-mo <simon.mo@hey.com>

* remove dup paragraphs

Signed-off-by: Roger Wang <ywang@roblox.com>

* add multimodal

Signed-off-by: Roger Wang <ywang@roblox.com>

* update model id

Signed-off-by: Roger Wang <ywang@roblox.com>

* point to 0.8.3 branch

Signed-off-by: Roger Wang <ywang@roblox.com>

---------

Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Simon Mo committed 2025-04-05 23:44:57 -07:00 (committed by GitHub)
parent 79dc7969a9
commit de34b1ce54
2 changed files with 108 additions and 0 deletions

_posts/2025-04-05-llama4.md (new file, 108 lines added)
@@ -0,0 +1,108 @@
---
layout: post
title: "Llama 4 in vLLM"
author: "The vLLM Team"
image: /assets/figures/llama4/perf.png
thumbnail-img: /assets/figures/llama4/perf.png
share-img: /assets/figures/llama4/perf.png
---
We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multimodal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to v0.8.3 or later:
```
pip install -U vllm
```
Below, you'll find sample commands to get started. Alternatively, you can replace the CLI command with docker run ([instructions here](https://docs.vllm.ai/en/latest/deployment/docker.html)) or use our Pythonic interface, the [`LLM` class](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference), for local batch inference. We also recommend checking out the [demo from the Meta team](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb) showcasing the 1M long context capability with vLLM.
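If you prefer the Pythonic `LLM` class, a minimal offline batch-inference sketch looks roughly like this (the prompts, sampling settings, and reduced `max_model_len` are illustrative; adjust `tensor_parallel_size` to your hardware):
```
from vllm import LLM, SamplingParams

# Illustrative engine settings: 8-way tensor parallelism, reduced context window.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    max_model_len=32768,
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = [
    "Summarize the key ideas behind mixture-of-experts models.",
    "Write a haiku about long-context language models.",
]

# Batched generation: one output per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```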
## Usage Guide
Here's how you can serve the Llama 4 models on different hardware configurations; a sample client request follows the serving commands below.
Using 8x H100 GPUs, vLLM can serve Scout with 1M context and Maverick with about 430K. See the tips further down for improving performance and leveraging long context.
On 8x H100 GPUs:
* Scout (up to 1M context):
```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 8 \
--max-model-len 1000000
```
* Maverick (up to \~430K context):
```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8 \
--max-model-len 430000
```
On 8x H200 GPUs:
* Scout (up to 3.6M context):
```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 8 \
--max-model-len 3600000
```
* Maverick (up to 1M context):
```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8
```
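Once one of the servers above is running, it exposes an OpenAI-compatible API (on port 8000 by default). Here is a minimal client sketch using the `openai` Python package; the base URL, placeholder API key, and prompt are illustrative:
```
from openai import OpenAI

# The API key is unused unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Give a one-paragraph overview of rotary position embeddings."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```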
**Multimodality:**
The Llama 4 models excel at image understanding with up to 8-10 images. By default, the vLLM server accepts one image per request. Pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with the OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama 4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
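For reference, a multi-image request through the OpenAI-compatible API passes several `image_url` content parts in one user message. This sketch uses placeholder image URLs and assumes the server was launched with `--limit-mm-per-prompt image=10` as described above:
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder image URLs; up to 10 images per request with the flag above.
image_urls = [
    "https://example.com/chart1.png",
    "https://example.com/chart2.png",
]
content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
content.append({"type": "text", "text": "Compare the trends shown in these two charts."})

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": content}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```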
**Performance:**
With the configurations above, we observe the following output tokens/s for Scout-BF16 and Maverick-FP8:
![](/assets/figures/llama4/perf.png)
While more performance enhancements are on the way, we believe the Llama 4 models' efficient architecture and relatively small size make them practical for scaled usage today.
**Tips for Performance and Long Context:**
* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting; a Python sketch applying it follows this list.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
* **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
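As a rough sketch, the FP8 KV-cache tip can also be applied through the Python `LLM` interface; the model choice and context length here are illustrative, and `kv_cache_dtype="fp8"` mirrors the `--kv-cache-dtype fp8` server flag:
```
from vllm import LLM

# Illustrative: FP8 KV cache to stretch the usable context window.
# For >32K contexts when serving, also add the
# --override-generation-config='{"attn_temperature_tuning": true}' flag above.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    max_model_len=1_000_000,
    kv_cache_dtype="fp8",
)
```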
**Other Hardware Support & Quantizations:**
* A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
* INT4: An INT4-quantized version of the Scout model checkpoint that fits on a single H100 GPU is currently a work in progress. Stay tuned for updates.
* AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above.
**Inference Accuracy Validation:**
We validated inference accuracy against the official Meta report using lm-eval-harness. Here are the results for [meta-llama/Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct):
| Setup | MMLU Pro | ChartQA |
|-----------------|----------|---------|
| Reported | 80.5 | 90 |
| H100 FP8 | 80.4 | 89.4 |
| AMD MI300X BF16 | 80.4 | 89.4 |
| H200 BF16 | 80.2 | 89.3 |
## Efficient Architecture and Cluster Scale Serving
Llama 4's model architecture is particularly well-suited for efficient long-context inference, thanks to features like:
* **Mixture of Experts (MoE):** Scout uses 16 experts (17B activated parameters), and Maverick uses 128 experts (17B activated parameters). Only one expert is activated per token, maintaining efficiency.
* **Interleaved RoPE (iRoPE):** Llama 4 interleaves global attention (without RoPE) with chunked local attention (with RoPE) in a 1:3 ratio. The local attention layers attend to tokens in non-overlapping chunks, significantly reducing the quadratic complexity of attention as context length scales; see the illustrative sketch after this list.
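To make the interleaving pattern concrete, here is a small illustrative sketch (not vLLM's implementation; the chunk size, which layers are global, and the mask construction are assumptions chosen only to mirror the 1:3 global-to-local ratio described above):
```
import numpy as np

def is_global_layer(layer_idx: int, period: int = 4) -> bool:
    # Assumption: one global-attention (no-RoPE) layer for every three
    # chunked local-attention (RoPE) layers, i.e. a 1:3 ratio.
    return layer_idx % period == period - 1

def chunked_local_mask(seq_len: int, chunk_size: int) -> np.ndarray:
    # Causal mask restricted to non-overlapping chunks: token i may only
    # attend to earlier tokens within its own chunk.
    pos = np.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal

# Toy example: with chunk_size=4, token 6 attends only to tokens 4-6.
print(chunked_local_mask(seq_len=8, chunk_size=4).astype(int))
print(["global" if is_global_layer(i) else "local" for i in range(8)])
```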
vLLM recently launched the [V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), delivering major performance speedups on single nodes, along with native `torch.compile` support. Our [Q2 roadmap](https://github.com/vllm-project/vllm/issues/15735) focuses on enhancing vLLM's multi-node scaling capabilities, aiming for disaggregated, cluster-scale serving. We are actively adding support for efficient expert parallelism, multi-node data parallelism, and cluster-wide prefill disaggregation.
## Acknowledgement
We extend our sincere thanks to the Meta team for their implementation of the model architecture, extensive accuracy evaluation, and performance benchmarking: [Lucia (Lu) Fang](https://github.com/luccafong), [Ye (Charlotte) Qi](https://github.com/yeqcharlotte), [Lu Fang](https://github.com/houseroad), [Yang Chen](https://github.com/chenyang78), [Zijing Liu](https://github.com/liuzijing2014), [Yong Hoon Shin](https://github.com/sarckk), [Zhewen Li](https://github.com/zhewenl), [Jon Swenson](https://github.com/jmswen), [Kai Wu](https://github.com/wukaixingxp), [Xiaodong Wang](https://github.com/xw285cornell), [Shiyan Deng](https://github.com/842974287), [Wenchen Wang](https://github.com/wangwenchen0407), [Lai Wei](https://github.com/roywei), [Matthias Reso](https://github.com/mreso), [Chris Thi](https://github.com/cthi), [Keyun Tong](https://github.com/youngkent), [Jinho Hwang](https://github.com/jinhohwang-meta), [Driss Guessous](https://github.com/drisspg), [Aston Zhang](https://github.com/astonzhang).
We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.
The vLLM team's performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.

assets/figures/llama4/perf.png: new binary image file (514 KiB), not shown.