# Benchmarking vLLM
This README guides you through running benchmark tests with the extensive datasets supported by vLLM. It's a living document, updated as new features and datasets become available.
## Dataset Overview
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |

- ✅: supported
- 🟡: partial support
- 🚧: to be supported
> **Note**: For HuggingFace datasets, `--dataset-name` should be set to `hf`.
## Example - Online Benchmark
First start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```
Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 10
```
If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
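The percentile rows (P99 TTFT, P99 TPOT, P99 ITL) summarize the per-request latency distribution. As a rough illustration of how such summaries relate to raw samples (this is not the benchmark's code, and the sample values below are made up):

```python
# Illustrative sketch: summarizing per-request TTFT samples as mean/median/P99,
# mirroring the report above. The sample values here are invented.
import numpy as np

ttft_ms = np.array([71.2, 69.8, 74.1, 73.9, 70.5, 72.0, 75.3, 68.9, 73.4, 79.5])
print(f"Mean TTFT (ms):   {ttft_ms.mean():.2f}")
print(f"Median TTFT (ms): {np.median(ttft_ms):.2f}")
print(f"P99 TTFT (ms):    {np.percentile(ttft_ms, 99):.2f}")
```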
### Custom Dataset

If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data needs to be in `.jsonl` format, with a `"prompt"` field per entry, e.g. `data.jsonl`:

```json
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
```bash
# start server
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests

# run benchmarking script
python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint /v1/completions \
    --dataset-name custom \
    --dataset-path <path-to-your-data-jsonl> \
    --custom-skip-chat-template \
    --num-prompts 80 \
    --max-concurrency 1 \
    --temperature=0.3 \
    --top-p=0.75 \
    --result-dir "./log/"
```
You can skip applying the chat template if your data already has it applied, by passing `--custom-skip-chat-template`.
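Since the command above passes `--save-result`, a result JSON is written under `--result-dir`. A minimal sketch for inspecting it follows; the file-name pattern and field names (e.g. `mean_ttft_ms`) are assumptions inferred from the printed metric names, so adjust them to match your actual output:

```python
# Minimal sketch: load the newest benchmark result JSON from ./log/.
# Field names such as "request_throughput" and "mean_ttft_ms" are assumptions
# based on the metrics the benchmark prints; verify against your file.
import glob
import json

result_file = sorted(glob.glob("./log/*.json"))[-1]
with open(result_file) as f:
    result = json.load(f)

for key in ("request_throughput", "mean_ttft_ms", "median_ttft_ms", "p99_ttft_ms"):
    print(key, "=", result.get(key))
```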
### VisionArena Benchmark for Vision Language Models
```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --hf-split train \
    --num-prompts 1000
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```

```bash
python3 benchmarks/benchmark_serving.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
```
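For intuition, the `ngram` method drafts tokens by prompt lookup: it matches the most recent n-gram of the context (between `prompt_lookup_min` and `prompt_lookup_max` tokens long) against earlier positions and proposes the tokens that followed the match, which the model then verifies in a single forward pass. A toy sketch of that idea (illustrative only, not vLLM's implementation):

```python
# Toy sketch of ngram prompt lookup (illustrative, not vLLM's implementation).
def propose_draft(tokens, prompt_lookup_min=2, prompt_lookup_max=5,
                  num_speculative_tokens=5):
    # Prefer longer n-grams: longer matches are more likely to continue alike.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) <= n:
            continue
        ngram = tokens[-n:]
        # Scan earlier context for the same n-gram, most recent match first.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == ngram:
                draft = tokens[start + n:start + n + num_speculative_tokens]
                if draft:
                    return draft  # candidate tokens to verify in one pass
    return []  # no match: fall back to ordinary decoding

print(propose_draft([1, 2, 3, 4, 1, 2, 3]))  # -> [4, 1, 2, 3]
```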
### Other HuggingFaceDataset Examples

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```
#### `lmms-lab/LLaVA-OneVision-Data`

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --hf-split train \
    --hf-subset "chart2text(cauldron)" \
    --num-prompts 10
```
#### `Aeala/ShareGPT_Vicuna_unfiltered`

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
    --hf-split train \
    --num-prompts 10
```
#### `AI-MO/aimo-validation-aime`

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42
```
#### `philschmid/mt-bench`

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80
```
### Running With Sampling Parameters

When using OpenAI-compatible backends such as `vllm`, optional sampling parameters can be specified. Example client command:
```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
    --top-k 10 \
    --top-p 0.9 \
    --temperature 0.5 \
    --num-prompts 10
```
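Under the hood, each benchmark request is an OpenAI-style completion call; `top_k` is not part of the OpenAI schema, so vLLM accepts it as an extra body field. A minimal sketch of an equivalent single request, assuming the server from the online example above is running on the default port 8000:

```python
# Minimal sketch: one completion request with the same sampling parameters,
# sent with the openai client. top_k rides along in extra_body, a vLLM
# extension to the OpenAI completions schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    prompt="San Francisco is a",
    temperature=0.5,
    top_p=0.9,
    max_tokens=64,
    extra_body={"top_k": 10},  # vLLM-specific sampling parameter
)
print(completion.choices[0].text)
```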
## Example - Offline Throughput Benchmark
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name sonnet \
    --dataset-path vllm/benchmarks/sonnet.txt \
    --num-prompts 10
```
If successful, you will see the following output:

```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
```
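The numbers above boil down to requests and tokens divided by wall-clock time. A minimal sketch of the same kind of measurement using vLLM's offline `LLM` API directly (the prompts here are placeholders, not the sonnet dataset):

```python
# Minimal sketch: measure offline throughput with vLLM's LLM API.
# Placeholder prompts stand in for the sonnet dataset.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="NousResearch/Hermes-3-Llama-3.1-8B")
prompts = ["Write a sonnet about the sea."] * 10
params = SamplingParams(temperature=0.8, max_tokens=150)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {len(prompts) / elapsed:.2f} requests/s, "
      f"{output_tokens / elapsed:.2f} output tokens/s")
```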
### VisionArena Benchmark for Vision Language Models
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --hf-split train
```
The `Total num prompt tokens` figure now includes image token counts:

```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```

```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800
```
### Other HuggingFaceDataset Examples
#### `lmms-lab/LLaVA-OneVision-Data`

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --hf-split train \
    --hf-subset "chart2text(cauldron)" \
    --num-prompts 10
```
#### `Aeala/ShareGPT_Vicuna_unfiltered`

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
    --hf-split train \
    --num-prompts 10
```
#### `AI-MO/aimo-validation-aime`

```bash
python3 benchmarks/benchmark_throughput.py \
    --model Qwen/QwQ-32B \
    --backend vllm \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --hf-split train \
    --num-prompts 10
```
### Benchmark with LoRA Adapters

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-2-7b-hf \
    --backend vllm \
    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
    --dataset-name sharegpt \
    --num-prompts 10 \
    --max-loras 2 \
    --max-lora-rank 8 \
    --enable-lora \
    --lora-path yard1/llama-2-7b-sql-lora-test
```
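For reference, the same adapter can be exercised directly through vLLM's offline `LLM` API with a `LoRARequest`; a minimal sketch (the prompt is a placeholder):

```python
# Minimal sketch: offline generation with the LoRA adapter used above,
# attached per request via LoRARequest. The prompt is a placeholder.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_lora_rank=8)

outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("sql-lora", 1, lora_path),
)
print(outputs[0].outputs[0].text)
```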