# Benchmarking vLLM

This README guides you through running benchmark tests with the extensive datasets supported on vLLM. It’s a living document, updated as new features and datasets become available.

## Dataset Overview
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | | | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| BurstGPT | | | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet | | | Local file: `benchmarks/sonnet.txt` |
| Random | | | `synthetic` |
| HuggingFace-VisionArena | | | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-InstructCoder | | | `likaixin/InstructCoder` |
| HuggingFace-AIMO | | | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | | | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| Custom | | | Local file: `data.jsonl` |
✅: supported
🟡: Partial support
🚧: to be supported

**Note**: For HuggingFace datasets, `--dataset-name` should be set to `hf`.

---
## 🚀 Example - Online Benchmark
First start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```

Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 10
```

If successful, you will see the following output:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```

**Custom Dataset**

If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data needs to be in `.jsonl` format and must contain a "prompt" field in each entry, e.g. `data.jsonl`:

```
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```

```bash
# start server
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests
```

```bash
# run benchmarking script
python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint /v1/completions \
    --dataset-name custom \
    --dataset-path <path to data.jsonl> \
    --custom-skip-chat-template \
    --num-prompts 80 \
    --max-concurrency 1 \
    --temperature=0.3 \
    --top-p=0.75 \
    --result-dir "./log/"
```

You can skip applying the chat template if your data already has it by using `--custom-skip-chat-template`.
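If your prompts start out as a plain text file with one prompt per line, a small sketch like the following can produce the `data.jsonl` layout shown above (`prompts.txt` is an assumed input name for illustration):

```bash
# Convert prompts.txt (assumed name; one prompt per line) into data.jsonl,
# writing one {"prompt": ...} JSON object per line.
: > data.jsonl
while IFS= read -r prompt || [ -n "$prompt" ]; do
  [ -z "$prompt" ] && continue  # skip blank lines
  # json.dumps handles quoting and escaping of the prompt text
  python3 -c 'import json, sys; print(json.dumps({"prompt": sys.argv[1]}))' "$prompt" >> data.jsonl
done < prompts.txt
```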
**VisionArena Benchmark for Vision Language Models**

```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --hf-split train \
    --num-prompts 1000
```

**InstructCoder Benchmark with Speculative Decoding**

```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
```

```bash
python3 benchmarks/benchmark_serving.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
```

**Other HuggingFaceDataset Examples**

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --hf-split train \
    --hf-subset "chart2text(cauldron)" \
    --num-prompts 10
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
    --hf-split train \
    --num-prompts 10
```

**`AI-MO/aimo-validation-aime`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42
```

**`philschmid/mt-bench`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80
```

**Running With Sampling Parameters**

When using OpenAI-compatible backends such as `vllm`, optional sampling parameters can be specified. Example client command:

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --top-k 10 \
    --top-p 0.9 \
    --temperature 0.5 \
    --num-prompts 10
```

**Running With Ramp-Up Request Rate**

The benchmark tool also supports ramping up the request rate over the duration of the benchmark run. This can be useful for stress testing the server or for finding the maximum throughput it can handle within a given latency budget.

Two ramp-up strategies are supported:

- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.

The following arguments control the ramp-up (a combined sketch follows this list):

- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
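A combined example, assuming the same ShareGPT setup as the first online benchmark above, with illustrative start and end rates (ramping linearly from 1 to 20 requests per second):

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 20
```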
## 📈 Example - Offline Throughput Benchmark
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name sonnet \
    --dataset-path vllm/benchmarks/sonnet.txt \
    --num-prompts 10
```

If successful, you will see the following output:

```
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
```

**VisionArena Benchmark for Vision Language Models**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --hf-split train
```

The `num prompt tokens` now includes image token counts:

```
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280
```

**InstructCoder Benchmark with Speculative Decoding**

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-config $'{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
```

```
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800
```

**Other HuggingFaceDataset Examples**

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --hf-split train \
    --hf-subset "chart2text(cauldron)" \
    --num-prompts 10
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --backend vllm-chat \
    --dataset-name hf \
    --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
    --hf-split train \
    --num-prompts 10
```

**`AI-MO/aimo-validation-aime`**

```bash
python3 benchmarks/benchmark_throughput.py \
    --model Qwen/QwQ-32B \
    --backend vllm \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --hf-split train \
    --num-prompts 10
```

**Benchmark with LoRA Adapters**

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-2-7b-hf \
    --backend vllm \
    --dataset_path /ShareGPT_V3_unfiltered_cleaned_split.json \
    --dataset_name sharegpt \
    --num-prompts 10 \
    --max-loras 2 \
    --max-lora-rank 8 \
    --enable-lora \
    --lora-path yard1/llama-2-7b-sql-lora-test
```
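**Random Synthetic Dataset**

The Random dataset from the overview table is generated synthetically, so there is nothing to download. A minimal sketch, assuming `--input-len` and `--output-len` control the synthetic prompt and output lengths as they do in the InstructCoder example above (the lengths and prompt count here are illustrative):

```bash
# Offline throughput run over synthetic random prompts (no dataset file needed)
python3 vllm/benchmarks/benchmark_throughput.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name random \
    --input-len 512 \
    --output-len 128 \
    --num-prompts 100
```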
## 🛠️ Example - Structured Output Benchmark
Benchmark the performance of structured output generation (JSON, grammar, regex).

**Server Setup**

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```

**JSON Schema Benchmark**

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000
```

**Grammar-based Generation Benchmark**

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset grammar \
    --structure-type grammar \
    --request-rate 10 \
    --num-prompts 1000
```

**Regex-based Generation Benchmark**

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset regex \
    --request-rate 10 \
    --num-prompts 1000
```

**Choice-based Generation Benchmark**

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset choice \
    --request-rate 10 \
    --num-prompts 1000
```

**XGrammar Benchmark Dataset**

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset xgrammar_bench \
    --request-rate 10 \
    --num-prompts 1000
```
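The `--structured-output-ratio` flag from the JSON example above can also mix guided and unguided requests in a single run, which is a simple way to gauge the overhead of structured output. A minimal sketch, reusing the same server and dataset and only changing the ratio (the 0.5 value is illustrative):

```bash
# Half of the requests use structured output, half are plain generation
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset json \
    --structured-output-ratio 0.5 \
    --request-rate 10 \
    --num-prompts 1000
```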
## 📚 Example - Long Document QA Benchmark
Benchmark the performance of long document question-answering with prefix caching.

**Basic Long Document QA Test**

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 16 \
    --document-length 2000 \
    --output-len 50 \
    --repeat-count 5
```

**Different Repeat Modes**

```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode random

# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode tile

# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode interleave
```
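To quantify how much prefix caching helps on this workload, a natural baseline is the same run with `--enable-prefix-caching` left out. A minimal sketch, reusing the sizes from the basic test above:

```bash
# Baseline without prefix caching, for comparison against the basic test above
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --num-documents 16 \
    --document-length 2000 \
    --output-len 50 \
    --repeat-count 5
```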
## 🗂️ Example - Prefix Caching Benchmark
Benchmark the efficiency of automatic prefix caching.

**Fixed Prompt with Prefix Caching**

```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```

**ShareGPT Dataset with Prefix Caching**

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
    --enable-prefix-caching \
    --num-prompts 20 \
    --repeat-count 5 \
    --input-length-range 128:256
```
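Cache benefit typically grows with the length of the shared prefix. A small sketch that sweeps a few `--input-length-range` values using the fixed-prompt setup above (the ranges are illustrative, not from the source):

```bash
# Sweep several input length ranges; longer ranges mean longer shared prefixes
for range in 128:256 256:512 512:1024; do
  python3 benchmarks/benchmark_prefix_caching.py \
      --model meta-llama/Llama-2-7b-chat-hf \
      --enable-prefix-caching \
      --num-prompts 1 \
      --repeat-count 100 \
      --input-length-range "$range"
done
```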
## ⚡ Example - Request Prioritization Benchmark
Benchmark the performance of request prioritization in vLLM.

**Basic Prioritization Test**

```bash
python3 benchmarks/benchmark_prioritization.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --input-len 128 \
    --output-len 64 \
    --num-prompts 100 \
    --scheduling-policy priority
```

**Multiple Sequences per Prompt**

```bash
python3 benchmarks/benchmark_prioritization.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --input-len 128 \
    --output-len 64 \
    --num-prompts 100 \
    --scheduling-policy priority \
    --n 2
```
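To see what prioritization buys you, it can help to compare against vLLM's default first-come-first-served scheduling. A minimal sketch, assuming `fcfs` is accepted as a `--scheduling-policy` value in your vLLM version:

```bash
# Baseline with the default FCFS policy (assumed flag value), same workload as above
python3 benchmarks/benchmark_prioritization.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --input-len 128 \
    --output-len 64 \
    --num-prompts 100 \
    --scheduling-policy fcfs
```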