Compare commits
247 Commits
v0.10.0rc2...main
|
@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc

## Trigger the benchmark

Performance benchmark will be triggered when:

- A PR is merged into vLLM.
- Every commit is pushed to a PR that carries both the `perf-benchmarks` and `ready` labels.

@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```

Runtime environment variables:

- `ON_CPU`: set the value to `1` on Intel® Xeon® Processors. Default value is `0`.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is an empty string (use the default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is an empty string (use the default file).

@ -46,12 +48,14 @@ Runtime environment variables:

- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is an empty string.

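For example, on an Intel® Xeon® machine the suite could be launched roughly as follows. This is a hedged sketch only; the two JSON paths are the CPU test files mentioned later in this document and are illustrative rather than required.

```bash
# Sketch only: run the performance benchmark suite on CPU with custom test files.
export ON_CPU=1                                    # use the CPU code path
export SERVING_JSON=tests/serving-tests-cpu.json   # assumed CPU serving config
export LATENCY_JSON=tests/latency-tests-cpu.json   # assumed CPU latency config
bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```
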
Nightly benchmark will be triggered when:

- Every commit is pushed to a PR that carries both the `perf-benchmarks` and `nightly-benchmarks` labels.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.

### Latency test

Here is an example of one test inside `latency-tests.json`:

@ -74,7 +78,7 @@ Here is an example of one test inside `latency-tests.json`:

In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- - The `parameters` attribute controls the command line arguments used for `benchmark_latency.py`. Please use an underscore `_` instead of a dash `-` when specifying the arguments; `run-performance-benchmarks.sh` converts the underscores to dashes before feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+ - The `parameters` attribute controls the command line arguments used for `vllm bench latency`. Please use an underscore `_` instead of a dash `-` when specifying the arguments; `run-performance-benchmarks.sh` converts the underscores to dashes before feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`

Note that the performance numbers are highly sensitive to the values of the parameters. Please make sure the parameters are set correctly.

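To make the conversion concrete, here is a minimal sketch of what such an entry could look like, reconstructed from the example command line above; the test name is illustrative.

```bash
# Hypothetical latency-tests.json entry; the parameters mirror the example
# command line above, and the test name is an assumption for illustration only.
cat > latency-tests.json <<'EOF'
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
EOF
# run-performance-benchmarks.sh converts the underscores to dashes, yielding:
#   --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 \
#   --load-format dummy --num-iters-warmup 5 --num-iters 15
```
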
@ -82,13 +86,13 @@ WARNING: The benchmarking script will save json results by itself, so please do

### Throughput test

- The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed forward to `benchmark_throughput.py`.
+ The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed forward to `vllm bench throughput`.

The number produced by this test is also stable; because it is stable, even a slight change in this number can indicate a real variation in performance.

### Serving test

- We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
+ We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[

@ -118,8 +122,8 @@ Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for the vLLM server.
- - The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
+ - The `client-parameters` includes the command line arguments for `vllm bench serve`.
- - The `qps_list` controls the list of QPS values for the test. It is used to configure the `--request-rate` parameter in `benchmark_serving.py`.
+ - The `qps_list` controls the list of QPS values for the test. It is used to configure the `--request-rate` parameter in `vllm bench serve`.

The number produced by this test is less stable than the delay and latency benchmarks (due to randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still indicates a real difference in performance.

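For reference, a minimal serving test entry might look like the sketch below; it is modeled on the CPU serving tests added elsewhere in this change, and the test name and values are illustrative.

```bash
# Sketch of a serving-tests.json entry (illustrative values, modeled on the
# CPU serving tests added in this change).
cat > serving-tests.json <<'EOF'
[
  {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  }
]
EOF
```
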
@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det

Here is an example using the script to compare result_a and result_b with detailed test names.

`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

|   | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------|----------------------------------|---------------------------------------|----------------------------------|------------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1    | 142.633982                       | serving_llama8B_tp1_sharegpt_qps_1    | 156.526018                       | 1.097396   |
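For the row above, the `perf_ratio` value is the result_b number divided by the result_a number, which can be checked by hand with the figures from the table:

```bash
# Reproduce perf_ratio for serving_llama8B_tp1_sharegpt_qps_1 from the table above.
awk 'BEGIN { printf "%.6f\n", 156.526018 / 142.633982 }'   # prints 1.097396
```
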

@ -1,3 +1,4 @@
+ # Nightly benchmark annotation

## Description

@ -13,15 +14,15 @@ Please download the visualization scripts in the post

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
    - Download `nightly-benchmarks.zip`.
    - In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.

@ -13,25 +13,25 @@ Latest reproduction guide: [github issue link](https://github.com/vllm-project/

## Setup

- Docker images:
    - vLLM: `vllm/vllm-openai:v0.6.2`
    - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
    - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
    - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
        - *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
    - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
    - 8x Nvidia A100 GPUs
- Workload:
    - Dataset
        - ShareGPT dataset
        - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
        - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
        - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of the datasets we use.
    - Models: llama-3 8B, llama-3 70B.
        - We do not use llama 3.1 as it is incompatible with trt-llm r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
    - Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
        - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, but all with a fixed random seed.
    - Evaluation metrics: Throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).

## Known issues

@ -1,3 +1,4 @@
+ # Performance benchmarks descriptions

## Latency tests

@ -44,6 +44,7 @@ serving_column_mapping = {
    "test_name": "Test name",
    "gpu_type": "GPU",
    "completed": "# of req.",
+   "max_concurrency": "# of max concurrency.",
    "request_throughput": "Tput (req/s)",
    "total_token_throughput": "Total Token Tput (tok/s)",
    "output_throughput": "Output Tput (tok/s)",
@ -100,7 +101,7 @@ if __name__ == "__main__":
            raw_result = json.loads(f.read())

        if "serving" in str(test_file):
-           # this result is generated via `benchmark_serving.py`
+           # this result is generated via `vllm bench serve` command

            # attach the benchmarking command to raw_result
            try:
@ -120,7 +121,7 @@ if __name__ == "__main__":
                continue

        elif "latency" in f.name:
-           # this result is generated via `benchmark_latency.py`
+           # this result is generated via `vllm bench latency` command

            # attach the benchmarking command to raw_result
            try:
@ -148,7 +149,7 @@ if __name__ == "__main__":
                continue

        elif "throughput" in f.name:
-           # this result is generated via `benchmark_throughput.py`
+           # this result is generated via `vllm bench throughput` command

            # attach the benchmarking command to raw_result
            try:

@ -73,7 +73,7 @@ get_current_llm_serving_engine() {
        echo "Container: vllm"
        # move to a completely irrelevant directory, to avoid importing vllm from the current folder
        export CURRENT_LLM_SERVING_ENGINE=vllm

        return
    fi
}
@ -95,12 +95,14 @@ json2args() {
}

kill_gpu_processes() {
-   pkill -f python
-   pkill -f python3
-   pkill -f tritonserver
-   pkill -f pt_main_thread
-   pkill -f text-generation
-   pkill -f lmdeploy
+   pkill -f '[p]ython'
+   pkill -f '[p]ython3'
+   pkill -f '[t]ritonserver'
+   pkill -f '[p]t_main_thread'
+   pkill -f '[t]ext-generation'
+   pkill -f '[l]mdeploy'
+   # vLLM now names the process with a VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
+   pkill -f '[V]LLM'

    while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
        sleep 1
@ -125,7 +127,7 @@ ensure_installed() {
}

run_serving_tests() {
-   # run serving tests using `benchmark_serving.py`
+   # run serving tests using `vllm bench serve` command
    # $1: a json file specifying serving test cases

    local serving_test_file
@ -225,7 +227,7 @@ run_serving_tests() {

        if [[ "$dataset_name" = "sharegpt" ]]; then

-           client_command="python3 benchmark_serving.py \
+           client_command="vllm bench serve \
                --backend $backend \
                --tokenizer /tokenizer_cache \
                --model $model \
@ -246,7 +248,7 @@ run_serving_tests() {
            sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
            sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')

-           client_command="python3 benchmark_serving.py \
+           client_command="vllm bench serve \
                --backend $backend \
                --tokenizer /tokenizer_cache \
                --model $model \
@ -265,13 +267,13 @@ run_serving_tests() {
                $client_args"

        else

            echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
            exit 1

        fi

        echo "Running test case $test_name with qps $qps"
        echo "Client command: $client_command"
@ -302,7 +304,7 @@ run_serving_tests() {
}

run_genai_perf_tests() {
    # run genai-perf tests

    # $1: a json file specifying genai-perf test cases
    local genai_perf_test_file
@ -311,14 +313,14 @@ run_genai_perf_tests() {
    # Iterate over genai-perf tests
    jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
        # get the test name, and append the GPU type back to it.
        test_name=$(echo "$params" | jq -r '.test_name')

        # if TEST_SELECTOR is set, only run the test cases that match the selector
        if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
            echo "Skip test case $test_name."
            continue
        fi

        # prepend the current serving engine to the test name
        test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
@ -369,10 +371,10 @@ run_genai_perf_tests() {
            qps=$num_prompts
            echo "now qps is $qps"
        fi

        new_test_name=$test_name"_qps_"$qps
        backend=$CURRENT_LLM_SERVING_ENGINE

        if [[ "$backend" == *"vllm"* ]]; then
            backend="vllm"
        fi
@ -413,7 +415,7 @@ prepare_dataset() {
    do
        cat sonnet.txt >> sonnet_4x.txt
    done

}

main() {

@ -33,7 +33,7 @@ check_gpus() {

check_cpus() {
    # check the number of CPUs and NUMA nodes, and the GPU type.
-   declare -g numa_count=$(python3 -c "from numa import info;numa_size = info.get_num_configured_nodes(); print(numa_size)")
+   declare -g numa_count=$(lscpu | grep "NUMA node(s):" | awk '{print $3}')
    if [[ $numa_count -gt 0 ]]; then
        echo "NUMA found."
        echo $numa_count
@ -126,7 +126,8 @@ kill_gpu_processes() {
    ps -aux
    lsof -t -i:8000 | xargs -r kill -9
    pgrep python3 | xargs -r kill -9
+   # vLLM now names the process with a VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
+   pgrep VLLM | xargs -r kill -9

    # wait until GPU memory usage is smaller than 1GB
    if command -v nvidia-smi; then
@ -164,7 +165,7 @@ upload_to_buildkite() {
}

run_latency_tests() {
-   # run latency tests using `benchmark_latency.py`
+   # run latency tests using `vllm bench latency` command
    # $1: a json file specifying latency test cases

    local latency_test_file
@ -205,7 +206,7 @@ run_latency_tests() {
            fi
        fi

-       latency_command=" $latency_envs python3 benchmark_latency.py \
+       latency_command=" $latency_envs vllm bench latency \
            --output-json $RESULTS_FOLDER/${test_name}.json \
            $latency_args"

@ -231,7 +232,7 @@ run_latency_tests() {
}

run_throughput_tests() {
-   # run throughput tests using `benchmark_throughput.py`
+   # run throughput tests using `vllm bench throughput`
    # $1: a json file specifying throughput test cases

    local throughput_test_file
@ -272,7 +273,7 @@ run_throughput_tests() {
            fi
        fi

-       throughput_command=" $throughput_envs python3 benchmark_throughput.py \
+       throughput_command=" $throughput_envs vllm bench throughput \
            --output-json $RESULTS_FOLDER/${test_name}.json \
            $throughput_args"

@ -297,7 +298,7 @@ run_throughput_tests() {
}

run_serving_tests() {
-   # run serving tests using `benchmark_serving.py`
+   # run serving tests using `vllm bench serve` command
    # $1: a json file specifying serving test cases

    local serving_test_file
@ -393,7 +394,7 @@ run_serving_tests() {

        # pass the tensor parallel size to the client so that it can be displayed
        # on the benchmark dashboard
-       client_command="python3 benchmark_serving.py \
+       client_command="vllm bench serve \
            --save-result \
            --result-dir $RESULTS_FOLDER \
            --result-filename ${new_test_name}.json \
@ -447,7 +448,7 @@ main() {
    (which jq) || (apt-get update && apt-get -y install jq)
    (which lsof) || (apt-get update && apt-get install -y lsof)

-   # get the current IP address, required by benchmark_serving.py
+   # get the current IP address, required by `vllm bench serve` command
    export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
    # turn off the reporting of the status of each request, to clean up the terminal output
    export VLLM_LOGGING_LEVEL="WARNING"

@ -0,0 +1,209 @@
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp1_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 1,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp2_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 2,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp4_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 4,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp1_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 1,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp2_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 2,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp4_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 4,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
|
@ -0,0 +1,211 @@
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_pp1_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"pipeline_parallel_size": 1,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_pp3_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"pipeline_parallel_size": 3,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp2pp6_sharegpt",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 2,
|
||||||
|
"pipeline_parallel_size": 3,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"max_concurrency": 60,
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_pp1_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"pipeline_parallel_size": 1,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_pp3_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"pipeline_parallel_size": 3,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_tp2pp3_random_128_128",
|
||||||
|
"qps_list": [1, 4, 16, "inf"],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 2,
|
||||||
|
"pipeline_parallel_size": 3,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"disable_log_requests": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"max_concurrency": 1000,
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
|
@ -6,6 +6,7 @@
|
||||||
"VLLM_RPC_TIMEOUT": 100000,
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"server_parameters": {
|
"server_parameters": {
|
||||||
|
@ -18,6 +19,8 @@
|
||||||
"disable_log_stats": "",
|
"disable_log_stats": "",
|
||||||
"disable_log_requests": "",
|
"disable_log_requests": "",
|
||||||
"enforce_eager": "",
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
"load_format": "dummy"
|
"load_format": "dummy"
|
||||||
},
|
},
|
||||||
"client_parameters": {
|
"client_parameters": {
|
||||||
|
@ -36,6 +39,7 @@
|
||||||
"VLLM_RPC_TIMEOUT": 100000,
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"server_parameters": {
|
"server_parameters": {
|
||||||
|
@ -48,6 +52,8 @@
|
||||||
"disable_log_stats": "",
|
"disable_log_stats": "",
|
||||||
"disable_log_requests": "",
|
"disable_log_requests": "",
|
||||||
"enforce_eager": "",
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
"load_format": "dummy"
|
"load_format": "dummy"
|
||||||
},
|
},
|
||||||
"client_parameters": {
|
"client_parameters": {
|
||||||
|
@ -66,6 +72,7 @@
|
||||||
"VLLM_RPC_TIMEOUT": 100000,
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"server_parameters": {
|
"server_parameters": {
|
||||||
|
@ -78,6 +85,8 @@
|
||||||
"disable_log_stats": "",
|
"disable_log_stats": "",
|
||||||
"disable_log_requests": "",
|
"disable_log_requests": "",
|
||||||
"enforce_eager": "",
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
"load_format": "dummy"
|
"load_format": "dummy"
|
||||||
},
|
},
|
||||||
"client_parameters": {
|
"client_parameters": {
|
||||||
|
@ -96,6 +105,7 @@
|
||||||
"VLLM_RPC_TIMEOUT": 100000,
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"server_parameters": {
|
"server_parameters": {
|
||||||
|
@ -109,6 +119,8 @@
|
||||||
"disable_log_stats": "",
|
"disable_log_stats": "",
|
||||||
"disable_log_requests": "",
|
"disable_log_requests": "",
|
||||||
"enforce_eager": "",
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
"load_format": "dummy"
|
"load_format": "dummy"
|
||||||
},
|
},
|
||||||
"client_parameters": {
|
"client_parameters": {
|
||||||
|
@ -129,6 +141,7 @@
|
||||||
"VLLM_RPC_TIMEOUT": 100000,
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"server_parameters": {
|
"server_parameters": {
|
||||||
|
@ -142,6 +155,8 @@
|
||||||
"disable_log_stats": "",
|
"disable_log_stats": "",
|
||||||
"disable_log_requests": "",
|
"disable_log_requests": "",
|
||||||
"enforce_eager": "",
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
"load_format": "dummy"
|
"load_format": "dummy"
|
||||||
},
|
},
|
||||||
"client_parameters": {
|
"client_parameters": {
|
||||||
|
|
|
@ -13,9 +13,9 @@ NUMA_NODE=${NUMA_NODE:-1}
export CMAKE_BUILD_PARALLEL_LEVEL=32

# Setup cleanup
remove_docker_container() {
    set -e;
    docker rm -f cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"-avx2 || true;
}
trap remove_docker_container EXIT
remove_docker_container
@ -69,7 +69,7 @@ function cpu_tests() {
    docker exec cpu-test-"$NUMA_NODE" bash -c "
        set -e
        pytest -s -v \
        tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"

    # Note: disable it until it supports V1
    # Run AWQ test
@ -78,23 +78,23 @@ function cpu_tests() {
    # VLLM_USE_V1=0 pytest -s -v \
    # tests/quantization/test_ipex_quant.py"

-   # online serving
-   docker exec cpu-test-"$NUMA_NODE" bash -c '
-       set -e
-       VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
-       timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
-       python3 benchmarks/benchmark_serving.py \
-           --backend vllm \
-           --dataset-name random \
-           --model meta-llama/Llama-3.2-3B-Instruct \
-           --num-prompts 20 \
-           --endpoint /v1/completions'

    # Run multi-lora tests
    docker exec cpu-test-"$NUMA_NODE" bash -c "
        set -e
        pytest -s -v \
        tests/lora/test_qwen2vl.py"

+   # online serving
+   docker exec cpu-test-"$NUMA_NODE" bash -c '
+       set -e
+       VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
+       timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
+       vllm bench serve \
+           --backend vllm \
+           --dataset-name random \
+           --model meta-llama/Llama-3.2-3B-Instruct \
+           --num-prompts 20 \
+           --endpoint /v1/completions'
}

# All of the CPU tests are expected to finish in less than 40 mins.

@ -16,8 +16,7 @@ DOCKER_BUILDKIT=1 docker build . \
    --build-arg max_jobs=66 \
    --build-arg nvcc_threads=2 \
    --build-arg RUN_WHEEL_CHECK=false \
-   --build-arg torch_cuda_arch_list="9.0+PTX" \
-   --build-arg vllm_fa_cmake_gpu_arches="90-real"
+   --build-arg torch_cuda_arch_list="9.0+PTX"

# Setup cleanup
remove_docker_container() { docker rm -f gh200-test || true; }

@@ -0,0 +1,166 @@
+#!/bin/bash
+
+set -xu
+
+remove_docker_container() {
+  docker rm -f tpu-test || true;
+  docker rm -f vllm-tpu || true;
+}
+
+trap remove_docker_container EXIT
+
+# Remove the container that might not be cleaned up in the previous run.
+remove_docker_container
+
+# Build the docker image.
+docker build -f docker/Dockerfile.tpu -t vllm-tpu .
+
+# Set up cleanup.
+cleanup_docker() {
+  # Get Docker's root directory
+  docker_root=$(docker info -f '{{.DockerRootDir}}')
+  if [ -z "$docker_root" ]; then
+    echo "Failed to determine Docker root directory."
+    exit 1
+  fi
+  echo "Docker root directory: $docker_root"
+  # Check disk usage of the filesystem where Docker's root directory is located
+  disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
+  # Define the threshold
+  threshold=70
+  if [ "$disk_usage" -gt "$threshold" ]; then
+    echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
+    # Remove dangling images (those that are not tagged and not used by any container)
+    docker image prune -f
+    # Remove unused volumes / force the system prune for old images as well.
+    docker volume prune -f && docker system prune --force --filter "until=72h" --all
+    echo "Docker images and volumes cleanup completed."
+  else
+    echo "Disk usage is below $threshold%. No cleanup needed."
+  fi
+}
+cleanup_docker
+
+# For HF_TOKEN.
+source /etc/environment
+
+docker run --privileged --net host --shm-size=16G -it \
+  -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
+  vllm-tpu /bin/bash -c '
+  set -e # Exit immediately if a command exits with a non-zero status.
+  set -u # Treat unset variables as an error.
+
+  echo "--- Starting script inside Docker container ---"
+
+  # Create results directory
+  RESULTS_DIR=$(mktemp -d)
+  # If mktemp fails, set -e will cause the script to exit.
+  echo "Results will be stored in: $RESULTS_DIR"
+
+  # Install dependencies
+  echo "--- Installing Python dependencies ---"
+  python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
+    && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
+    && python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \
+    && python3 -m pip install --progress-bar off hf-transfer
+  echo "--- Python dependencies installed ---"
+  export VLLM_USE_V1=1
+  export VLLM_XLA_CHECK_RECOMPILATION=1
+  export VLLM_XLA_CACHE_PATH=
+  echo "Using VLLM V1"
+
+  echo "--- Hardware Information ---"
+  # tpu-info
+  echo "--- Starting Tests ---"
+  set +e
+  overall_script_exit_code=0
+
+  # --- Test Definitions ---
+  # If a test fails, this function will print logs and will not cause the main script to exit.
+  run_test() {
+    local test_num=$1
+    local test_name=$2
+    local test_command=$3
+    local log_file="$RESULTS_DIR/test_${test_num}.log"
+    local actual_exit_code
+
+    echo "--- TEST_$test_num: Running $test_name ---"
+
+    # Execute the test command.
+    eval "$test_command" > >(tee -a "$log_file") 2> >(tee -a "$log_file" >&2)
+    actual_exit_code=$?
+
+    echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" # This goes to main log
+    echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" >> "$log_file" # Also to per-test log
+
+    if [ "$actual_exit_code" -ne 0 ]; then
+      echo "TEST_$test_num ($test_name) FAILED with exit code $actual_exit_code." >&2
+      echo "--- Log for failed TEST_$test_num ($test_name) ---" >&2
+      if [ -f "$log_file" ]; then
+        cat "$log_file" >&2
+      else
+        echo "Log file $log_file not found for TEST_$test_num ($test_name)." >&2
+      fi
+      echo "--- End of log for TEST_$test_num ($test_name) ---" >&2
+      return "$actual_exit_code" # Return the failure code
+    else
+      echo "TEST_$test_num ($test_name) PASSED."
+      return 0 # Return success
+    fi
+  }
+
+  # Helper function to call run_test and update the overall script exit code
+  run_and_track_test() {
+    local test_num_arg="$1"
+    local test_name_arg="$2"
+    local test_command_arg="$3"
+
+    # Run the test
+    run_test "$test_num_arg" "$test_name_arg" "$test_command_arg"
+    local test_specific_exit_code=$?
+
+    # If the test failed, set the overall script exit code to 1
+    if [ "$test_specific_exit_code" -ne 0 ]; then
+      # No need for extra echo here, run_test already logged the failure.
+      overall_script_exit_code=1
+    fi
+  }
+
+  # --- Actual Test Execution ---
+  run_and_track_test 1 "test_struct_output_generate.py" \
+    "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
+  run_and_track_test 2 "test_moe_pallas.py" \
+    "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
+  run_and_track_test 3 "test_lora.py" \
+    "VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
+  run_and_track_test 4 "test_tpu_qkv_linear.py" \
+    "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
+  run_and_track_test 5 "test_spmd_model_weight_loading.py" \
+    "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
+  run_and_track_test 6 "test_kv_cache_update_kernel.py" \
+    "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
+
+  # After all tests have been attempted, exit with the overall status.
+  if [ "$overall_script_exit_code" -ne 0 ]; then
+    echo "--- One or more tests FAILED. Overall script exiting with failure code 1. ---"
+  else
+    echo "--- All tests have completed and PASSED. Overall script exiting with success code 0. ---"
+  fi
+  exit "$overall_script_exit_code"
+' # IMPORTANT: This is the closing single quote for the bash -c "..." command. Ensure it is present and correct.
+
+# Capture the exit code of the docker run command
+DOCKER_RUN_EXIT_CODE=$?
+
+# The trap will run for cleanup.
+# Exit the main script with the Docker run command's exit code.
+if [ "$DOCKER_RUN_EXIT_CODE" -ne 0 ]; then
+  echo "Docker run command failed with exit code $DOCKER_RUN_EXIT_CODE."
+  exit "$DOCKER_RUN_EXIT_CODE"
+else
+  echo "Docker run command completed successfully."
+  exit 0
+fi
+# TODO: This test fails because it uses RANDOM_SEED sampling
+# pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
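One detail worth noting in the new script above: `run_test` mirrors a test command's stdout and stderr into a per-test log with `tee` via process substitution rather than a plain pipe, so that `$?` still reflects the test command's own exit status instead of `tee`'s. A minimal standalone sketch of that pattern (the command and log path below are placeholders, not taken from the CI script):

```bash
#!/bin/bash
log_file=$(mktemp)

# Placeholder command standing in for a real test invocation.
test_command='echo "running a fake test"; false'

# With a plain pipe (cmd | tee log), $? would be tee's exit status.
# Process substitution keeps the command itself in the foreground,
# so $? below is the command's own exit code.
eval "$test_command" > >(tee -a "$log_file") 2> >(tee -a "$log_file" >&2)
exit_code=$?

echo "captured exit code: $exit_code (log at $log_file)"
```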
@@ -135,7 +135,7 @@ run_and_track_test 1 "test_compilation.py" \
 run_and_track_test 2 "test_basic.py" \
   "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py"
 run_and_track_test 3 "test_accuracy.py::test_lm_eval_accuracy_v1_engine" \
-  "python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine"
+  "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine"
 run_and_track_test 4 "test_quantization_accuracy.py" \
   "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py"
 run_and_track_test 5 "examples/offline_inference/tpu.py" \
@@ -150,18 +150,6 @@ run_and_track_test 9 "test_multimodal.py" \
   "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py"
 run_and_track_test 10 "test_pallas.py" \
   "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py"
-run_and_track_test 11 "test_struct_output_generate.py" \
-  "HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
-run_and_track_test 12 "test_moe_pallas.py" \
-  "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
-run_and_track_test 13 "test_lora.py" \
-  "VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
-run_and_track_test 14 "test_tpu_qkv_linear.py" \
-  "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
-run_and_track_test 15 "test_spmd_model_weight_loading.py" \
-  "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
-run_and_track_test 16 "test_kv_cache_update_kernel.py" \
-  "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"

 # After all tests have been attempted, exit with the overall status.
 if [ "$overall_script_exit_code" -ne 0 ]; then
@@ -11,10 +11,10 @@ cd "$(dirname "${BASH_SOURCE[0]}")/../.."
 (which wget && which curl) || (apt-get update && apt-get install -y wget curl)

 # run python-based benchmarks and upload the result to buildkite
-python3 benchmarks/benchmark_latency.py --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
+vllm bench latency --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
 bench_latency_exit_code=$?

-python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
+vllm bench throughput --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
 bench_throughput_exit_code=$?

 # run server-based benchmarks and upload the result to buildkite
@@ -24,7 +24,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r

 # wait for server to start, timeout after 600 seconds
 timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
-python3 benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --dataset-name sharegpt \
   --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
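The change in these hunks is mechanical: the standalone scripts under `benchmarks/` are replaced by the `vllm bench` CLI subcommands, with the remaining flags left untouched. As a rough before/after summary (the flags shown are the ones that appear in this diff; each subcommand accepts the same arguments as the script it replaces, and `vllm bench serve` still expects an already-running server, as in the surrounding script):

```bash
# Before: invoking the benchmark scripts from the source tree
#   python3 benchmarks/benchmark_latency.py    --output-json latency_results.json
#   python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json
#   python3 benchmarks/benchmark_serving.py    --backend vllm --dataset-name sharegpt ...

# After: the equivalent subcommands bundled with the vllm CLI
vllm bench latency --output-json latency_results.json
vllm bench throughput --input-len 256 --output-len 256 --output-json throughput_results.json
vllm bench serve --backend vllm --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json
```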
@@ -77,7 +77,7 @@ done
 echo "run benchmark test..."
 echo "logging to $BM_LOG"
 echo
-python benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --model $MODEL \
   --dataset-name sonnet \
@@ -82,7 +82,7 @@ steps:
   - bash standalone_tests/python_only_compile.sh

 - label: Basic Correctness Test # 30min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   fast_check: true
   torch_nightly: true
   source_file_dependencies:
@@ -99,7 +99,7 @@ steps:
   - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py

 - label: Chunked Prefill Test
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
   - tests/basic_correctness/test_chunked_prefill
@@ -108,7 +108,7 @@ steps:
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py

 - label: Core Test # 10min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   fast_check: true
   source_file_dependencies:
   - vllm/core
@@ -128,11 +128,10 @@ steps:
   - tests/entrypoints/offline_mode
   commands:
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
-  - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
+  - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_collective_rpc.py
   - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
-  - VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
   - VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

 - label: Entrypoints Test (API Server) # 40min
@@ -210,7 +209,7 @@ steps:
   - pytest -v -s distributed/test_eplb_execute.py

 - label: Metrics, Tracing Test # 10min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   num_gpus: 2
   source_file_dependencies:
   - vllm/
@@ -229,7 +228,7 @@ steps:
 ##### 1 GPU test #####

 - label: Regression Test # 5min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
   - tests/test_regression
@@ -281,7 +280,7 @@ steps:
   - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine

 - label: Examples Test # 25min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   working_dir: "/vllm-workspace/examples"
   source_file_dependencies:
   - vllm/entrypoints
@@ -306,7 +305,7 @@ steps:
   - VLLM_USE_V1=0 python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2

 - label: Prefix Caching Test # 9min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
   - tests/prefix_caching
@@ -315,7 +314,7 @@ steps:


 - label: Platform Tests (CUDA)
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
   - tests/cuda
@@ -356,7 +355,7 @@ steps:
   - pytest -v -s compile/test_async_tp.py

 - label: PyTorch Fullgraph Smoke Test # 9min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   torch_nightly: true
   source_file_dependencies:
   - vllm/
@@ -369,7 +368,7 @@ steps:
   - pytest -v -s compile/piecewise/test_full_cudagraph.py

 - label: PyTorch Fullgraph Test # 18min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   torch_nightly: true
   source_file_dependencies:
   - vllm/
@@ -378,7 +377,7 @@ steps:
   - pytest -v -s compile/test_full_graph.py

 - label: Kernels Core Operation Test
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - csrc/
   - tests/kernels/core
@@ -403,20 +402,21 @@ steps:
   - vllm/model_executor/layers/quantization
   - tests/kernels/quantization
   commands:
   - pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
   parallelism: 2

-- label: Kernels MoE Test
+- label: Kernels MoE Test %N
   mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - csrc/moe/
   - tests/kernels/moe
   - vllm/model_executor/layers/fused_moe/
   commands:
-  - pytest -v -s kernels/moe
+  - pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  parallelism: 2

 - label: Kernels Mamba Test
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - csrc/mamba/
   - tests/kernels/mamba
|
||||||
- pytest -v -s kernels/mamba
|
- pytest -v -s kernels/mamba
|
||||||
|
|
||||||
- label: Tensorizer Test # 11min
|
- label: Tensorizer Test # 11min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
soft_fail: true
|
soft_fail: true
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/model_executor/model_loader
|
- vllm/model_executor/model_loader
|
||||||
|
@ -437,7 +437,7 @@ steps:
|
||||||
- pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py
|
- pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py
|
||||||
|
|
||||||
- label: Model Executor Test
|
- label: Model Executor Test
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/model_executor
|
- vllm/model_executor
|
||||||
- tests/model_executor
|
- tests/model_executor
|
||||||
|
@ -447,7 +447,7 @@ steps:
|
||||||
- pytest -v -s model_executor
|
- pytest -v -s model_executor
|
||||||
|
|
||||||
- label: Benchmarks # 9min
|
- label: Benchmarks # 9min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
working_dir: "/vllm-workspace/.buildkite"
|
working_dir: "/vllm-workspace/.buildkite"
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- benchmarks/
|
- benchmarks/
|
||||||
|
@ -455,7 +455,7 @@ steps:
|
||||||
- bash scripts/run-benchmarks.sh
|
- bash scripts/run-benchmarks.sh
|
||||||
|
|
||||||
- label: Benchmarks CLI Test # 10min
|
- label: Benchmarks CLI Test # 10min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
- tests/benchmarks/
|
- tests/benchmarks/
|
||||||
|
@ -494,7 +494,7 @@ steps:
|
||||||
- pytest -s entrypoints/openai/correctness/
|
- pytest -s entrypoints/openai/correctness/
|
||||||
|
|
||||||
- label: Encoder Decoder tests # 5min
|
- label: Encoder Decoder tests # 5min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
- tests/encoder_decoder
|
- tests/encoder_decoder
|
||||||
|
@ -502,7 +502,7 @@ steps:
|
||||||
- pytest -v -s encoder_decoder
|
- pytest -v -s encoder_decoder
|
||||||
|
|
||||||
- label: OpenAI-Compatible Tool Use # 20 min
|
- label: OpenAI-Compatible Tool Use # 20 min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
fast_check: false
|
fast_check: false
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
|
@ -623,7 +623,7 @@ steps:
|
||||||
|
|
||||||
# This test is used only in PR development phase to test individual models and should never run on main
|
# This test is used only in PR development phase to test individual models and should never run on main
|
||||||
- label: Custom Models Test
|
- label: Custom Models Test
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
optional: true
|
optional: true
|
||||||
commands:
|
commands:
|
||||||
- echo 'Testing custom models...'
|
- echo 'Testing custom models...'
|
||||||
|
@ -643,11 +643,22 @@ steps:
|
||||||
- python3 examples/offline_inference/audio_language.py --model-type whisper
|
- python3 examples/offline_inference/audio_language.py --model-type whisper
|
||||||
- python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
|
- python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
|
||||||
|
|
||||||
|
- label: Blackwell Test
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
gpu: b200
|
||||||
|
optional: true
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/
|
||||||
|
- vllm/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
- python3 examples/offline_inference/basic/chat.py
|
||||||
|
|
||||||
##### 1 GPU test #####
|
##### 1 GPU test #####
|
||||||
##### multi gpus test #####
|
##### multi gpus test #####
|
||||||
|
|
||||||
- label: Distributed Comm Ops Test # 7min
|
- label: Distributed Comm Ops Test # 7min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
working_dir: "/vllm-workspace/tests"
|
working_dir: "/vllm-workspace/tests"
|
||||||
num_gpus: 2
|
num_gpus: 2
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
|
@ -744,7 +755,7 @@ steps:
|
||||||
- pytest -v -s plugins/lora_resolvers # unit tests for in-tree lora resolver plugins
|
- pytest -v -s plugins/lora_resolvers # unit tests for in-tree lora resolver plugins
|
||||||
|
|
||||||
- label: Multi-step Tests (4 GPUs) # 36min
|
- label: Multi-step Tests (4 GPUs) # 36min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
working_dir: "/vllm-workspace/tests"
|
working_dir: "/vllm-workspace/tests"
|
||||||
num_gpus: 4
|
num_gpus: 4
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
|
@ -765,7 +776,7 @@ steps:
|
||||||
- pytest -v -s multi_step/test_correctness_llm.py
|
- pytest -v -s multi_step/test_correctness_llm.py
|
||||||
|
|
||||||
- label: Pipeline Parallelism Test # 45min
|
- label: Pipeline Parallelism Test # 45min
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
working_dir: "/vllm-workspace/tests"
|
working_dir: "/vllm-workspace/tests"
|
||||||
num_gpus: 4
|
num_gpus: 4
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
|
@ -779,7 +790,7 @@ steps:
|
||||||
- pytest -v -s distributed/test_pipeline_parallel.py
|
- pytest -v -s distributed/test_pipeline_parallel.py
|
||||||
|
|
||||||
- label: LoRA TP Test (Distributed)
|
- label: LoRA TP Test (Distributed)
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental]
|
||||||
num_gpus: 4
|
num_gpus: 4
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/lora
|
- vllm/lora
|
||||||
|
|
|
@@ -10,7 +10,6 @@
 /vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
 /vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
 /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
-/vllm/model_executor/guided_decoding @mgoin @russellb @aarnphm
 /vllm/multimodal @DarkLight1337 @ywang96
 /vllm/vllm_flash_attn @LucasWilkinson
 /vllm/lora @jeejeelee
@@ -35,9 +34,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 /tests/distributed/test_pipeline_parallel.py @youkaichao
 /tests/distributed/test_same_node.py @youkaichao
 /tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo @aarnphm
-/tests/entrypoints/llm/test_guided_generate.py @mgoin @russellb @aarnphm
 /tests/kernels @tlrmchlsmth @WoosukKwon
-/tests/model_executor/test_guided_processors.py @mgoin @russellb
 /tests/models @DarkLight1337 @ywang96
 /tests/multi_step @alexm-redhat @comaniac
 /tests/multimodal @DarkLight1337 @ywang96
@@ -52,3 +49,27 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 # Docs
 /docs @hmellor
 mkdocs.yaml @hmellor
+
+# CPU
+/vllm/v1/worker/^cpu @bigPYJ1151
+/csrc/cpu @bigPYJ1151
+/vllm/platforms/cpu.py @bigPYJ1151
+/cmake/cpu_extension.cmake @bigPYJ1151
+/docker/Dockerfile.cpu @bigPYJ1151
+
+# Intel GPU
+/vllm/v1/worker/^xpu @jikunshang
+/vllm/platforms/xpu.py @jikunshang
+/docker/Dockerfile.xpu @jikunshang
+
+# Qwen-specific files
+/vllm/attention/backends/dual_chunk_flash_attn.py @sighingnow
+/vllm/model_executor/models/qwen* @sighingnow
+
+# Mistral-specific files
+/vllm/model_executor/models/mistral*.py @patrickvonplaten
+/vllm/model_executor/models/mixtral*.py @patrickvonplaten
+/vllm/model_executor/models/voxtral*.py @patrickvonplaten
+/vllm/model_executor/models/pixtral*.py @patrickvonplaten
+/vllm/transformers_utils/configs/mistral.py @patrickvonplaten
+/vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten
@@ -1,4 +1,5 @@
-## Essential Elements of an Effective PR Description Checklist
+# Essential Elements of an Effective PR Description Checklist

 - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
 - [ ] The test plan, such as providing test command.
 - [ ] The test results, such as pasting the results comparison before and after, or e2e results
@@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B

 ## (Optional) Documentation Update

-<!--- pyml disable-next-line no-emphasis-as-heading -->
 **BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions)
@@ -149,9 +149,6 @@ pull_request_rules:
       - files=examples/offline_inference/structured_outputs.py
       - files=examples/online_serving/openai_chat_completion_structured_outputs.py
       - files=examples/online_serving/openai_chat_completion_structured_outputs_with_reasoning.py
-      - files~=^vllm/model_executor/guided_decoding/
-      - files=tests/model_executor/test_guided_processors.py
-      - files=tests/entrypoints/llm/test_guided_generate.py
       - files~=^tests/v1/structured_output/
       - files=tests/v1/entrypoints/llm/test_guided_generate.py
       - files~=^vllm/v1/structured_output/
@@ -2,6 +2,10 @@ name: Lint and Deploy Charts

 on: pull_request

+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
 permissions:
   contents: read

@@ -0,0 +1,17 @@
+{
+  "problemMatcher": [
+    {
+      "owner": "markdownlint",
+      "pattern": [
+        {
+          "regexp": "^([^:]*):(\\d+):?(\\d+)?\\s([\\w-\\/]*)\\s(.*)$",
+          "file": 1,
+          "line": 2,
+          "column": 3,
+          "code": 4,
+          "message": 5
+        }
+      ]
+    }
+  ]
+}
@@ -5,6 +5,10 @@ on:
   push:
     branches: [main]

+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
 permissions:
   contents: read

@@ -17,6 +21,7 @@ jobs:
       with:
         python-version: "3.12"
     - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
+    - run: echo "::add-matcher::.github/workflows/matchers/markdownlint.json"
     - run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
     - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
       with:
@@ -15,7 +15,6 @@ $python_executable -m pip install -r requirements/build.txt -r requirements/cuda
 export MAX_JOBS=1
 # Make sure release wheels are built for the following architectures
 export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
-export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real"

 bash tools/check_repo.sh

@@ -0,0 +1,13 @@
+MD007:
+  indent: 4
+MD013: false
+MD024:
+  siblings_only: true
+MD033: false
+MD042: false
+MD045: false
+MD046: false
+MD051: false
+MD052: false
+MD053: false
+MD059: false
@@ -35,12 +35,12 @@ repos:
     exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*'
     types_or: [c++, cuda]
     args: [--style=file, --verbose]
-- repo: https://github.com/jackdewinter/pymarkdown
-  rev: v0.9.29
+- repo: https://github.com/igorshubovych/markdownlint-cli
+  rev: v0.45.0
   hooks:
-  - id: pymarkdown
+  - id: markdownlint
     exclude: '.*\.inc\.md'
-    args: [fix]
+    stages: [manual] # Only run in CI
 - repo: https://github.com/rhysd/actionlint
   rev: v1.7.7
   hooks:
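Because the new markdownlint hook is pinned to `stages: [manual]`, it no longer runs on every local commit; CI invokes it explicitly. To reproduce the CI check locally, something along these lines should work (assuming pre-commit is installed and the hook id stays `markdownlint`):

```bash
# Run only the manual-stage markdownlint hook across the whole repo,
# mirroring what CI does through the pre-commit action.
pre-commit run markdownlint --hook-stage manual --all-files
```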
@@ -7,6 +7,9 @@ build:
   os: ubuntu-22.04
   tools:
     python: "3.12"
+  jobs:
+    post_checkout:
+      - git fetch --unshallow || true

 mkdocs:
   configuration: mkdocs.yaml
@@ -635,7 +635,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 "in CUDA target architectures.")
 endif()
 endif()

 cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
 if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
 set(SRCS "csrc/quantization/cutlass_w8a8/moe/blockwise_scaled_group_mm_sm100.cu")
@@ -768,6 +768,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu")
 endif()

+if(VLLM_GPU_LANG STREQUAL "CUDA")
+  set(MOE_PERMUTE_SRC
+    "csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
+    "csrc/moe/moe_permute_unpermute_op.cu")
+
+  list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
+endif()
+
 set_gencode_flags_for_srcs(
 SRCS "${VLLM_MOE_EXT_SRC}"
 CUDA_ARCHS "${CUDA_ARCHS}")
@@ -836,17 +844,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 endif()
 endif()

-if(VLLM_GPU_LANG STREQUAL "CUDA")
-  set(MOE_PERMUTE_SRC
-    "csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
-    "csrc/moe/moe_permute_unpermute_op.cu")
-
-  set_gencode_flags_for_srcs(
-    SRCS "${MARLIN_PERMUTE_SRC}"
-    CUDA_ARCHS "${MOE_PERMUTE_ARCHS}")
-
-  list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
-endif()
 message(STATUS "Enabling moe extension.")
 define_gpu_extension_target(
 _moe_C
@@ -1,3 +1,4 @@
+<!-- markdownlint-disable MD001 MD041 -->
 <p align="center">
 <picture>
 <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone
 ---

 *Latest News* 🔥
+
 - [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
 - [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
 - [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
@@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone
 </details>

 ---

 ## About
+
 vLLM is a fast and easy-to-use library for LLM inference and serving.
@@ -75,6 +78,7 @@ vLLM is flexible and easy to use with:
 - Multi-LoRA support

 vLLM seamlessly supports most popular open-source models on HuggingFace, including:
+
 - Transformer-like LLMs (e.g., Llama)
 - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
 - Embedding Models (e.g., E5-Mistral)
@@ -91,6 +95,7 @@ pip install vllm
 ```

 Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
+
 - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
 - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
 - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
@@ -107,6 +112,7 @@ vLLM is a community project. Our compute resources for development and testing a
 <!-- Note: Please sort them in alphabetical order. -->
 <!-- Note: Please keep these consistent with docs/community/sponsors.md -->
 Cash Donations:
+
 - a16z
 - Dropbox
 - Sequoia Capital
@@ -114,6 +120,7 @@ Cash Donations:
 - ZhenFund

 Compute Resources:
+
 - AMD
 - Anyscale
 - AWS
@@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons
 Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI.

 **Current Coverage:**
+
 * Models: Llama3, Llama4, and Mixtral
 * Hardware: NVIDIA H100 and AMD MI300x
-* *Note: Coverage may change based on new model releases and hardware availability*
+* _Note: Coverage may change based on new model releases and hardware availability_

 **Performance Validation Process:**

@@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github

 **Step 2: Review Benchmark Setup**
 Familiarize yourself with the benchmark configurations:
+
 * [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda)
 * [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm)

 **Step 3: Run the Benchmark**
 Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure:
+
 * **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`)
 * **vLLM commit**: Set to the RC commit hash

@@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive
 datasets supported on vLLM. It’s a living document, updated as new features and datasets
 become available.

-**Dataset Overview**
+## Dataset Overview

 <table style="width:100%; border-collapse: collapse;">
 <thead>
@@ -81,9 +81,10 @@ become available.

 **Note**: HuggingFace dataset's `dataset-name` should be set to `hf`

----
+## 🚀 Example - Online Benchmark
+
 <details>
-<summary><b>🚀 Example - Online Benchmark</b></summary>
+<summary>Show more</summary>

 <br/>

@@ -98,7 +99,7 @@ Then run the benchmarking script
 ```bash
 # download dataset
 # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --model NousResearch/Hermes-3-Llama-3.1-8B \
   --endpoint /v1/completions \
@@ -109,39 +110,39 @@ python3 vllm/benchmarks/benchmark_serving.py \

 If successful, you will see the following output

-```
+```text
 ============ Serving Benchmark Result ============
 Successful requests: 10
 Benchmark duration (s): 5.78
 Total input tokens: 1369
 Total generated tokens: 2212
 Request throughput (req/s): 1.73
 Output token throughput (tok/s): 382.89
 Total Token throughput (tok/s): 619.85
 ---------------Time to First Token----------------
 Mean TTFT (ms): 71.54
 Median TTFT (ms): 73.88
 P99 TTFT (ms): 79.49
 -----Time per Output Token (excl. 1st token)------
 Mean TPOT (ms): 7.91
 Median TPOT (ms): 7.96
 P99 TPOT (ms): 8.03
 ---------------Inter-token Latency----------------
 Mean ITL (ms): 7.74
 Median ITL (ms): 7.70
 P99 ITL (ms): 8.39
 ==================================================
 ```

-**Custom Dataset**
+### Custom Dataset

 If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl

-```
+```json
 {"prompt": "What is the capital of India?"}
 {"prompt": "What is the capital of Iran?"}
 {"prompt": "What is the capital of China?"}
 ```

 ```bash
 # start server
@@ -150,7 +151,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests

 ```bash
 # run benchmarking script
-python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
+vllm bench serve --port 9001 --save-result --save-detailed \
   --backend vllm \
   --model meta-llama/Llama-3.1-8B-Instruct \
   --endpoint /v1/completions \
@@ -166,7 +167,7 @@ python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detaile

 You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.

-**VisionArena Benchmark for Vision Language Models**
+### VisionArena Benchmark for Vision Language Models

 ```bash
 # need a model with vision capability here
@@ -174,7 +175,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
 ```

 ```bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -184,7 +185,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
   --num-prompts 1000
 ```

-**InstructCoder Benchmark with Speculative Decoding**
+### InstructCoder Benchmark with Speculative Decoding

 ``` bash
 VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
@@ -194,23 +195,23 @@ VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
 ```

 ``` bash
-python3 benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model meta-llama/Meta-Llama-3-8B-Instruct \
   --dataset-name hf \
   --dataset-path likaixin/InstructCoder \
   --num-prompts 2048
 ```

-**Other HuggingFaceDataset Examples**
+### Other HuggingFaceDataset Examples

 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
 ```

-**`lmms-lab/LLaVA-OneVision-Data`**
+`lmms-lab/LLaVA-OneVision-Data`:

 ```bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -221,10 +222,10 @@ python3 vllm/benchmarks/benchmark_serving.py \
   --num-prompts 10
 ```

-**`Aeala/ShareGPT_Vicuna_unfiltered`**
+`Aeala/ShareGPT_Vicuna_unfiltered`:

 ```bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -234,10 +235,10 @@ python3 vllm/benchmarks/benchmark_serving.py \
   --num-prompts 10
 ```

-**`AI-MO/aimo-validation-aime`**
+`AI-MO/aimo-validation-aime`:

 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model Qwen/QwQ-32B \
   --dataset-name hf \
   --dataset-path AI-MO/aimo-validation-aime \
@@ -245,23 +246,23 @@ python3 vllm/benchmarks/benchmark_serving.py \
   --seed 42
 ```

-**`philschmid/mt-bench`**
+`philschmid/mt-bench`:

 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model Qwen/QwQ-32B \
   --dataset-name hf \
   --dataset-path philschmid/mt-bench \
   --num-prompts 80
 ```

-**Running With Sampling Parameters**
+### Running With Sampling Parameters

 When using OpenAI-compatible backends such as `vllm`, optional sampling
 parameters can be specified. Example client command:

 ```bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --model NousResearch/Hermes-3-Llama-3.1-8B \
   --endpoint /v1/completions \
@@ -273,30 +274,34 @@ python3 vllm/benchmarks/benchmark_serving.py \
   --num-prompts 10
 ```

-**Running With Ramp-Up Request Rate**
+### Running With Ramp-Up Request Rate

 The benchmark tool also supports ramping up the request rate over the
 duration of the benchmark run. This can be useful for stress testing the
 server or finding the maximum throughput that it can handle, given some latency budget.

 Two ramp-up strategies are supported:
+
 - `linear`: Increases the request rate linearly from a start value to an end value.
 - `exponential`: Increases the request rate exponentially.

 The following arguments can be used to control the ramp-up:
+
 - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
 - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
 - `--ramp-up-end-rps`: The request rate at the end of the benchmark.

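For illustration, a serve run that ramps the request rate linearly could be launched roughly as follows. The ramp-up flags are the ones documented just above; the model, endpoint, and dataset arguments are simply reused from the earlier examples in this README rather than prescribed here:

```bash
# Sketch: ramp the request rate from 1 RPS to 10 RPS over the run.
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10 \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 10
```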
</details>
|
</details>
|
||||||
|
|
||||||
|
## 📈 Example - Offline Throughput Benchmark
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>📈 Example - Offline Throughput Benchmark</b></summary>
|
<summary>Show more</summary>
|
||||||
|
|
||||||
<br/>
|
<br/>
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 vllm/benchmarks/benchmark_throughput.py \
|
vllm bench throughput \
|
||||||
--model NousResearch/Hermes-3-Llama-3.1-8B \
|
--model NousResearch/Hermes-3-Llama-3.1-8B \
|
||||||
--dataset-name sonnet \
|
--dataset-name sonnet \
|
||||||
--dataset-path vllm/benchmarks/sonnet.txt \
|
--dataset-path vllm/benchmarks/sonnet.txt \
|
||||||
|
@ -305,16 +310,16 @@ python3 vllm/benchmarks/benchmark_throughput.py \
|
||||||
|
|
||||||
If successful, you will see the following output
|
If successful, you will see the following output
|
||||||
|
|
||||||
```
|
```text
|
||||||
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
|
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
|
||||||
Total num prompt tokens: 5014
|
Total num prompt tokens: 5014
|
||||||
Total num output tokens: 1500
|
Total num output tokens: 1500
|
||||||
```
|
```
|
||||||
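
The same metrics can be written to a file for later comparison; this sketch assumes the `--output-json` flag carries over from the legacy `benchmark_throughput.py` script, so check `vllm bench throughput --help` before relying on it.

```bash
# Save the throughput summary to JSON (flag assumed from the legacy script)
vllm bench throughput \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
--output-json sonnet_throughput.json
```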

### VisionArena Benchmark for Vision Language Models

```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@@ -325,18 +330,18 @@
```

The `num prompt tokens` now includes image token counts

```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527
Total num output tokens: 1280
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
@@ -349,18 +354,18 @@
"prompt_lookup_min": 2}'
```

```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136
Total num output tokens: 204800
```
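
The `"prompt_lookup_min": 2` fragment above belongs to a `--speculative-config` JSON value that the diff truncates. Written out in full only as an illustration, a hypothetical ngram-style configuration might look like the block below; the exact keys and values used in the original example are not shown here.

```bash
# Illustrative only: an ngram-style speculative decoding config.
# The original example's exact JSON is elided in the diff above.
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config '{"method": "ngram",
                       "num_speculative_tokens": 5,
                       "prompt_lookup_max": 5,
                       "prompt_lookup_min": 2}'
```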

### Other HuggingFaceDataset Examples

`lmms-lab/LLaVA-OneVision-Data`:

```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@@ -370,10 +375,10 @@
--num-prompts 10
```

`Aeala/ShareGPT_Vicuna_unfiltered`:

```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@@ -382,10 +387,10 @@
--num-prompts 10
```

`AI-MO/aimo-validation-aime`:

```bash
vllm bench throughput \
--model Qwen/QwQ-32B \
--backend vllm \
--dataset-name hf \
@@ -394,12 +399,12 @@
--num-prompts 10
```

Benchmark with LoRA adapters:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \
--model meta-llama/Llama-2-7b-hf \
--backend vllm \
--dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
@@ -413,20 +418,22 @@
```

</details>

## 🛠️ Example - Structured Output Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of structured output generation (JSON, grammar, regex).

### Server Setup

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```
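
Before launching any of the benchmark clients below, it can help to confirm the server is actually up. A quick check against the standard OpenAI-compatible endpoint (default port 8000) is:

```bash
# A JSON model list in the response means the server is ready to accept requests
curl http://localhost:8000/v1/models
```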

### JSON Schema Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@@ -438,7 +445,7 @@
--num-prompts 1000
```

### Grammar-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@@ -450,7 +457,7 @@
--num-prompts 1000
```

### Regex-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@@ -461,7 +468,7 @@
--num-prompts 1000
```

### Choice-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@@ -472,7 +479,7 @@
--num-prompts 1000
```

### XGrammar Benchmark Dataset

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@@ -485,14 +492,16 @@
```

</details>

## 📚 Example - Long Document QA Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of long document question-answering with prefix caching.

### Basic Long Document QA Test

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
@@ -504,7 +513,7 @@
--repeat-count 5
```

### Different Repeat Modes

```bash
# Random mode (default) - shuffle prompts randomly
@@ -537,14 +546,16 @@
```

</details>

## 🗂️ Example - Prefix Caching Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the efficiency of automatic prefix caching.

### Fixed Prompt with Prefix Caching

```bash
python3 benchmarks/benchmark_prefix_caching.py \
@@ -555,7 +566,7 @@
--input-length-range 128:256
```

### ShareGPT Dataset with Prefix Caching

```bash
# download dataset
@@ -572,14 +583,16 @@
```

</details>

## ⚡ Example - Request Prioritization Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of request prioritization in vLLM.

### Basic Prioritization Test

```bash
python3 benchmarks/benchmark_prioritization.py \
@@ -590,7 +603,7 @@
--scheduling-policy priority
```

### Multiple Sequences per Prompt

```bash
python3 benchmarks/benchmark_prioritization.py \
```

@@ -3,6 +3,7 @@
This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [How to Run](#how-to-run)
@@ -39,6 +40,7 @@ You must set the following variables at the top of the script before execution.
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MAX_MODEL_LEN` | **Required.** Max model length. | `4096` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
@@ -51,7 +53,7 @@ You must set the following variables at the top of the script before execution.
1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.

```bash
cd <FOLDER_OF_THIS_SCRIPT>
bash auto_tune.sh
```
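
A minimal `tmux` workflow for the long-running case (the session name is chosen arbitrarily here) looks like this:

```bash
# Start a detachable session so the tuning run survives SSH disconnects
tmux new-session -s auto_tune
cd <FOLDER_OF_THIS_SCRIPT>
bash auto_tune.sh
# Detach with Ctrl-b d; re-attach later with:
tmux attach -t auto_tune
```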

@@ -63,34 +65,40 @@ bash auto_tune.sh
Here are a few examples of how to configure the script for different goals:

### 1. Maximize Throughput (No Latency Constraint)

- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
```

### 2. Maximize Throughput with a Latency Requirement

- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=500
```

### 3. Maximize Throughput with Prefix Caching and Latency Requirements

- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=60
MAX_LATENCY_ALLOWED_MS=500
```

@@ -101,11 +109,11 @@ After the script finishes, you will find the results in a new, timestamped direc

- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
    - `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
    - `bm_log_...txt`: The log output from the `vllm bench serve` command for each benchmark run.

- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.

```text
# Example result.txt content
hash:a1b2c3d4...
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
```
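
Since the overall best parameters are appended as the final line of `result.txt`, a quick way to pull them out after a run (using the log directory layout shown above) is:

```bash
# Print the overall best parameter combination found by the sweep
tail -n 1 "$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/result.txt"
```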

@@ -1,16 +1,18 @@
#!/bin/bash

# This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# See details in README (benchmarks/auto_tune/README.md).

TAG=$(date +"%Y_%m_%d_%H_%M")
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BASE="$SCRIPT_DIR/../../.."
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""
INPUT_LEN=4000
OUTPUT_LEN=16
MAX_MODEL_LEN=4096
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000
NUM_SEQS_LIST="128 256"
@@ -36,6 +38,13 @@ current_hash=$(git rev-parse HEAD)
echo "hash:$current_hash" >> "$RESULT"
echo "current_hash: $current_hash"

TOTAL_LEN=$((INPUT_LEN + OUTPUT_LEN))
RED='\033[0;31m'
if (( TOTAL_LEN > MAX_MODEL_LEN )); then
    echo -e "${RED}FAILED: INPUT_LEN($INPUT_LEN) + OUTPUT_LEN($OUTPUT_LEN) = $TOTAL_LEN, which is > MAX_MODEL_LEN = $MAX_MODEL_LEN.\033[0m" >&2
    exit 1
fi

best_throughput=0
best_max_num_seqs=0
best_num_batched_tokens=0
@@ -47,7 +56,7 @@ start_server() {
local max_num_batched_tokens=$3
local vllm_log=$4
local profile_dir=$5

pkill -f vllm

VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
@@ -60,13 +69,13 @@ start_server() {
--enable-prefix-caching \
--load-format dummy \
--download-dir "$DOWNLOAD_DIR" \
--max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 &

# wait for 10 minutes...
server_started=0
for i in {1..60}; do
    RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
    STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
    if [[ "$STATUS_CODE" -eq 200 ]]; then
        server_started=1
        break
@@ -89,10 +98,10 @@ update_best_profile() {
selected_profile_file=
if [[ "$SYSTEM" == "TPU" ]]; then
    selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
fi
if [[ "$SYSTEM" == "GPU" ]]; then
    selected_profile_file="${sorted_paths[$profile_index]}"
fi
rm -f $PROFILE_PATH/*
cp $selected_profile_file $PROFILE_PATH
}
@@ -120,14 +129,14 @@ run_benchmark() {
    echo "server started."
fi
echo

echo "run benchmark test..."
meet_latency_requirement=0
# get a basic qps by using request-rate inf
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt"
prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
adjusted_input_len=$(( INPUT_LEN - prefix_len ))
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name random \
@@ -160,7 +169,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
curl -X POST http://0.0.0.0:8004/reset_prefix_cache
sleep 5
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name random \
@@ -245,4 +254,3 @@ done
echo "finish permutations"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"

@@ -11,6 +11,7 @@ from typing import Any, Optional

import numpy as np
from tqdm import tqdm
from typing_extensions import deprecated

import vllm.envs as envs
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
@@ -34,6 +35,10 @@ def save_to_pytorch_benchmark_format(
    write_to_json(pt_file, pt_records)


@deprecated(
    "benchmark_latency.py is deprecated and will be removed in a "
    "future version. Please use 'vllm bench latency' instead.",
)
def main(args: argparse.Namespace):
    print(args)
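
The deprecation notice points at the `vllm bench latency` CLI. A minimal invocation, with flag names assumed to mirror the old script's arguments, might look like this:

```bash
# Sketch of the replacement CLI; flags assumed to mirror benchmark_latency.py
vllm bench latency \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 256 \
--output-len 128 \
--batch-size 8 \
--num-iters 10
```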

@@ -38,6 +38,7 @@ from typing import Any, Literal, Optional
import numpy as np
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
from typing_extensions import deprecated

from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
@@ -395,20 +396,6 @@ async def benchmark(
        tasks.append(asyncio.create_task(task))
    outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)

    if pbar is not None:
        pbar.close()

@@ -426,6 +413,10 @@ async def benchmark(
    print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="="))
    print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
    if max_concurrency is not None:
        print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency))
    if request_rate != float("inf"):
        print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate))
    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
    print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
    print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
@@ -517,6 +508,20 @@ async def benchmark(
    print("=" * 50)

    if profile:
        print("Stopping profiler...")
        profile_input = RequestFuncInput(
            model=model_id,
            prompt=test_prompt,
            api_url=base_url + "/stop_profile",
            prompt_len=test_prompt_len,
            output_len=test_output_len,
            logprobs=logprobs,
        )
        profile_output = await request_func(request_func_input=profile_input)
        if profile_output.success:
            print("Profiler stopped")

    return result

@@ -593,6 +598,10 @@ def save_to_pytorch_benchmark_format(
    write_to_json(pt_file, pt_records)


@deprecated(
    "benchmark_serving.py is deprecated and will be removed in a future "
    "version. Please use 'vllm bench serve' instead.",
)
def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)
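
The two new summary lines are only printed when `--max-concurrency` is set and when a finite `--request-rate` is used, so a run like the following (illustrative model and values) would show both:

```bash
# Both flags below are echoed in the new "Serving Benchmark Result" header lines
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name random \
--num-prompts 200 \
--max-concurrency 64 \
--request-rate 10
```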

@@ -538,20 +538,6 @@ async def benchmark(
        )
    outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)

    if pbar is not None:
        pbar.close()

@@ -569,6 +555,10 @@ async def benchmark(
    print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="="))
    print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
    if max_concurrency is not None:
        print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency))
    if request_rate != float("inf"):
        print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate))
    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
    print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
    print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
@@ -666,6 +656,20 @@ async def benchmark(
    print("=" * 50)

    if profile:
        print("Stopping profiler...")
        profile_input = RequestFuncInput(
            model=model_id,
            prompt=test_request.prompt,
            api_url=base_url + "/stop_profile",
            prompt_len=test_request.prompt_len,
            output_len=test_request.expected_output_len,
            extra_body={test_request.structure_type: test_request.schema},
        )
        profile_output = await request_func(request_func_input=profile_input)
        if profile_output.success:
            print("Profiler stopped")

    return result, ret

@@ -15,6 +15,7 @@ import torch
import uvloop
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerBase
from typing_extensions import deprecated

from benchmark_dataset import (
    AIMODataset,
@@ -167,7 +168,8 @@ async def run_vllm_async(
    from vllm import SamplingParams

    async with build_async_engine_client_from_engine_args(
        engine_args,
        disable_frontend_multiprocessing=disable_frontend_multiprocessing,
    ) as llm:
        model_config = await llm.get_model_config()
        assert all(
@@ -381,6 +383,10 @@ def get_requests(args, tokenizer):
    return dataset_cls(**common_kwargs).sample(**sample_kwargs)


@deprecated(
    "benchmark_throughput.py is deprecated and will be removed in a "
    "future version. Please use 'vllm bench throughput' instead.",
)
def main(args: argparse.Namespace):
    if args.seed is None:
        args.seed = 0

@@ -3,7 +3,7 @@
# benchmark the overhead of disaggregated prefill.
# methodology:
# - send all request to prefill vLLM instance. It will buffer KV cache.
# - then send all request to decode instance.
# - The TTFT of decode instance is the overhead.

set -ex
@@ -12,6 +12,8 @@ kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
sleep 10

# remove vllm config file
@@ -61,7 +63,7 @@ benchmark() {
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &

CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
@@ -76,38 +78,38 @@ benchmark() {
wait_for_server 8200

# let the prefill instance finish prefill
vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \
--request-rate "inf"

# send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl.
vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps"
kill_gpu_processes

}

@@ -18,6 +18,8 @@ kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1
}
@@ -58,7 +60,7 @@ launch_chunked_prefill() {


launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
@@ -97,20 +99,20 @@ benchmark() {
output_len=$2
tag=$3

vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps"

sleep 2
}

@@ -5,9 +5,8 @@ import itertools

import torch

from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
    moe_align_block_size,
)
from vllm.triton_utils import triton

@@ -21,60 +20,6 @@ def get_topk_ids(num_tokens: int, num_experts: int, topk: int) -> torch.Tensor:
    )


# test configurations
num_tokens_range = [1, 16, 256, 4096]
num_experts_range = [16, 64, 224, 256, 280, 512]
@@ -87,8 +32,8 @@ configs = list(itertools.product(num_tokens_range, num_experts_range, topk_range
        x_names=["num_tokens", "num_experts", "topk"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["vllm"],
        line_names=["vLLM"],
        plot_name="moe-align-block-size-performance",
        args={},
    )
@@ -98,36 +43,11 @@ def benchmark(num_tokens, num_experts, topk, provider):
    block_size = 256
    topk_ids = get_topk_ids(num_tokens, num_experts, topk)

    quantiles = [0.5, 0.2, 0.8]

    if provider == "vllm":
        ms, min_ms, max_ms = triton.testing.do_bench(
            lambda: moe_align_block_size(topk_ids, block_size, num_experts),
            quantiles=quantiles,
        )

@@ -151,6 +71,4 @@ if __name__ == "__main__":
    )
    args = parser.parse_args()

    benchmark.run(print_data=True, show_plots=True)
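
To exercise the updated kernel path, the benchmark can be run directly. The file path below is an assumption (it follows the usual `benchmarks/kernels/` layout), and the flag names are inferred from the script's `args.num_experts` / `args.topk` usage:

```bash
# Path and flag names are assumptions based on the script's argparse usage
python3 benchmarks/kernels/benchmark_moe_align_block_size.py \
--num-experts 256 \
--topk 8
```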

@@ -8,12 +8,13 @@ import ray
import torch
from transformers import AutoConfig

from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
    _moe_permute,
    _moe_unpermute_and_reduce,
    moe_permute,
    moe_unpermute,
)
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
@@ -63,18 +64,19 @@ def benchmark_permute(

    def run():
        if use_customized_permute:
            (
                permuted_hidden_states,
                a1q_scale,
                first_token_off,
                inv_perm_idx,
                m_indices,
            ) = moe_permute(
                qhidden_states,
                a1q_scale=None,
                topk_ids=topk_ids,
                n_expert=num_experts,
                expert_map=None,
                align_block_size=align_block_size,
            )
        else:
            (
@@ -150,18 +152,19 @@ def benchmark_unpermute(

    def prepare():
        if use_customized_permute:
            (
                permuted_hidden_states,
                a1q_scale,
                first_token_off,
                inv_perm_idx,
                m_indices,
            ) = moe_permute(
                qhidden_states,
                a1q_scale=None,
                topk_ids=topk_ids,
                n_expert=num_experts,
                expert_map=None,
                align_block_size=align_block_size,
            )
            # convert to fp16/bf16 as gemm output
            return (
@@ -191,16 +194,19 @@ def benchmark_unpermute(

    def run(input: tuple):
        if use_customized_permute:
            (
                permuted_hidden_states,
                first_token_off,
                inv_perm_idx,
                m_indices,
            ) = input
            output = torch.empty_like(hidden_states)
            moe_unpermute(
                output,
                permuted_hidden_states,
                topk_weights,
                inv_perm_idx,
                first_token_off,
            )
        else:
            (
@@ -211,7 +217,11 @@ def benchmark_unpermute(
                inv_perm,
            ) = input
            _moe_unpermute_and_reduce(
                output_hidden_states,
                permuted_hidden_states,
                inv_perm,
                topk_weights,
                True,
            )

    # JIT compilation & warmup

@@ -0,0 +1,159 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
import math
from contextlib import contextmanager
from typing import Callable
from unittest.mock import patch

import torch

from vllm.model_executor.layers.quantization.utils import fp8_utils, int8_utils
from vllm.platforms import current_platform


@contextmanager
def _triton_mode():
    """Temporarily force the Triton fallback path"""
    with patch("vllm.platforms.current_platform.is_cuda", return_value=False):
        yield


def _time_cuda(
    fn: Callable[[], tuple[torch.Tensor, torch.Tensor]],
    warmup_iters: int,
    bench_iters: int,
) -> float:
    # warmup
    for _ in range(warmup_iters):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(bench_iters):
        fn()
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / bench_iters  # ms/iter


def _run_single(
    shape: tuple[int, int],
    group_size: int,
    dtype: str,
    *,
    column_major: bool = False,
    scale_ue8m0: bool = False,
    warmup_iters: int,
    bench_iters: int,
) -> None:
    num_tokens, hidden_dim = shape

    device = torch.device("cuda")
    torch.manual_seed(42)
    x = torch.randn(num_tokens, hidden_dim, device=device, dtype=torch.bfloat16) * 8

    if dtype == "fp8":

        def cuda_impl():
            return fp8_utils.per_token_group_quant_fp8(
                x,
                group_size,
                column_major_scales=column_major,
                use_ue8m0=scale_ue8m0,
            )

        def triton_impl():
            with _triton_mode():
                return fp8_utils.per_token_group_quant_fp8(
                    x,
                    group_size,
                    column_major_scales=column_major,
                    use_ue8m0=scale_ue8m0,
                )
    elif dtype == "int8":

        def cuda_impl():
            return int8_utils.per_token_group_quant_int8(x, group_size)

        def triton_impl():
            with _triton_mode():
                return int8_utils.per_token_group_quant_int8(x, group_size)
    else:
        raise ValueError("dtype must be 'fp8' or 'int8'")

    cuda_ms = _time_cuda(cuda_impl, warmup_iters, bench_iters)
    triton_ms = _time_cuda(triton_impl, warmup_iters, bench_iters)

    speedup = triton_ms / cuda_ms if cuda_ms else math.inf

    cfg_desc = (
        f"shape={shape} gs={group_size:<3} col_major={column_major:<5} "
        f"ue8m0={scale_ue8m0:<5} dtype={dtype}"
    )
    print(
        f"{cfg_desc:55} | CUDA {cuda_ms:7.3f} ms | Triton {triton_ms:7.3f} ms | "
        f"speed-up ×{speedup:5.2f}"
    )


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--warmup-iters", type=int, default=10)
    parser.add_argument("--bench-iters", type=int, default=100)
    parser.add_argument("--dtype", choices=["fp8", "int8", "both"], default="both")
    return parser.parse_args()


if __name__ == "__main__":
    if not current_platform.is_cuda():
        raise RuntimeError("CUDA device is required to run this benchmark.")

    args = parse_args()
    warmup_iters, bench_iters = args.warmup_iters, args.bench_iters

    shapes = [(32, 128), (64, 256), (16, 512)]
    group_sizes = [64, 128]

    dtypes = ["fp8", "int8"] if args.dtype == "both" else [args.dtype]

    header = (
        "Configuration".ljust(55)
        + " | "
        + "CUDA (ms)".center(12)
        + " | "
        + "Triton (ms)".center(13)
        + " | "
        + "Speed-up"
    )
    print(header)
    print("-" * len(header))

    for dtype in dtypes:
        for shape in shapes:
            for gs in group_sizes:
                if dtype == "fp8":
                    for col_major in (False, True):
                        for ue8m0 in (False, True):
                            _run_single(
                                shape,
                                gs,
                                dtype,
                                column_major=col_major,
                                scale_ue8m0=ue8m0,
                                warmup_iters=warmup_iters,
                                bench_iters=bench_iters,
                            )
                else:  # INT8 has no col-major / ue8m0 switches
                    _run_single(
                        shape,
                        gs,
                        dtype,
                        warmup_iters=warmup_iters,
                        bench_iters=bench_iters,
                    )
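
The script's own argparse options (`--warmup-iters`, `--bench-iters`, `--dtype`) are its whole interface, so a run that compares only the fp8 path could look like the following. The file path is an assumption, since the diff does not show where the new benchmark lives.

```bash
# Path is assumed; the flags come from the parse_args() definition shown above
python3 benchmarks/kernels/benchmark_per_token_group_quant.py \
--dtype fp8 \
--warmup-iters 10 \
--bench-iters 200
```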
@@ -71,22 +71,20 @@ def benchmark_decode(
     if kv_cache_dtype.startswith("fp8"):
         kv_cache, _ = to_float8(kv_cache)
 
+    output_trtllm = torch.empty(q.shape, dtype=dtype)
+
     # Benchmark TRT decode
     def trt_decode():
         return flashinfer.decode.trtllm_batch_decode_with_kv_cache(
             q,
             kv_cache,
             workspace_buffer,
-            num_qo_heads,
-            num_kv_heads,
-            sm_scale,
             block_tables,
             kv_lens_tensor,
-            page_size,
             max_kv_len,
-            kv_cache_dtype,
-            k_scale,
-            v_scale,
+            bmm1_scale=k_scale * sm_scale,
+            bmm2_scale=v_scale,
+            out=output_trtllm,
         )
 
     def time_fn(fn, warmup=10, trials=20):
@@ -125,6 +123,8 @@ def benchmark_decode(
     kv_indices = torch.tensor(kv_indices, dtype=torch.int32)
     kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32)
 
+    output_baseline = torch.empty(q.shape, dtype=dtype)
+
     wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
         workspace_buffer,
         kv_layout,
@@ -145,7 +145,7 @@ def benchmark_decode(
     )
 
     def baseline_decode():
-        return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale)
+        return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale, output_baseline)
 
     baseline_mean, baseline_std = time_fn(baseline_decode)
 
@@ -214,25 +214,39 @@ if __name__ == "__main__":
     max_seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
     all_results = []
 
-    print("Running benchmark for kv_cache_dtype: bfloat16")
     print(
-        "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent"
+        "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: bfloat16, "
+        "output_dtype: bfloat16"
+    )
+    print(
+        "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t"
+        "baseline_std\tspeedup_percent"
     )
     for max_seq_len in max_seq_lens:
         for bs in num_seqs:
             result = benchmark_decode(
-                bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="auto"
+                bs,
+                max_seq_len,
+                dtype=torch.bfloat16,
+                kv_cache_dtype="auto",
             )
             all_results.append(result)
 
-    print("Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8")
     print(
-        "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent"
+        "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8, "
+        "output_dtype: bfloat16"
+    )
+    print(
+        "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t"
+        "baseline_std\tspeedup_percent"
     )
     for max_seq_len in max_seq_lens:
         for bs in num_seqs:
             result = benchmark_decode(
-                bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="fp8"
+                bs,
+                max_seq_len,
+                dtype=torch.bfloat16,
+                kv_cache_dtype="fp8",
            )
             all_results.append(result)
 
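The updated `trt_decode` call folds the scales into the two batched GEMMs of decode attention: `bmm1_scale = k_scale * sm_scale` applies to the Q·Kᵀ product and `bmm2_scale = v_scale` to the P·V product, with the result written into the preallocated `out` tensor. A hedged PyTorch sketch of the equivalence (shapes and names here are illustrative, not the benchmark's code):

```python
import torch


def reference_decode(q, k, v, sm_scale, k_scale, v_scale):
    # bmm1: attention scores. Dequantizing K (x k_scale) and applying the
    # softmax scale (x sm_scale) can be fused into a single bmm1_scale.
    scores = (q @ k.transpose(-1, -2)) * (k_scale * sm_scale)
    probs = scores.softmax(dim=-1)
    # bmm2: weighted sum over V; dequantizing V is a single bmm2_scale.
    return (probs @ v) * v_scale


q = torch.randn(1, 8, 1, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = reference_decode(q, k, v, sm_scale=64 ** -0.5, k_scale=1.0, v_scale=1.0)
print(out.shape)  # torch.Size([1, 8, 1, 64])
```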
@@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs.
 
 You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory:
 
-```
+```bash
 git clone --recursive https://github.com/deepseek-ai/DeepGEMM
 cd DeepGEMM
 python setup.py install
@@ -17,7 +17,7 @@ uv pip install -e .
 
 ## Usage
 
-```
+```console
 python benchmark_fp8_block_dense_gemm.py
 INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda.
 ===== STARTING FP8 GEMM BENCHMARK =====
@@ -58,6 +58,22 @@ function (find_isa CPUINFO TARGET OUT)
   endif()
 endfunction()
 
+function(check_sysctl TARGET OUT)
+  execute_process(COMMAND sysctl -n "${TARGET}"
+                  RESULT_VARIABLE SYSCTL_RET
+                  OUTPUT_VARIABLE SYSCTL_INFO
+                  ERROR_QUIET
+                  OUTPUT_STRIP_TRAILING_WHITESPACE)
+  if(SYSCTL_RET EQUAL 0 AND
+     (SYSCTL_INFO STREQUAL "1" OR SYSCTL_INFO GREATER 0))
+    set(${OUT} ON PARENT_SCOPE)
+  else()
+    set(${OUT} OFF PARENT_SCOPE)
+  endif()
+endfunction()
+
+
 function (is_avx512_disabled OUT)
   set(DISABLE_AVX512 $ENV{VLLM_CPU_DISABLE_AVX512})
   if(DISABLE_AVX512 AND DISABLE_AVX512 STREQUAL "true")
@@ -70,7 +86,10 @@ endfunction()
 is_avx512_disabled(AVX512_DISABLED)
 
 if (MACOSX_FOUND AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
-  set(APPLE_SILICON_FOUND TRUE)
+  message(STATUS "Apple Silicon Detected")
+  set(ENABLE_NUMA OFF)
+  check_sysctl(hw.optional.neon ASIMD_FOUND)
+  check_sysctl(hw.optional.arm.FEAT_BF16 ARM_BF16_FOUND)
 else()
   find_isa(${CPUINFO} "avx2" AVX2_FOUND)
   find_isa(${CPUINFO} "avx512f" AVX512_FOUND)
@@ -82,7 +101,6 @@ else()
   find_isa(${CPUINFO} "S390" S390_FOUND)
 endif()
 
-
 if (AVX512_FOUND AND NOT AVX512_DISABLED)
   list(APPEND CXX_COMPILE_FLAGS
     "-mavx512f"
@@ -149,9 +167,6 @@ elseif (ASIMD_FOUND)
     set(MARCH_FLAGS "-march=armv8.2-a+dotprod+fp16")
   endif()
   list(APPEND CXX_COMPILE_FLAGS ${MARCH_FLAGS})
-elseif(APPLE_SILICON_FOUND)
-  message(STATUS "Apple Silicon Detected")
-  set(ENABLE_NUMA OFF)
 elseif (S390_FOUND)
   message(STATUS "S390 detected")
   # Check for S390 VXE support
@@ -24,7 +24,7 @@
 
 #include "attention_dtypes.h"
 #include "attention_utils.cuh"
-#include "cuda_compat.h"
+#include "../cuda_compat.h"
 
 #ifdef USE_ROCM
 #include <hip/hip_bf16.h>
@@ -16,9 +16,8 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
 
 #include "attention_kernels.cuh"
-#include "cuda_compat.h"
+#include "../cuda_compat.h"
 
 #define MAX(a, b) ((a) > (b) ? (a) : (b))
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
@@ -75,7 +74,7 @@ void paged_attention_v1_launcher(
   const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
   const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
 
-  constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
+  const int NUM_WARPS = NUM_THREADS / WARP_SIZE;
   int padded_max_seq_len =
       DIVIDE_ROUND_UP(max_seq_len, BLOCK_SIZE) * BLOCK_SIZE;
   int logits_size = padded_max_seq_len * sizeof(float);
@@ -16,9 +16,8 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
 
 #include "attention_kernels.cuh"
-#include "cuda_compat.h"
+#include "../cuda_compat.h"
 
 #define MAX(a, b) ((a) > (b) ? (a) : (b))
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
@@ -79,7 +78,7 @@ void paged_attention_v2_launcher(
   const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
   const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
 
-  constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
+  const int NUM_WARPS = NUM_THREADS / WARP_SIZE;
   int max_num_partitions = DIVIDE_ROUND_UP(max_seq_len, PARTITION_SIZE);
   int logits_size = PARTITION_SIZE * sizeof(float);
   int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float);
@@ -16,12 +16,14 @@ struct KernelVecType<float> {
   using cvt_vec_type = vec_op::FP32Vec16;
 };
 
+#if !defined(__aarch64__) || defined(ARM_BF16_SUPPORT)
 template <>
 struct KernelVecType<c10::BFloat16> {
   using load_vec_type = vec_op::BF16Vec16;
   using azp_adj_load_vec_type = vec_op::INT32Vec16;
   using cvt_vec_type = vec_op::FP32Vec16;
 };
+#endif
 
 template <>
 struct KernelVecType<c10::Half> {
@@ -151,7 +151,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
   ops.impl("rotary_embedding", torch::kCPU, &rotary_embedding);
 
   // Quantization
-#if defined(__AVX512F__) || defined(__aarch64__)
+#if defined(__AVX512F__) || (defined(__aarch64__) && !defined(__APPLE__))
   at::Tag stride_tag = at::Tag::needs_fixed_stride_order;
 
   // Compute int8 quantized tensor for given scaling factor.
@@ -4,8 +4,35 @@
   #include <hip/hip_runtime.h>
 #endif
 
-#if defined(USE_ROCM) && defined(__GFX9__)
-  #define WARP_SIZE 64
+#ifdef USE_ROCM
+struct Utils {
+  static __host__ int get_warp_size() {
+    static bool is_cached = false;
+    static int result;
+
+    if (!is_cached) {
+      int device_id;
+      cudaDeviceProp deviceProp;
+      cudaGetDevice(&device_id);
+      cudaGetDeviceProperties(&deviceProp, device_id);
+
+      result = deviceProp.warpSize;
+      is_cached = true;
+    }
+
+    return result;
+  }
+
+  static __device__ constexpr int get_warp_size() {
+  #ifdef __GFX9__
+    return 64;
+  #else
+    return 32;
+  #endif
+  }
+};
+
+#define WARP_SIZE Utils::get_warp_size()
 #else
   #define WARP_SIZE 32
 #endif
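The point of the `Utils::get_warp_size()` split is that on ROCm the warp (wavefront) width is a property of the detected GPU rather than of the build: 64 on gfx9/CDNA parts and 32 elsewhere. Host code therefore queries `warpSize` from the device properties at runtime (and caches it), while device code can still resolve it at compile time from `__GFX9__`; this is also why the paged-attention launchers earlier in this diff switch `NUM_WARPS` from `constexpr` to `const`. The same property is visible from Python, e.g.:

```python
import torch

# Warp (wavefront) width is a property of the GPU, not of the build:
# 64 on AMD gfx9/CDNA parts, 32 on RDNA and on NVIDIA GPUs.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, "warp size =", props.warp_size)
```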
@@ -10,32 +10,28 @@
 
 void moe_permute(
     const torch::Tensor& input,                 // [n_token, hidden]
-    const torch::Tensor& topk_weights,          //[n_token, topk]
-    torch::Tensor& topk_ids,                    // [n_token, topk]
+    const torch::Tensor& topk_ids,              // [n_token, topk]
     const torch::Tensor& token_expert_indices,  // [n_token, topk]
     const std::optional<torch::Tensor>& expert_map,  // [n_expert]
     int64_t n_expert, int64_t n_local_expert, int64_t topk,
     const std::optional<int64_t>& align_block_size,
-    torch::Tensor&
-        permuted_input,  // [topk * n_token/align_block_size_m, hidden]
+    torch::Tensor& permuted_input,              // [permuted_size, hidden]
     torch::Tensor& expert_first_token_offset,   // [n_local_expert + 1]
-    torch::Tensor& src_row_id2dst_row_id_map,   // [n_token, topk]
+    torch::Tensor& inv_permuted_idx,            // [n_token, topk]
+    torch::Tensor& permuted_idx,                // [permute_size]
     torch::Tensor& m_indices) {                 // [align_expand_m]
-  TORCH_CHECK(topk_weights.scalar_type() == at::ScalarType::Float,
-              "topk_weights must be float32");
   TORCH_CHECK(expert_first_token_offset.scalar_type() == at::ScalarType::Long,
               "expert_first_token_offset must be int64");
   TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
              "topk_ids must be int32");
   TORCH_CHECK(token_expert_indices.scalar_type() == at::ScalarType::Int,
               "token_expert_indices must be int32");
-  TORCH_CHECK(src_row_id2dst_row_id_map.scalar_type() == at::ScalarType::Int,
-              "src_row_id2dst_row_id_map must be int32");
+  TORCH_CHECK(inv_permuted_idx.scalar_type() == at::ScalarType::Int,
+              "inv_permuted_idx must be int32");
   TORCH_CHECK(expert_first_token_offset.size(0) == n_local_expert + 1,
               "expert_first_token_offset shape != n_local_expert+1")
-  TORCH_CHECK(
-      src_row_id2dst_row_id_map.sizes() == token_expert_indices.sizes(),
-      "token_expert_indices shape must be same as src_row_id2dst_row_id_map");
+  TORCH_CHECK(inv_permuted_idx.sizes() == token_expert_indices.sizes(),
+              "token_expert_indices shape must be same as inv_permuted_idx");
   auto n_token = input.sizes()[0];
   auto n_hidden = input.sizes()[1];
   auto align_block_size_value =
@@ -46,8 +42,9 @@ void moe_permute(
   auto sort_workspace = torch::empty(
       {sorter_size},
       torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
+  auto copy_topk_ids = topk_ids.clone();  // copy topk_ids for preprocess
   auto permuted_experts_id = torch::empty_like(topk_ids);
-  auto dst_row_id2src_row_id_map = torch::empty_like(src_row_id2dst_row_id_map);
+  auto sorted_row_idx = torch::empty_like(inv_permuted_idx);
   auto align_expert_first_token_offset =
       torch::zeros_like(expert_first_token_offset);
@@ -67,24 +64,22 @@ void moe_permute(
     const int* expert_map_ptr = get_ptr<int>(expert_map.value());
     valid_num_ptr =
         get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
-    preprocessTopkIdLauncher(get_ptr<int>(topk_ids), n_token * topk,
+    preprocessTopkIdLauncher(get_ptr<int>(copy_topk_ids), n_token * topk,
                              expert_map_ptr, n_expert, stream);
   }
   // expert sort topk expert id and scan expert id get expert_first_token_offset
-  sortAndScanExpert(get_ptr<int>(topk_ids), get_ptr<int>(token_expert_indices),
-                    get_ptr<int>(permuted_experts_id),
-                    get_ptr<int>(dst_row_id2src_row_id_map),
-                    get_ptr<int64_t>(expert_first_token_offset), n_token,
-                    n_expert, n_local_expert, topk, sorter,
-                    get_ptr<int>(sort_workspace), stream);
+  sortAndScanExpert(
+      get_ptr<int>(copy_topk_ids), get_ptr<int>(token_expert_indices),
+      get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
+      get_ptr<int64_t>(expert_first_token_offset), n_token, n_expert,
+      n_local_expert, topk, sorter, get_ptr<int>(sort_workspace), stream);
 
   // dispatch expandInputRowsKernelLauncher
   MOE_DISPATCH(input.scalar_type(), [&] {
     expandInputRowsKernelLauncher<scalar_t>(
         get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input),
-        get_ptr<float>(topk_weights), get_ptr<int>(permuted_experts_id),
-        get_ptr<int>(dst_row_id2src_row_id_map),
-        get_ptr<int>(src_row_id2dst_row_id_map),
+        get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
+        get_ptr<int>(inv_permuted_idx), get_ptr<int>(permuted_idx),
         get_ptr<int64_t>(expert_first_token_offset), n_token, valid_num_ptr,
         n_hidden, topk, n_local_expert, align_block_size_value, stream);
   });
@@ -101,32 +96,34 @@ void moe_permute(
 }
 
 void moe_unpermute(
     const torch::Tensor& permuted_hidden_states,  // [n_token * topk, hidden]
-    const torch::Tensor& topk_weights,            //[n_token, topk]
-    const torch::Tensor& topk_ids,                // [n_token, topk]
-    const torch::Tensor& src_row_id2dst_row_id_map,  // [n_token, topk]
-    const torch::Tensor& expert_first_token_offset,  // [n_local_expert+1]
-    int64_t n_expert, int64_t n_local_expert, int64_t topk,
+    const torch::Tensor& topk_weights,            // [n_token, topk]
+    const torch::Tensor& inv_permuted_idx,        // [n_token, topk]
+    const std::optional<torch::Tensor>&
+        expert_first_token_offset,                // [n_local_expert+1]
+    int64_t topk,
     torch::Tensor& hidden_states  // [n_token, hidden]
 ) {
-  TORCH_CHECK(src_row_id2dst_row_id_map.sizes() == topk_ids.sizes(),
-              "topk_ids shape must be same as src_row_id2dst_row_id_map");
-  TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
-              "topk_ids must be int32");
   TORCH_CHECK(
       permuted_hidden_states.scalar_type() == hidden_states.scalar_type(),
-      "topk_ids dtype must be same as src_row_id2dst_row_id_map");
+      "permuted_hidden_states dtype must be same as hidden_states");
   auto n_token = hidden_states.size(0);
   auto n_hidden = hidden_states.size(1);
   auto stream = at::cuda::getCurrentCUDAStream().stream();
-  const int64_t* valid_ptr =
-      get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
+
+  int64_t const* valid_ptr = nullptr;
+  if (expert_first_token_offset.has_value()) {
+    int n_local_expert = expert_first_token_offset.value().size(0) - 1;
+    valid_ptr =
+        get_ptr<int64_t>(expert_first_token_offset.value()) + n_local_expert;
+  }
+
   MOE_DISPATCH(hidden_states.scalar_type(), [&] {
     finalizeMoeRoutingKernelLauncher<scalar_t, scalar_t>(
         get_ptr<scalar_t>(permuted_hidden_states),
         get_ptr<scalar_t>(hidden_states), get_ptr<float>(topk_weights),
-        get_ptr<int>(src_row_id2dst_row_id_map), get_ptr<int>(topk_ids),
-        n_token, n_hidden, topk, valid_ptr, stream);
+        get_ptr<int>(inv_permuted_idx), n_token, n_hidden, topk, valid_ptr,
+        stream);
   });
 }
 
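The renamed outputs encode the permutation in both directions: `permuted_idx` maps a destination (expert-sorted) row back to its source row, while `inv_permuted_idx` maps a source row to where it landed, which is what `moe_unpermute` needs for the final gather-and-reduce. A small illustrative sketch of that relationship (plain PyTorch, not the kernel):

```python
import torch

# Toy routing: 4 tokens, topk = 1, one expert id per (token, k) slot.
topk_ids = torch.tensor([2, 0, 1, 0])

# Sort the expanded rows by expert so each expert's tokens are contiguous.
permuted_idx = torch.argsort(topk_ids, stable=True)   # dst row -> src row
inv_permuted_idx = torch.argsort(permuted_idx)        # src row -> dst row

print(permuted_idx.tolist())      # [1, 3, 0, 2]
print(inv_permuted_idx.tolist())  # [2, 0, 3, 1]
# Gathering with permuted_idx builds the permuted activations; gathering
# back through inv_permuted_idx (as moe_unpermute does) restores token order.
```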
@@ -177,7 +177,7 @@ __global__ void getMIndicesKernel(int64_t* expert_first_token_offset,
   int tidx = threadIdx.x;
   extern __shared__ int64_t smem_expert_first_token_offset[];
   for (int i = tidx; i <= num_local_expert; i += blockDim.x) {
-    smem_expert_first_token_offset[tidx] = __ldg(expert_first_token_offset + i);
+    smem_expert_first_token_offset[i] = __ldg(expert_first_token_offset + i);
   }
   __syncthreads();
   auto last_token_offset = smem_expert_first_token_offset[eidx + 1];
@@ -57,31 +57,19 @@ void sortAndScanExpert(int* expert_for_source_row, const int* source_rows,
 
 template <typename T>
 void expandInputRowsKernelLauncher(
-    T const* unpermuted_input, T* permuted_output,
-    const float* unpermuted_scales, int* sorted_experts,
+    T const* unpermuted_input, T* permuted_output, int* sorted_experts,
     int const* expanded_dest_row_to_expanded_source_row,
-    int* expanded_source_row_to_expanded_dest_row,
+    int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
     int64_t* expert_first_token_offset, int64_t const num_rows,
     int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
     int num_local_experts, const int& align_block_size, cudaStream_t stream);
 
-// Final kernel to unpermute and scale
-// This kernel unpermutes the original data, does the k-way reduction and
-// performs the final skip connection.
-template <typename T, typename OutputType, bool CHECK_SKIPPED>
-__global__ void finalizeMoeRoutingKernel(
-    T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
-    float const* scales, int const* expanded_source_row_to_expanded_dest_row,
-    int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
-    int64_t const* num_valid_ptr);
-
 template <class T, class OutputType>
 void finalizeMoeRoutingKernelLauncher(
     T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
     float const* scales, int const* expanded_source_row_to_expanded_dest_row,
-    int const* expert_for_source_row, int64_t const num_rows,
-    int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
-    cudaStream_t stream);
+    int64_t const num_rows, int64_t const cols, int64_t const k,
+    int64_t const* num_valid_ptr, cudaStream_t stream);
 
 void preprocessTopkIdLauncher(int* topk_id_ptr, int size,
                               const int* expert_map_ptr, int num_experts,
@@ -2,10 +2,9 @@
 
 template <typename T, bool CHECK_SKIPPED, bool ALIGN_BLOCK_SIZE>
 __global__ void expandInputRowsKernel(
-    T const* unpermuted_input, T* permuted_output,
-    const float* unpermuted_scales, int* sorted_experts,
+    T const* unpermuted_input, T* permuted_output, int* sorted_experts,
     int const* expanded_dest_row_to_expanded_source_row,
-    int* expanded_source_row_to_expanded_dest_row,
+    int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
     int64_t* expert_first_token_offset, int64_t const num_rows,
     int64_t const* num_dest_rows, int64_t const cols, int64_t k,
     int num_local_experts, int align_block_size) {
@@ -54,6 +53,10 @@ __global__ void expandInputRowsKernel(
     assert(expanded_dest_row <= INT32_MAX);
     expanded_source_row_to_expanded_dest_row[expanded_source_row] =
         static_cast<int>(expanded_dest_row);
+    // skip non local expert token
+    if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
+      permuted_idx[expanded_dest_row] = expanded_source_row;
+    }
   }
 
   if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
@@ -62,7 +65,7 @@ __global__ void expandInputRowsKernel(
     using DataElem = cutlass::Array<T, ELEM_PER_THREAD>;
 
     // Duplicate and permute rows
-    int64_t const source_row = expanded_source_row % num_rows;
+    int64_t const source_row = expanded_source_row / k;
 
     auto const* source_row_ptr =
         reinterpret_cast<DataElem const*>(unpermuted_input + source_row * cols);
@@ -82,10 +85,9 @@ __global__ void expandInputRowsKernel(
 
 template <typename T>
 void expandInputRowsKernelLauncher(
-    T const* unpermuted_input, T* permuted_output,
-    const float* unpermuted_scales, int* sorted_experts,
+    T const* unpermuted_input, T* permuted_output, int* sorted_experts,
     int const* expanded_dest_row_to_expanded_source_row,
-    int* expanded_source_row_to_expanded_dest_row,
+    int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
     int64_t* expert_first_token_offset, int64_t const num_rows,
     int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
     int num_local_experts, const int& align_block_size, cudaStream_t stream) {
@@ -105,11 +107,11 @@ void expandInputRowsKernelLauncher(
   int64_t smem_size = sizeof(int64_t) * (num_local_experts + 1);
 
   func<<<blocks, threads, smem_size, stream>>>(
-      unpermuted_input, permuted_output, unpermuted_scales, sorted_experts,
+      unpermuted_input, permuted_output, sorted_experts,
       expanded_dest_row_to_expanded_source_row,
-      expanded_source_row_to_expanded_dest_row, expert_first_token_offset,
-      num_rows, num_valid_tokens_ptr, cols, k, num_local_experts,
-      align_block_size);
+      expanded_source_row_to_expanded_dest_row, permuted_idx,
+      expert_first_token_offset, num_rows, num_valid_tokens_ptr, cols, k,
+      num_local_experts, align_block_size);
 }
 
 template <class T, class U>
@@ -128,11 +130,9 @@ template <typename T, typename OutputType, bool CHECK_SKIPPED>
 __global__ void finalizeMoeRoutingKernel(
     T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
     float const* scales, int const* expanded_source_row_to_expanded_dest_row,
-    int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
-    int64_t const* num_valid_ptr) {
+    int64_t const orig_cols, int64_t const k, int64_t const* num_valid_ptr) {
   assert(orig_cols % 4 == 0);
   int64_t const original_row = blockIdx.x;
-  int64_t const num_rows = gridDim.x;
   auto const offset = original_row * orig_cols;
   OutputType* reduced_row_ptr = reduced_unpermuted_output + offset;
   int64_t const num_valid = *num_valid_ptr;
@@ -159,14 +159,13 @@ __global__ void finalizeMoeRoutingKernel(
     ComputeElem thread_output;
     thread_output.fill(0);
     for (int k_idx = 0; k_idx < k; ++k_idx) {
-      int64_t const expanded_original_row = original_row + k_idx * num_rows;
+      int64_t const expanded_original_row = original_row * k + k_idx;
      int64_t const expanded_permuted_row =
           expanded_source_row_to_expanded_dest_row[expanded_original_row];
 
       int64_t const k_offset = original_row * k + k_idx;
       float const row_scale = scales[k_offset];
 
-      // Check after row_rescale has accumulated
       if (CHECK_SKIPPED && expanded_permuted_row >= num_valid) {
         continue;
       }
@@ -189,9 +188,8 @@ template <class T, class OutputType>
 void finalizeMoeRoutingKernelLauncher(
     T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
     float const* scales, int const* expanded_source_row_to_expanded_dest_row,
-    int const* expert_for_source_row, int64_t const num_rows,
-    int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
-    cudaStream_t stream) {
+    int64_t const num_rows, int64_t const cols, int64_t const k,
+    int64_t const* num_valid_ptr, cudaStream_t stream) {
   int64_t const blocks = num_rows;
   int64_t const threads = 256;
   bool const check_finished = num_valid_ptr != nullptr;
@@ -201,6 +199,5 @@ void finalizeMoeRoutingKernelLauncher(
   auto* const kernel = func_map[check_finished];
   kernel<<<blocks, threads, 0, stream>>>(
       expanded_permuted_rows, reduced_unpermuted_output, scales,
-      expanded_source_row_to_expanded_dest_row, expert_for_source_row, cols, k,
-      num_valid_ptr);
+      expanded_source_row_to_expanded_dest_row, cols, k, num_valid_ptr);
 }
@@ -190,8 +190,8 @@ __launch_bounds__(TPB) __global__ void moeTopK(
   2) This implementation assumes k is small, but will work for any k.
 */
 
-template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, typename IndType>
-__launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
+template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename IndType>
+__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
     void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices,
         int* source_rows, const int k, const int start_expert, const int end_expert)
 {
@@ -209,12 +209,12 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
 
     // Restrictions based on previous section.
     static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg");
-    static_assert(WARP_SIZE % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
+    static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
     static_assert(THREADS_PER_ROW == (THREADS_PER_ROW & -THREADS_PER_ROW), "THREADS_PER_ROW must be power of 2");
-    static_assert(THREADS_PER_ROW <= WARP_SIZE, "THREADS_PER_ROW can be at most warp size");
+    static_assert(THREADS_PER_ROW <= WARP_SIZE_PARAM, "THREADS_PER_ROW can be at most warp size");
 
     // We have NUM_EXPERTS elements per row. We specialize for small #experts
-    static constexpr int ELTS_PER_WARP = WARP_SIZE * VPT;
+    static constexpr int ELTS_PER_WARP = WARP_SIZE_PARAM * VPT;
     static constexpr int ROWS_PER_WARP = ELTS_PER_WARP / ELTS_PER_ROW;
     static constexpr int ROWS_PER_CTA = WARPS_PER_CTA * ROWS_PER_WARP;
 
@@ -393,41 +393,51 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
 namespace detail
 {
 // Constructs some constants needed to partition the work across threads at compile time.
-template <int EXPERTS, int BYTES_PER_LDG>
+template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM>
 struct TopkConstants
 {
     static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float);
-    static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE) == 0, "");
-    static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE));
+    static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, "");
+    static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM));
    static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
    static constexpr int THREADS_PER_ROW = EXPERTS / VPT;
-    static constexpr int ROWS_PER_WARP = WARP_SIZE / THREADS_PER_ROW;
+    static const int ROWS_PER_WARP = WARP_SIZE_PARAM / THREADS_PER_ROW;
 };
 } // namespace detail
 
-template <int EXPERTS, int WARPS_PER_TB, typename IndType>
+template <int EXPERTS, int WARPS_PER_TB, int WARP_SIZE_PARAM, typename IndType>
 void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices,
     int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream)
 {
     static constexpr std::size_t MAX_BYTES_PER_LDG = 16;
 
     static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS);
-    using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG>;
+    using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG, WARP_SIZE_PARAM>;
     static constexpr int VPT = Constants::VPT;
     static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP;
     const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP;
     const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB;
 
-    dim3 block_dim(WARP_SIZE, WARPS_PER_TB);
-    topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG><<<num_blocks, block_dim, 0, stream>>>(
+    dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB);
+    topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG, WARP_SIZE_PARAM><<<num_blocks, block_dim, 0, stream>>>(
         input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert);
 }
 
 #define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \
-    topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB>( \
-        gating_output, nullptr, topk_weights, topk_indices, \
-        token_expert_indices, num_tokens, topk, 0, num_experts, \
-        stream);
+    switch (warpSize) { \
+        case 32: \
+            topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 32>( \
+                gating_output, nullptr, topk_weights, topk_indices, \
+                token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
+            break; \
+        case 64: \
+            topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 64>( \
+                gating_output, nullptr, topk_weights, topk_indices, \
+                token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
+            break; \
+        default: \
+            TORCH_CHECK(false, "Unsupported warp size: ", warpSize); \
+    }
 
 template <typename IndType>
 void topkGatingSoftmaxKernelLauncher(
@@ -441,6 +451,7 @@ void topkGatingSoftmaxKernelLauncher(
     const int topk,
     cudaStream_t stream) {
   static constexpr int WARPS_PER_TB = 4;
+  auto warpSize = WARP_SIZE;
   switch (num_experts) {
     case 1:
       LAUNCH_SOFTMAX(1, WARPS_PER_TB);
@@ -56,18 +56,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
       " -> Tensor");
 
   m.def(
-      "moe_permute(Tensor input, Tensor topk_weight, Tensor! topk_ids,"
+      "moe_permute(Tensor input, Tensor topk_ids,"
       "Tensor token_expert_indices, Tensor? expert_map, int n_expert,"
       "int n_local_expert,"
       "int topk, int? align_block_size,Tensor! permuted_input, Tensor! "
-      "expert_first_token_offset, Tensor! src_row_id2dst_row_id_map, Tensor! "
-      "m_indices)->()");
+      "expert_first_token_offset, Tensor! inv_permuted_idx, Tensor! "
+      "permuted_idx, Tensor! m_indices)->()");
 
   m.def(
       "moe_unpermute(Tensor permuted_hidden_states, Tensor topk_weights,"
-      "Tensor topk_ids,Tensor src_row_id2dst_row_id_map, Tensor "
-      "expert_first_token_offset, int n_expert, int n_local_expert,int "
-      "topk, Tensor! hidden_states)->()");
+      "Tensor inv_permuted_idx, Tensor? expert_first_token_offset, "
+      "int topk, Tensor! hidden_states)->()");
 
   m.def("moe_permute_unpermute_supported() -> bool");
   m.impl("moe_permute_unpermute_supported", &moe_permute_unpermute_supported);
@@ -292,6 +292,11 @@ void per_token_group_quant_fp8(const torch::Tensor& input,
                                torch::Tensor& output_q, torch::Tensor& output_s,
                                int64_t group_size, double eps, double fp8_min,
                                double fp8_max, bool scale_ue8m0);
+
+void per_token_group_quant_int8(const torch::Tensor& input,
+                                torch::Tensor& output_q,
+                                torch::Tensor& output_s, int64_t group_size,
+                                double eps, double int8_min, double int8_max);
 #endif
 
 void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input,
@@ -4,7 +4,7 @@
 
 #include <cmath>
 #include "core/math.hpp"
-#include "cuda_compat.h"
+#include "../cuda_compat.h"
 #include "dispatch_utils.h"
 
 #include "quantization/fp8/common.cuh"
@@ -1,6 +1,10 @@
 #include <ATen/cuda/CUDAContext.h>
 #include <torch/all.h>
 
+#ifndef USE_ROCM
+#include "../per_token_group_quant_8bit.h"
+#endif
+
 #include <cmath>
 
 #include "../../dispatch_utils.h"
@@ -336,3 +340,13 @@ void dynamic_scaled_int8_quant(
       }
     });
 }
+
+#ifndef USE_ROCM
+void per_token_group_quant_int8(const torch::Tensor& input,
+                                torch::Tensor& output_q,
+                                torch::Tensor& output_s, int64_t group_size,
+                                double eps, double int8_min, double int8_max) {
+  per_token_group_quant_8bit(input, output_q, output_s, group_size, eps,
+                             int8_min, int8_max);
+}
+#endif
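The new `per_token_group_quant_int8` entry point simply forwards to the shared `per_token_group_quant_8bit` helper with int8 bounds. As a rough reference for what the op computes, one scale per contiguous group of `group_size` elements along the last dimension, here is a hedged PyTorch sketch (the kernel's exact rounding and eps handling may differ):

```python
import torch


def per_token_group_quant_int8_ref(x: torch.Tensor, group_size: int,
                                   eps: float = 1e-10,
                                   int8_min: float = -128.0,
                                   int8_max: float = 127.0):
    """Pure-PyTorch reference mirroring the op's contract, not its kernel code."""
    assert x.shape[-1] % group_size == 0
    g = x.reshape(*x.shape[:-1], -1, group_size).float()
    # Scale so the largest magnitude in each group maps to int8_max.
    scales = g.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / int8_max
    q = (g / scales).round().clamp(int8_min, int8_max).to(torch.int8)
    return q.reshape(x.shape), scales.squeeze(-1)


x = torch.randn(4, 256)
q, s = per_token_group_quant_int8_ref(x, group_size=64)
print(q.shape, s.shape)  # torch.Size([4, 256]) torch.Size([4, 4])
```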
@@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B
 ```
 
 Epilogue parameters:
+
 - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
 - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
 
@@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector.
 Epilogue parameters:
 
 - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
   - Generally this will be per-tensor as the zero-points are per-tensor.
 - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
 - `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector).
 - `bias` is the bias, is always per-channel (row-vector).
@@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product
 Epilogue parameters:
 
 - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
   - Generally this will be per-token as the zero-points are per-token.
 - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
 - `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector).
 - `azp` is the zero-point (`z_a`), is per-token (column-vector).
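For the asymmetric (zero-point) cases documented above, the useful identity is that subtracting the zero point before the GEMM equals subtracting a precomputed adjustment after it: $` (\widehat A - z_a J_a) \widehat B = \widehat A \widehat B - z_a (\mathbf 1 \widehat B) `$. A small PyTorch check of that identity for the per-token case (illustrative only, not the CUTLASS epilogue code):

```python
import torch

# Names mirror the doc above: A_hat / B_hat are quantized operands, azp is z_a.
M, N, K = 4, 8, 16
A_hat = torch.randint(0, 255, (M, K), dtype=torch.int32)     # unsigned quantized activations
B_hat = torch.randint(-128, 127, (K, N), dtype=torch.int32)  # signed quantized weights
scale_a, scale_b = 0.02, 0.01
azp = torch.randint(0, 255, (M, 1), dtype=torch.int32)       # per-token zero points z_a

# Direct form: subtract the zero point from A before the GEMM.
direct = scale_a * scale_b * ((A_hat - azp) @ B_hat).float()

# Epilogue form: GEMM on A_hat, then subtract the precomputed adjustment
# azp_adj = 1^T * B_hat (a row vector), scaled per token by azp.
azp_adj = B_hat.sum(dim=0, keepdim=True)                     # [1, N]
epilogue = scale_a * scale_b * (A_hat @ B_hat - azp * azp_adj).float()

print(torch.allclose(direct, epilogue))  # True
```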
@@ -1,6 +1,5 @@
 #include "scaled_mm_kernels.hpp"
 #include "scaled_mm_sm90_fp8_dispatch.cuh"
-#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
 
 namespace vllm {
 
@@ -13,11 +12,11 @@ void cutlass_scaled_mm_sm90_fp8(torch::Tensor& out, torch::Tensor const& a,
   if (bias) {
     TORCH_CHECK(bias->dtype() == out.dtype(),
                 "currently bias dtype must match output dtype ", out.dtype());
-    return cutlass_scaled_mm_sm90_fp8_epilogue<c3x::ScaledEpilogueBias>(
-        out, a, b, a_scales, b_scales, *bias);
+    return cutlass_scaled_mm_sm90_fp8_epilogue<true>(out, a, b, a_scales,
+                                                     b_scales, *bias);
   } else {
-    return cutlass_scaled_mm_sm90_fp8_epilogue<c3x::ScaledEpilogue>(
-        out, a, b, a_scales, b_scales);
+    return cutlass_scaled_mm_sm90_fp8_epilogue<false>(out, a, b, a_scales,
+                                                      b_scales);
   }
 }
 
@ -2,6 +2,7 @@
|
||||||
|
|
||||||
#include "scaled_mm.cuh"
|
#include "scaled_mm.cuh"
|
||||||
#include "cutlass_gemm_caller.cuh"
|
#include "cutlass_gemm_caller.cuh"
|
||||||
|
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* This file defines Gemm kernel configurations for SM90 (fp8) based on the Gemm
|
* This file defines Gemm kernel configurations for SM90 (fp8) based on the Gemm
|
||||||
|
@ -12,8 +13,91 @@ namespace vllm {
|
||||||
|
|
||||||
using c3x::cutlass_gemm_caller;
|
using c3x::cutlass_gemm_caller;
|
||||||
|
|
||||||
template <typename InType, typename OutType,
|
template <typename ElementAB_, typename ElementD_,
|
||||||
template <typename, typename, typename> typename Epilogue>
|
template <typename, typename, typename> typename Epilogue_,
|
||||||
|
typename TileShape, typename ClusterShape, typename KernelSchedule,
|
||||||
|
typename EpilogueSchedule, bool swap_ab_ = false>
|
||||||
|
struct cutlass_3x_gemm_sm90_fp8 {
|
||||||
|
using ElementAB = ElementAB_;
|
||||||
|
using ElementC = ElementD_;
|
||||||
|
using ElementD = ElementD_;
|
||||||
|
using ElementAcc =
|
||||||
|
typename std::conditional<std::is_same_v<ElementAB, int8_t>, int32_t,
|
||||||
|
float>::type;
|
||||||
|
|
||||||
|
using Epilogue = Epilogue_<ElementAcc, ElementD, TileShape>;
|
||||||
|
|
||||||
|
using EVTCompute = typename Epilogue::EVTCompute;
|
||||||
|
|
||||||
|
static constexpr int AlignmentAB =
|
||||||
|
128 / cutlass::sizeof_bits<ElementAB>::value;
|
||||||
|
static constexpr int AlignmentCD =
|
||||||
|
128 / cutlass::sizeof_bits<ElementD>::value;
|
||||||
|
|
||||||
|
// Compile-time swap_ab flag
|
||||||
|
static constexpr bool swap_ab = swap_ab_;
|
||||||
|
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
// Layout definitions
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
using LayoutA = cutlass::layout::RowMajor;
|
||||||
|
using LayoutA_T = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
|
||||||
|
|
||||||
|
using LayoutB = cutlass::layout::ColumnMajor;
|
||||||
|
using LayoutB_T = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
|
||||||
|
|
||||||
|
using LayoutD = cutlass::layout::RowMajor;
|
||||||
|
using LayoutD_Transpose =
|
||||||
|
typename cutlass::layout::LayoutTranspose<LayoutD>::type;
|
||||||
|
|
||||||
|
using LayoutC = LayoutD;
|
||||||
|
using LayoutC_Transpose = LayoutD_Transpose;
|
||||||
|
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
// Collective epilogue (conditionally swap operands and layouts)
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
using CollectiveEpilogue =
|
||||||
|
typename cutlass::epilogue::collective::CollectiveBuilder<
|
||||||
|
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, TileShape,
|
||||||
|
ClusterShape, cutlass::epilogue::collective::EpilogueTileAuto,
|
||||||
|
ElementAcc, float, ElementC,
|
||||||
|
conditional_t<swap_ab, LayoutC_Transpose, LayoutC>, AlignmentCD,
|
||||||
|
ElementD, conditional_t<swap_ab, LayoutD_Transpose, LayoutD>,
|
||||||
|
AlignmentCD, EpilogueSchedule, EVTCompute>::CollectiveOp;
|
||||||
|
|
||||||
|
static constexpr size_t CEStorageSize =
|
||||||
|
sizeof(typename CollectiveEpilogue::SharedStorage);
|
||||||
|
|
||||||
|
using Stages = typename cutlass::gemm::collective::StageCountAutoCarveout<
|
||||||
|
static_cast<int>(CEStorageSize)>;
|
||||||
|
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
// Collective mainloop (conditionally swap operands and layouts)
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
using CollectiveMainloop = conditional_t<
|
||||||
|
swap_ab,
|
||||||
|
typename cutlass::gemm::collective::CollectiveBuilder<
|
||||||
|
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, ElementAB,
|
||||||
|
LayoutB_T, AlignmentAB, // Swapped B (as A)
|
||||||
|
ElementAB, LayoutA_T, AlignmentAB, // Swapped A (as B)
|
||||||
|
ElementAcc, TileShape, ClusterShape, Stages,
|
||||||
|
KernelSchedule>::CollectiveOp,
|
||||||
|
typename cutlass::gemm::collective::CollectiveBuilder<
|
||||||
|
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, ElementAB,
|
||||||
|
LayoutA, AlignmentAB, ElementAB, LayoutB, AlignmentAB, ElementAcc,
|
||||||
|
TileShape, ClusterShape, Stages, KernelSchedule>::CollectiveOp>;
|
||||||
|
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
// Kernel definition
|
||||||
|
// -----------------------------------------------------------
|
||||||
|
using KernelType = enable_sm90_or_later<cutlass::gemm::kernel::GemmUniversal<
|
||||||
|
cute::Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue,
|
||||||
|
cutlass::gemm::PersistentScheduler>>;
|
||||||
|
|
||||||
|
struct GemmKernel : public KernelType {};
|
||||||
|
};
|
||||||
|
|
||||||
|
+template <typename InType, typename OutType, bool EnableBias>
 struct sm90_fp8_config_default {
   // M in (128, inf)
   static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
@@ -22,13 +106,17 @@ struct sm90_fp8_config_default {
   using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
   using TileShape = Shape<_128, _128, _128>;
   using ClusterShape = Shape<_2, _1, _1>;
-  using Cutlass3xGemm =
-      cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
-                      KernelSchedule, EpilogueSchedule>;
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule>>;
 };

-template <typename InType, typename OutType,
-          template <typename, typename, typename> typename Epilogue>
+template <typename InType, typename OutType, bool EnableBias>
 struct sm90_fp8_config_M128 {
   // M in (64, 128]
   static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
@@ -37,33 +125,146 @@ struct sm90_fp8_config_M128 {
   using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
   using TileShape = Shape<_64, _128, _128>;
   using ClusterShape = Shape<_2, _1, _1>;
-  using Cutlass3xGemm =
-      cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
-                      KernelSchedule, EpilogueSchedule>;
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule>>;
 };

-template <typename InType, typename OutType,
-          template <typename, typename, typename> typename Epilogue>
-struct sm90_fp8_config_M64 {
-  // M in [1, 64]
+template <typename InType, typename OutType, bool EnableBias>
+struct sm90_fp8_config_M64_N1280 {
+  // M in (16, 64], N in [1 1280]
   static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
-  using KernelSchedule =
-      cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum;
+  using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
   using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
-  using TileShape = Shape<_64, _64, _128>;
-  using ClusterShape = Shape<_1, _8, _1>;
+  using TileShape = Shape<_64, _16, _256>;
+  using ClusterShape = Shape<_1, _4, _1>;

-  using Cutlass3xGemm =
-      cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
-                      KernelSchedule, EpilogueSchedule>;
+  // enable swap AB for M < 64
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule, true>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule,
+                               true>>;
 };

-template <typename InType, typename OutType,
-          template <typename, typename, typename> typename Epilogue,
+template <typename InType, typename OutType, bool EnableBias>
+struct sm90_fp8_config_M64_N8192 {
+  // M in (16, 64], N > 1280
+  static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
+  using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
+  using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
+  using TileShape = Shape<_64, _64, _256>;
+  using ClusterShape = Shape<_1, _1, _1>;
+
+  // enable swap AB for M < 64
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule, true>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule,
+                               true>>;
+};
+
+template <typename InType, typename OutType, bool EnableBias>
+struct sm90_fp8_config_M16_N1280 {
+  // M in [1, 16], N in [1, 1280]
+  static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
+  using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
+  using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
+  using TileShape = Shape<_64, _16, _256>;
+  using ClusterShape = Shape<_1, _2, _1>;
+
+  // enable swap AB for M < 64
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule, true>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule,
+                               true>>;
+};
+
+template <typename InType, typename OutType, bool EnableBias>
+struct sm90_fp8_config_M16_N8192 {
+  // M in [1, 16], N > 1280
+  static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
+  using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
+  using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
+  using TileShape = Shape<_64, _16, _256>;
+  using ClusterShape = Shape<_1, _1, _1>;
+
+  // enable swap AB for M < 64
+  using Cutlass3xGemm = conditional_t<
+      EnableBias,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
+                               TileShape, ClusterShape, KernelSchedule,
+                               EpilogueSchedule, true>,
+      cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
+                               ClusterShape, KernelSchedule, EpilogueSchedule,
+                               true>>;
+};
+
+template <typename Gemm, typename... EpilogueArgs>
+void cutlass_gemm_caller_sm90_fp8(torch::Tensor& out, torch::Tensor const& a,
+                                  torch::Tensor const& b,
+                                  EpilogueArgs&&... epilogue_params) {
+  static constexpr bool swap_ab = Gemm::swap_ab;
+  using ElementAB = typename Gemm::ElementAB;
+  using ElementD = typename Gemm::ElementD;
+  using GemmKernel = typename Gemm::GemmKernel;
+
+  using StrideA = typename Gemm::GemmKernel::StrideA;
+  using StrideB = typename Gemm::GemmKernel::StrideB;
+  using StrideC = typename Gemm::GemmKernel::StrideC;
+
+  int32_t m = a.size(0), n = b.size(1), k = a.size(1);
+  auto prob_shape =
+      swap_ab ? cute::make_shape(n, m, k, 1) : cute::make_shape(m, n, k, 1);
+
+  StrideA a_stride =
+      cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(m, k, 1));
+  StrideB b_stride =
+      cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(n, k, 1));
+  StrideC c_stride = cutlass::make_cute_packed_stride(
+      StrideC{},
+      swap_ab ? cute::make_shape(n, m, 1) : cute::make_shape(m, n, 1));
+
+  auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
+  auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
+  auto c_ptr = static_cast<ElementD*>(out.data_ptr());
+
+  typename GemmKernel::MainloopArguments mainloop_args =
+      swap_ab ? typename GemmKernel::MainloopArguments{b_ptr, b_stride, a_ptr,
+                                                       a_stride}
+              : typename GemmKernel::MainloopArguments{a_ptr, a_stride, b_ptr,
+                                                       b_stride};
+
+  typename GemmKernel::EpilogueArguments epilogue_args{
+      Gemm::Epilogue::prepare_args(
+          std::forward<EpilogueArgs>(epilogue_params)...),
+      c_ptr, c_stride, c_ptr, c_stride};
+
+  c3x::cutlass_gemm_caller<GemmKernel>(a.device(), prob_shape, mainloop_args,
+                                       epilogue_args);
+}

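Editor's note: the caller above relies on the identity (A·B)ᵀ = Bᵀ·Aᵀ — with `swap_ab` it feeds B as the left operand, swaps the problem shape to (n, m, k), and writes the output through transposed strides, so the result read back row-major is unchanged. A minimal host-side sketch of that identity (illustrative only, not part of the diff; names are ours):

```cpp
// Illustrative check of the operand-swap identity used above:
// computing C^T = B^T * A^T instead of C = A * B gives the same values.
#include <cassert>
#include <vector>

int main() {
  const int M = 2, N = 3, K = 4;
  std::vector<float> A(M * K), B(K * N);
  for (int i = 0; i < M * K; ++i) A[i] = 0.5f * i;
  for (int i = 0; i < K * N; ++i) B[i] = 1.0f - 0.25f * i;

  // Direct: C[m][n] = sum_k A[m][k] * B[k][n]  (C is M x N, row-major)
  std::vector<float> C(M * N, 0.f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k) C[m * N + n] += A[m * K + k] * B[k * N + n];

  // Swapped: Ct[n][m] = sum_k B^T[n][k] * A^T[k][m]  (Ct is N x M)
  std::vector<float> Ct(N * M, 0.f);
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      for (int k = 0; k < K; ++k) Ct[n * M + m] += B[k * N + n] * A[m * K + k];

  // Reading Ct with transposed strides recovers C exactly.
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) assert(C[m * N + n] == Ct[n * M + m]);
  return 0;
}
```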
+template <typename InType, typename OutType, bool EnableBias,
           typename... EpilogueArgs>
 inline void cutlass_gemm_sm90_fp8_dispatch(torch::Tensor& out,
                                            torch::Tensor const& a,
                                            torch::Tensor const& b,
+                                           torch::Tensor const& a_scales,
+                                           torch::Tensor const& b_scales,
                                            EpilogueArgs&&... args) {
   static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
   TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
@@ -71,50 +272,75 @@ inline void cutlass_gemm_sm90_fp8_dispatch(torch::Tensor& out,

   using Cutlass3xGemmDefault =
       typename sm90_fp8_config_default<InType, OutType,
-                                       Epilogue>::Cutlass3xGemm;
-  using Cutlass3xGemmM64 =
-      typename sm90_fp8_config_M64<InType, OutType, Epilogue>::Cutlass3xGemm;
+                                       EnableBias>::Cutlass3xGemm;
   using Cutlass3xGemmM128 =
-      typename sm90_fp8_config_M128<InType, OutType, Epilogue>::Cutlass3xGemm;
+      typename sm90_fp8_config_M128<InType, OutType, EnableBias>::Cutlass3xGemm;
+
+  using Cutlass3xGemmM64_N1280 =
+      typename sm90_fp8_config_M64_N1280<InType, OutType,
+                                         EnableBias>::Cutlass3xGemm;
+  using Cutlass3xGemmM64_N8192 =
+      typename sm90_fp8_config_M64_N8192<InType, OutType,
+                                         EnableBias>::Cutlass3xGemm;
+  using Cutlass3xGemmM16_N1280 =
+      typename sm90_fp8_config_M16_N1280<InType, OutType,
+                                         EnableBias>::Cutlass3xGemm;
+  using Cutlass3xGemmM16_N8192 =
+      typename sm90_fp8_config_M16_N8192<InType, OutType,
+                                         EnableBias>::Cutlass3xGemm;

   uint32_t const m = a.size(0);
-  uint32_t const mp2 =
-      std::max(static_cast<uint32_t>(64), next_pow_2(m));  // next power of 2
+  uint32_t const n = b.size(1);

-  if (mp2 <= 64) {
-    // m in [1, 64]
-    return cutlass_gemm_caller<Cutlass3xGemmM64>(
-        out, a, b, std::forward<EpilogueArgs>(args)...);
-  } else if (mp2 <= 128) {
+  if (m <= 16) {
+    // m in [1, 16]
+    if (n <= 1280) {
+      return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM16_N1280>(
+          out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
+    }
+    return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM16_N8192>(
+        out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
+  } else if (m <= 64) {
+    // m in (16, 64]
+    if (n <= 1280) {
+      return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM64_N1280>(
+          out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
+    }
+    return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM64_N8192>(
+        out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
+  } else if (m <= 128) {
     // m in (64, 128]
-    return cutlass_gemm_caller<Cutlass3xGemmM128>(
-        out, a, b, std::forward<EpilogueArgs>(args)...);
+    return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM128>(
+        out, a, b, a_scales, b_scales, std::forward<EpilogueArgs>(args)...);
   } else {
     // m in (128, inf)
-    return cutlass_gemm_caller<Cutlass3xGemmDefault>(
-        out, a, b, std::forward<EpilogueArgs>(args)...);
+    return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmDefault>(
+        out, a, b, a_scales, b_scales, std::forward<EpilogueArgs>(args)...);
   }
 }

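Editor's note: restating the branch structure above as a standalone heuristic (our paraphrase; the names are illustrative, not from the diff): swap-AB configurations are selected for M ≤ 64, N only chooses between the narrow-N and wide-N tile/cluster shapes, and on the swap-AB paths the scale tensors are forwarded in swapped order.

```cpp
// Paraphrase of the shape heuristic in cutlass_gemm_sm90_fp8_dispatch above;
// purely illustrative, not part of the vLLM sources.
#include <cstdint>
#include <string>

std::string select_sm90_fp8_config(uint32_t m, uint32_t n) {
  if (m <= 16) return n <= 1280 ? "M16_N1280 (swap AB)" : "M16_N8192 (swap AB)";
  if (m <= 64) return n <= 1280 ? "M64_N1280 (swap AB)" : "M64_N8192 (swap AB)";
  if (m <= 128) return "M128";
  return "default";
}
```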
-template <template <typename, typename, typename> typename Epilogue,
-          typename... EpilogueArgs>
+template <bool EnableBias, typename... EpilogueArgs>
 void cutlass_scaled_mm_sm90_fp8_epilogue(torch::Tensor& out,
                                          torch::Tensor const& a,
                                          torch::Tensor const& b,
+                                         torch::Tensor const& a_scales,
+                                         torch::Tensor const& b_scales,
                                          EpilogueArgs&&... epilogue_args) {
   TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
   TORCH_CHECK(b.dtype() == torch::kFloat8_e4m3fn);

   if (out.dtype() == torch::kBFloat16) {
     return cutlass_gemm_sm90_fp8_dispatch<cutlass::float_e4m3_t,
-                                          cutlass::bfloat16_t, Epilogue>(
-        out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
+                                          cutlass::bfloat16_t, EnableBias>(
+        out, a, b, a_scales, b_scales,
+        std::forward<EpilogueArgs>(epilogue_args)...);
   } else {
     TORCH_CHECK(out.dtype() == torch::kFloat16);
     return cutlass_gemm_sm90_fp8_dispatch<cutlass::float_e4m3_t,
-                                          cutlass::half_t, Epilogue>(
-        out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
+                                          cutlass::half_t, EnableBias>(
+        out, a, b, a_scales, b_scales,
+        std::forward<EpilogueArgs>(epilogue_args)...);
   }
 }

 }  // namespace vllm

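Editor's note: a hedged sketch of how a call site might select the bias variant with the new bool template parameter (the wrapper, its argument plumbing, and the optional-bias handling are ours and are not shown in this diff):

```cpp
// Illustrative call site for the new entry point; argument names are assumed.
#include <optional>
#include <torch/all.h>

void scaled_mm_sm90_fp8(torch::Tensor& out, torch::Tensor const& a,
                        torch::Tensor const& b, torch::Tensor const& a_scales,
                        torch::Tensor const& b_scales,
                        std::optional<torch::Tensor> const& bias) {
  if (bias) {
    // Bias path: the bias tensor is the trailing epilogue argument.
    vllm::cutlass_scaled_mm_sm90_fp8_epilogue<true>(out, a, b, a_scales,
                                                    b_scales, *bias);
  } else {
    vllm::cutlass_scaled_mm_sm90_fp8_epilogue<false>(out, a, b, a_scales,
                                                     b_scales);
  }
}
```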
@@ -47,13 +47,12 @@ __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids,

 __global__ void compute_expert_offsets(
     const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
-    int32_t* atomic_buffer, const int num_experts, const int topk_length) {
+    int32_t* atomic_buffer, const int num_experts, const bool swap_ab) {
   int32_t tot_offset = 0;
   expert_offsets[0] = 0;
   for (int i = 0; i < num_experts; ++i) {
     atomic_buffer[i] = tot_offset;
-    tot_offset += topk_length > SWAP_AB_THRESHOLD ? problem_sizes1[i * 3]
-                                                  : problem_sizes1[i * 3 + 1];
+    tot_offset += swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3];
     expert_offsets[i + 1] = tot_offset;
   }
 }
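Editor's note: the kernel above is just an exclusive prefix sum over the per-expert problem sizes, taking M or N from each (M, N, K) row depending on whether swap-AB is in effect. A host-side reference sketch (ours, for illustration only):

```cpp
#include <cstdint>
#include <vector>

// Host-side mirror of compute_expert_offsets: problem_sizes holds num_experts
// rows of (M, N, K); offsets[i] is where expert i's rows start in the output.
std::vector<int32_t> expert_offsets_ref(
    const std::vector<int32_t>& problem_sizes, int num_experts, bool swap_ab) {
  std::vector<int32_t> offsets(num_experts + 1, 0);
  for (int i = 0; i < num_experts; ++i) {
    int32_t rows = swap_ab ? problem_sizes[i * 3 + 1]  // N (A/B swapped)
                           : problem_sizes[i * 3];     // M
    offsets[i + 1] = offsets[i] + rows;
  }
  return offsets;
}
```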
@@ -61,15 +60,14 @@ __global__ void compute_expert_offsets(
 __global__ void compute_expert_blockscale_offsets(
     const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
     int32_t* blockscale_offsets, int32_t* atomic_buffer, const int num_experts,
-    const int topk_length) {
+    const bool swap_ab) {
   int32_t tot_offset = 0;
   int32_t tot_offset_round = 0;
   expert_offsets[0] = 0;
   blockscale_offsets[0] = 0;
   for (int i = 0; i < num_experts; ++i) {
-    int32_t cur_offset = topk_length > SWAP_AB_THRESHOLD
-                             ? problem_sizes1[i * 3]
-                             : problem_sizes1[i * 3 + 1];
+    int32_t cur_offset =
+        swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3];
     atomic_buffer[i] = tot_offset;
     tot_offset += cur_offset;
     expert_offsets[i + 1] = tot_offset;
@@ -119,15 +117,19 @@ void get_cutlass_moe_mm_data_caller(

   int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel());

-  if (topk_ids.numel() > SWAP_AB_THRESHOLD) {
-    compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
+  // Swap-AB should be disabled for FP4 path
+  bool may_swap_ab = (!blockscale_offsets.has_value()) &&
+                     (topk_ids.numel() <= SWAP_AB_THRESHOLD);
+
+  if (may_swap_ab) {
+    compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
         static_cast<const int32_t*>(topk_ids.data_ptr()),
         static_cast<int32_t*>(problem_sizes1.data_ptr()),
         static_cast<int32_t*>(problem_sizes2.data_ptr()),
         static_cast<int32_t*>(atomic_buffer.data_ptr()), topk_ids.numel(), n,
         k);
   } else {
-    compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
+    compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
         static_cast<const int32_t*>(topk_ids.data_ptr()),
         static_cast<int32_t*>(problem_sizes1.data_ptr()),
         static_cast<int32_t*>(problem_sizes2.data_ptr()),
@@ -136,18 +138,19 @@ void get_cutlass_moe_mm_data_caller(
   }

   if (blockscale_offsets.has_value()) {
+    // fp4 path
     compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>(
         static_cast<const int32_t*>(problem_sizes1.data_ptr()),
         static_cast<int32_t*>(expert_offsets.data_ptr()),
         static_cast<int32_t*>(blockscale_offsets.value().data_ptr()),
         static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts,
-        topk_ids.numel());
+        may_swap_ab);
   } else {
     compute_expert_offsets<<<1, 1, 0, stream>>>(
         static_cast<const int32_t*>(problem_sizes1.data_ptr()),
         static_cast<int32_t*>(expert_offsets.data_ptr()),
         static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts,
-        topk_ids.numel());
+        may_swap_ab);
   }
   compute_arg_sorts<<<num_experts, num_threads, 0, stream>>>(
       static_cast<const int32_t*>(topk_ids.data_ptr()),

@@ -1,10 +1,10 @@
 #include <ATen/cuda/CUDAContext.h>
-#include <c10/util/Float8_e4m3fn.h>
+
+#include "../per_token_group_quant_8bit.h"

 #include <cmath>

-#include <cuda_fp16.h>
-#include <cuda_bf16.h>
+#include <cuda_fp8.h>

 #include <torch/all.h>

@@ -120,7 +120,7 @@ void per_token_group_quant_8bit(const torch::Tensor& input,
                                 torch::Tensor& output_q,
                                 torch::Tensor& output_s, int64_t group_size,
                                 double eps, double min_8bit, double max_8bit,
-                                bool scale_ue8m0 = false) {
+                                bool scale_ue8m0) {
   TORCH_CHECK(input.is_contiguous());
   TORCH_CHECK(output_q.is_contiguous());

@@ -197,7 +197,9 @@ void per_token_group_quant_8bit(const torch::Tensor& input,
   VLLM_DISPATCH_FLOATING_TYPES(
       input.scalar_type(), "per_token_group_quant_8bit", ([&] {
         if (dst_type == at::ScalarType::Float8_e4m3fn) {
-          LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn);
+          LAUNCH_KERNEL(scalar_t, __nv_fp8_e4m3);
+        } else if (dst_type == at::ScalarType::Char) {
+          LAUNCH_KERNEL(scalar_t, int8_t);
         }
       }));

@@ -4,7 +4,7 @@
 #include <torch/all.h>
 #include <c10/cuda/CUDAGuard.h>

-#include "cuda_compat.h"
+#include "../../cuda_compat.h"
 #include "dispatch_utils.h"

 #include "ggml-common.h"

@@ -187,8 +187,12 @@ struct PrepackedLayoutBTemplate {
   CUTE_HOST_DEVICE static constexpr auto TVbNbKL_to_offset_copy(
       Shape_NKL shape_mkl) {
     auto layout = TVbNbKL_to_offset(shape_mkl);
-    return make_layout(coalesce(get<0>(layout)), get<1>(layout),
-                       get<2>(layout));
+    // for 4-bit elements, having >= 64 values per column
+    // allows TMA to load full 32-byte sectors
+    auto inner_layout =
+        make_layout(make_shape(_256{}, size<0>(layout) / _256{}));
+
+    return make_layout(inner_layout, get<1>(layout), get<2>(layout));
   }

   // ((BlockN, BlockK), (BlocksN, BlocksK), L) -> (storage_idx)
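Editor's note: spelling out the arithmetic behind the comment above (our numbers, assuming 4-bit packed elements): 64 values × 4 bits = 32 bytes, exactly one of the 32-byte sectors the comment mentions, and the fixed inner extent of 256 values spans 128 bytes, i.e. four aligned sectors.

```cpp
// Back-of-the-envelope check for the 4-bit layout comment above (illustrative).
static_assert(64 * 4 / 8 == 32,
              "64 four-bit values fill one 32-byte TMA sector");
static_assert(256 * 4 / 8 == 128,
              "the _256 inner extent spans four 32-byte sectors");
```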
@@ -0,0 +1,10 @@
+#pragma once
+#include <torch/all.h>
+
+// TODO(wentao): refactor the folder to 8bit, then includes fp8 and int8 folders
+// 8-bit per-token-group quantization helper used by both FP8 and INT8
+void per_token_group_quant_8bit(const torch::Tensor& input,
+                                torch::Tensor& output_q,
+                                torch::Tensor& output_s, int64_t group_size,
+                                double eps, double min_8bit, double max_8bit,
+                                bool scale_ue8m0 = false);
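Editor's note: a hedged sketch of what this helper computes, read off its parameters rather than the kernel source: each contiguous group of `group_size` values gets one scale, roughly `scale = max(|x|, eps) / max_8bit`, and the quantized value is `clamp(round(x / scale), min_8bit, max_8bit)`; `scale_ue8m0` appears to snap the scale to a power of two. The reference below is ours and stores the quantized values as floats purely for illustration.

```cpp
// Reference-style sketch of per-token-group 8-bit quantization (our
// interpretation of the declaration above; the CUDA kernel is authoritative).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void per_token_group_quant_ref(const std::vector<float>& input,
                               int64_t group_size, double eps, double min_8bit,
                               double max_8bit, bool scale_ue8m0,
                               std::vector<float>& output_q,
                               std::vector<float>& output_s) {
  const size_t num_groups = input.size() / group_size;
  output_q.resize(input.size());
  output_s.resize(num_groups);
  for (size_t g = 0; g < num_groups; ++g) {
    const float* x = input.data() + g * group_size;
    float amax = 0.f;
    for (int64_t i = 0; i < group_size; ++i)
      amax = std::max(amax, std::fabs(x[i]));
    // One scale per group, floored by eps to avoid division by zero.
    float scale = std::max<float>(amax, static_cast<float>(eps)) /
                  static_cast<float>(max_8bit);
    if (scale_ue8m0)  // assumed semantics: round the scale up to a power of 2
      scale = std::exp2(std::ceil(std::log2(scale)));
    output_s[g] = scale;
    for (int64_t i = 0; i < group_size; ++i)
      output_q[g * group_size + i] = std::clamp<float>(
          std::nearbyint(x[i] / scale), static_cast<float>(min_8bit),
          static_cast<float>(max_8bit));
  }
}
```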
@@ -19,7 +19,7 @@
 #include <c10/cuda/CUDAGuard.h>
 #include <hip/hip_fp8.h>
 #include <hip/hip_bf16.h>
-#include "cuda_compat.h"
+#include "../cuda_compat.h"

 #include <algorithm>
 #include "../attention/dtype_fp8.cuh"

@@ -9,7 +9,7 @@
 #include <stdexcept>
 #include <algorithm>

-#include "cuda_compat.h"
+#include "../cuda_compat.h"
 #include "dispatch_utils.h"
 #include "quantization/fp8/common.cuh"

@@ -624,6 +624,14 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
   ops.impl("per_token_group_fp8_quant", torch::kCUDA,
            &per_token_group_quant_fp8);

+  // Compute per-token-group INT8 quantized tensor and scaling factor.
+  ops.def(
+      "per_token_group_quant_int8(Tensor input, Tensor! output_q, Tensor! "
+      "output_s, int group_size, float eps, float int8_min, float int8_max) -> "
+      "()");
+  ops.impl("per_token_group_quant_int8", torch::kCUDA,
+           &per_token_group_quant_int8);
+
   // reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel
   ops.def(
       "rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, "
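Editor's note: for orientation, and inferred from the schema string rather than shown in this hunk: a torch custom-op schema's `int` and `float` arguments arrive as `int64_t` and `double` on the C++ side, so the bound implementation would look something like the declaration below (names assumed from the schema).

```cpp
// Signature implied by the "per_token_group_quant_int8(...)" schema above;
// `int` maps to int64_t and `float` to double in the C++ binding.
#include <torch/all.h>

void per_token_group_quant_int8(const torch::Tensor& input,
                                torch::Tensor& output_q,
                                torch::Tensor& output_s, int64_t group_size,
                                double eps, double int8_min, double int8_max);
```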
@@ -164,9 +164,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 # see https://github.com/pytorch/pytorch/pull/123243
 ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
 ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
-# Override the arch list for flash-attn to reduce the binary size
-ARG vllm_fa_cmake_gpu_arches='80-real;90-real'
-ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches}
 #################### BASE BUILD IMAGE ####################

 #################### WHEEL BUILD IMAGE ####################

@@ -209,16 +206,7 @@ ARG SCCACHE_REGION_NAME=us-west-2
 ARG SCCACHE_S3_NO_CREDENTIALS=0

 # Flag to control whether to use pre-built vLLM wheels
-ARG VLLM_USE_PRECOMPILED
-# TODO: in setup.py VLLM_USE_PRECOMPILED is sensitive to truthiness, it will take =0 as "true", this should be fixed
-ENV VLLM_USE_PRECOMPILED=""
-RUN if [ "${VLLM_USE_PRECOMPILED}" = "1" ]; then \
-        export VLLM_USE_PRECOMPILED=1 && \
-        echo "Using precompiled wheels"; \
-    else \
-        unset VLLM_USE_PRECOMPILED && \
-        echo "Leaving VLLM_USE_PRECOMPILED unset to build wheels from source"; \
-    fi
+ARG VLLM_USE_PRECOMPILED=""

 # if USE_SCCACHE is set, use sccache to speed up compilation
 RUN --mount=type=cache,target=/root/.cache/uv \
@@ -235,6 +223,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
         && export SCCACHE_IDLE_TIMEOUT=0 \
         && export CMAKE_BUILD_TYPE=Release \
+        && export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" \
+        && export VLLM_DOCKER_BUILD_CONTEXT=1 \
         && sccache --show-stats \
         && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
         && sccache --show-stats; \
@@ -248,9 +238,22 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
         mkdir -p .deps && \
+        export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" && \
+        export VLLM_DOCKER_BUILD_CONTEXT=1 && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

+# When using precompiled wheels, keep only the newest manylinux1 wheel and delete others
+RUN if [ "$VLLM_USE_PRECOMPILED" = "1" ]; then \
+        echo "Cleaning up extra wheels in dist/..." && \
+        # Identify the most recent manylinux1_x86_64 wheel
+        KEEP_WHEEL=$(ls -t dist/*manylinux1_x86_64.whl 2>/dev/null | head -n1) && \
+        if [ -n "$KEEP_WHEEL" ]; then \
+            echo "Keeping wheel: $KEEP_WHEEL"; \
+            find dist/ -type f -name "*.whl" ! -path "${KEEP_WHEEL}" -delete; \
+        fi; \
+    fi
+
 # Check the size of the wheel if RUN_WHEEL_CHECK is true
 COPY .buildkite/check-wheel-size.py check-wheel-size.py
 # sync the default value with .buildkite/check-wheel-size.py
@@ -276,10 +279,6 @@ ARG PYTORCH_CUDA_INDEX_BASE_URL
 ENV UV_HTTP_TIMEOUT=500
 ENV UV_INDEX_STRATEGY="unsafe-best-match"

-# Workaround for #17068
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
-
 COPY requirements/lint.txt requirements/lint.txt
 COPY requirements/test.txt requirements/test.txt
 COPY requirements/dev.txt requirements/dev.txt

@@ -371,6 +370,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
     fi

 # Install vllm wheel first, so that torch etc will be installed.
+# !bang
 RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
     --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system dist/*.whl --verbose \

|
||||||
|
|
||||||
# Install FlashInfer from source
|
# Install FlashInfer from source
|
||||||
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
|
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
|
||||||
ARG FLASHINFER_GIT_REF="v0.2.8rc1"
|
# Keep this in sync with https://github.com/vllm-project/vllm/blob/main/requirements/cuda.txt
|
||||||
|
# We use `--force-reinstall --no-deps` to avoid issues with the existing FlashInfer wheel.
|
||||||
|
ARG FLASHINFER_GIT_REF="v0.2.9rc2"
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
|
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
|
||||||
. /etc/environment
|
. /etc/environment
|
||||||
git clone --depth 1 --recursive --shallow-submodules \
|
git clone --depth 1 --recursive --shallow-submodules \
|
||||||
|
@@ -412,7 +414,7 @@ RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
     TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
         python3 -m flashinfer.aot
     TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
-        uv pip install --system --no-build-isolation .
+        uv pip install --system --no-build-isolation --force-reinstall --no-deps .
     popd
     rm -rf flashinfer
 BASH

@@ -452,10 +454,6 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
 ENV UV_HTTP_TIMEOUT=500
 ENV UV_INDEX_STRATEGY="unsafe-best-match"

-# Workaround for #17068
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
-
 # install development dependencies (for testing)
 RUN --mount=type=cache,target=/root/.cache/uv \
     CUDA_MAJOR="${CUDA_VERSION%%.*}"; \

@@ -1,62 +0,0 @@
-# This vLLM Dockerfile is used to construct an image that can build and run vLLM on ARM CPU platform.
-
-FROM ubuntu:22.04 AS cpu-test-arm
-
-ENV CCACHE_DIR=/root/.cache/ccache
-
-ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache
-
-RUN --mount=type=cache,target=/var/cache/apt \
-    apt-get update -y \
-    && apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
-    && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
-    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
-
-# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects.
-RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install py-cpuinfo  # Use this to gather CPU info and optimize based on ARM Neoverse cores
-
-# Set LD_PRELOAD for tcmalloc on ARM
-ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
-
-RUN echo 'ulimit -c 0' >> ~/.bashrc
-
-WORKDIR /workspace
-
-ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
-ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
-RUN --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,src=requirements/build.txt,target=requirements/build.txt \
-    pip install --upgrade pip && \
-    pip install -r requirements/build.txt
-
-FROM cpu-test-arm AS build
-
-WORKDIR /workspace/vllm
-
-RUN --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=bind,src=requirements/common.txt,target=requirements/common.txt \
-    --mount=type=bind,src=requirements/cpu.txt,target=requirements/cpu.txt \
-    pip install -v -r requirements/cpu.txt
-
-COPY . .
-ARG GIT_REPO_CHECK=0
-RUN --mount=type=bind,source=.git,target=.git \
-    if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
-
-# Disabling AVX512 specific optimizations for ARM
-ARG VLLM_CPU_DISABLE_AVX512="true"
-ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512}
-
-RUN --mount=type=cache,target=/root/.cache/pip \
-    --mount=type=cache,target=/root/.cache/ccache \
-    --mount=type=bind,source=.git,target=.git \
-    VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \
-    pip install dist/*.whl && \
-    rm -rf dist
-
-WORKDIR /workspace/
-
-RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
-
-ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
@@ -1,4 +1,11 @@
-# This vLLM Dockerfile is used to construct image that can build and run vLLM on x86 CPU platform.
+# This vLLM Dockerfile is used to build images that can run vLLM on both x86_64 and arm64 CPU platforms.
+#
+# Supported platforms:
+#   - linux/amd64 (x86_64)
+#   - linux/arm64 (aarch64)
+#
+# Use the `--platform` option with `docker buildx build` to specify the target architecture, e.g.:
+#   docker buildx build --platform=linux/arm64 -f docker/Dockerfile.cpu .
 #
 # Build targets:
 #   vllm-openai (default): used for serving deployment
@@ -12,16 +19,14 @@
 # VLLM_CPU_AVX512VNNI=false (default)|true
 #

-######################### BASE IMAGE #########################
-FROM ubuntu:22.04 AS base
+######################### COMMON BASE IMAGE #########################
+FROM ubuntu:22.04 AS base-common

 WORKDIR /workspace/

 ARG PYTHON_VERSION=3.12
 ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"

-ENV LD_PRELOAD=""
-
 # Install minimal dependencies and uv
 RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
     --mount=type=cache,target=/var/lib/apt,sharing=locked \

|
||||||
uv pip install --upgrade pip && \
|
uv pip install --upgrade pip && \
|
||||||
uv pip install -r requirements/cpu.txt
|
uv pip install -r requirements/cpu.txt
|
||||||
|
|
||||||
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so:$LD_PRELOAD"
|
ARG TARGETARCH
|
||||||
|
ENV TARGETARCH=${TARGETARCH}
|
||||||
|
|
||||||
|
######################### x86_64 BASE IMAGE #########################
|
||||||
|
FROM base-common AS base-amd64
|
||||||
|
|
||||||
|
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so"
|
||||||
|
|
||||||
|
######################### arm64 BASE IMAGE #########################
|
||||||
|
FROM base-common AS base-arm64
|
||||||
|
|
||||||
|
ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
|
||||||
|
|
||||||
|
######################### BASE IMAGE #########################
|
||||||
|
FROM base-${TARGETARCH} AS base
|
||||||
|
|
||||||
RUN echo 'ulimit -c 0' >> ~/.bashrc
|
RUN echo 'ulimit -c 0' >> ~/.bashrc
|
||||||
|
|
||||||
|
|
|
@@ -114,9 +114,6 @@ RUN cat torch_build_versions.txt
 # explicitly set the list to avoid issues with torch 2.2
 # see https://github.com/pytorch/pytorch/pull/123243

-# Override the arch list for flash-attn to reduce the binary size
-ARG vllm_fa_cmake_gpu_arches='80-real;90-real'
-ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches}
 #################### BASE BUILD IMAGE ####################

 #################### WHEEL BUILD IMAGE ####################

@@ -1,4 +1,4 @@
-ARG NIGHTLY_DATE="20250714"
+ARG NIGHTLY_DATE="20250730"
 ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE"

 FROM $BASE_IMAGE

@@ -56,9 +56,7 @@ nav:
     - contributing/model/tests.md
     - contributing/model/multimodal.md
   - CI: contributing/ci
-  - Design Documents:
-    - V0: design
-    - V1: design/v1
+  - Design Documents: design
   - API Reference:
     - Summary: api/README.md
     - Contents:

(Binary image files changed: several documentation images were added and a number of existing images were moved or carried over unchanged; the diff viewer shows only their sizes.)