Compare commits

..

197 Commits

Author SHA1 Message Date
JohnJan 54f2b31184
[Doc] Add a doc for qwen omni (#1867)
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>

### What this PR does / why we need it?
Add FAQ note for qwen omni
Fixes issue 1 of https://github.com/vllm-project/vllm-ascend/issues/1760



- vLLM version: v0.9.2
- vLLM main:
b9a21e9173
2025-07-20 09:05:41 +08:00
wangxiyuan 2b726d8f90
[CI] Fix broken CI (#1889)
1. vLLM commit
45badd05d0
changed the pooling check logic, which broke vLLM Ascend.
2. vLLM commit
3e04107d97
requires a higher version of transformers. The transformers version bug
has been fixed by
e936e401de,
so it is now safe to remove the version limit.
3. vLLM commit
217937221b
added a new input `enable_eplb` for the FusedMoE ops.

This PR fixes the broken CI.


- vLLM version: v0.9.2
- vLLM main:
6a971ed692

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-20 02:11:57 +08:00
leo-pony 2ee90461d0
Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix e2e data parallel test: add resource release code and give engines more
time to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
xleoken b824525be3
Move deepseek_v3 from deepseek_v2.py (#1793)
### What this PR does / why we need it?
Before this patch, DeepSeek V3 was registered as
`vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM`, which is not a
friendly format.

```
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
```


### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.


- vLLM version: v0.9.2
- vLLM main:
bcdfb2a330

Signed-off-by: xleoken <xleoken@163.com>
2025-07-19 11:37:03 +08:00
Shanshan Shen ab68d31a24
[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
5895afd780

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-19 09:42:32 +08:00
lianyibo 53d2ea3789
[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811)
### What this PR does / why we need it?

maybe fixes
[#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test Qwen3-32B tp=4 with: 

```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests
```

Requests: batch_size=128, input/output tokens=1024

**In 0.9.2rc1**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 785.1395
Total TPS without prefill                   : 846.6809
Mean TPS with    prefill                    : 6.1339
Mean TPS without prefill                    : 6.6147
=====================================================
Mean TTFT(ms)                               : 10307.8123
Max  TTFT(ms)                               : 21423.0733
Min  TTFT(ms)                               : 362.3602
=====================================================
Mean TPOT(ms)                               : 151.3051
Max  TPOT(ms)                               : 159.4649
Min  TPOT(ms)                               : 140.899
=====================================================
Total Time(s)                               : 175.6032
Request Throughput(requests/s)              : 0.7289
=====================================================
```

**Apply this PR**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 811.0014
Total TPS without prefill                   : 876.4423
Mean TPS with    prefill                    : 6.3359
Mean TPS without prefill                    : 6.8472
=====================================================
Mean TTFT(ms)                               : 10263.8382
Max  TTFT(ms)                               : 21151.2547
Min  TTFT(ms)                               : 375.9136
=====================================================
Mean TPOT(ms)                               : 146.1686
Max  TPOT(ms)                               : 154.0957
Min  TPOT(ms)                               : 136.8879
=====================================================
Total Time(s)                               : 169.8579
Request Throughput(requests/s)              : 0.7536
=====================================================
```

The TPOT performance gap between these two sets of data is about 3%.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
2025-07-18 23:09:54 +08:00
Mengqing Cao 574fe407eb
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841)
### What this PR does / why we need it?
We'll refactor `CustomOp` in vllm-ascend starting from this PR.

Use the `CustomOp.register_oot` function to register the custom op, taking
`AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```

This is a quick adaptation to the `CustomOp.register_oot` mechanism from vLLM
0.9.2. As a further step, we can drop the inheritance from `QuickGELU` and
write our own `QuickGELU` entirely, as sketched below.
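
As an illustration only (assumed shape, not code from this PR), a fully
standalone `AscendQuickGELU` could look like the following once the
inheritance from vLLM's `QuickGELU` is dropped:

```python
import torch
from vllm.model_executor.custom_op import CustomOp


class AscendQuickGELU(CustomOp):
    """Standalone QuickGELU: x * sigmoid(1.702 * x)."""

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference implementation used when no out-of-tree backend is active.
        return x * torch.sigmoid(1.702 * x)

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # NPU path; a dedicated Ascend kernel could be called here instead.
        return x * torch.sigmoid(1.702 * x)
```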

Part of https://github.com/vllm-project/vllm-ascend/pull/1647



- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-18 23:07:14 +08:00
Shanshan Shen 8a91e6e59c
[Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
li chaoran 3e39d7234c
[CI] Switching to infra cache server to reduce network pressure (#1792)
### What this PR does / why we need it?
This PR introduce the infra cache server to speed up apt/pip package
installation

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Tested locally. With this config, the network bandwidth usage dropped from 100%
to 5% when a new PR was submitted.
<img width="807" height="334" alt="image"
src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c"
/>


- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-07-18 18:39:25 +08:00
Shanshan Shen d08ff304cd
[Misc][V0 Deprecation] Remove V0 Attention (#1835)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 14:10:13 +08:00
xudongLi-cmss 33ef5dc813
add unit test for func wrapper (#1863)
### What this PR does / why we need it?
Add unit tests for the func wrapper file.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
2025-07-18 11:05:17 +08:00
Li Wang f9dfde02fd
[Bugfix] Fix broken CI (#1848)
### What this PR does / why we need it?
- Fix broken commit by
[#20927](https://github.com/vllm-project/vllm/pull/20927)
- Fix broken commit by
[#20466](https://github.com/vllm-project/vllm/pull/20466)
- TODO: adapt more fully to the upstream refactoring; for now, let's just
make CI happy

- vLLM version: v0.9.2
- vLLM main:
11dfdf21bf

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-17 20:10:12 +08:00
Zhu Yi Lin 538dd357e6
Add graph mode and improve on multi_npu_moge.md (#1849)
### What this PR does / why we need it?
Add graph mode and improve on multi_npu_moge.md

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI passed with existing tests.


- vLLM version: v0.9.2
- vLLM main:
5a7fb3ab9e

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-07-17 17:53:37 +08:00
Shanshan Shen aeb5aa8b88
[Misc][V0 Deprecation] Add `__main__` guard to all offline examples (#1837)
### What this PR does / why we need it?
Add `__main__` guard to all offline examples.
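
For reference, the guard pattern added to the examples looks like this (a
minimal sketch; model name and prompt are placeholders, not taken from the
actual examples):

```python
from vllm import LLM, SamplingParams


def main():
    llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    # The guard prevents worker processes spawned by the V1 engine from
    # re-running the example body when they import this module.
    main()
```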

- vLLM version: v0.9.2
- vLLM main:
76b494444f

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 14:13:30 +08:00
Shanshan Shen 19e37cd379
[Misc] Add `fusion_result.json` to `.gitignore` (#1836)
### What this PR does / why we need it?
Add `fusion_result.json` to `.gitignore`.



- vLLM version: v0.9.2
- vLLM main:
72ad273582

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 11:54:49 +08:00
Icey 875a920d4a
[Platform] Add support for Atlas A3 series (#1794)
### What this PR does / why we need it?
Add support for Ascend A3 and remove latest tag

### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas A3 series

### How was this patch tested?
CI passed with:

- remove latest tag test:
https://github.com/wxsIcey/wxs-vllm-ascend/actions/runs/16267635040/job/45926924765
- E2E image build for A3
- CI test on A3 with e2e test and longterm test
- Unit tests are missing because real A3 hardware is needed for testing

Closes: https://github.com/vllm-project/vllm-ascend/issues/1696


- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-17 11:13:02 +08:00
wangxiyuan ef99fe1c54
[Test] Clean up duplicate test for ascend scheduler (#1819)
There are some duplicate tests for the Ascend scheduler. This PR removes them
to make the tests clearer.

After this PR, the single-card e2e time is reduced from 47 min to 46 min.

- vLLM version: v0.9.2
- vLLM main:
1eb2b9c102

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-16 17:57:48 +08:00
Shanshan Shen c66b0827a7
[Misc][V0 Deprecation] Remove Pooling Model Runner (#1824)
### What this PR does / why we need it?
Remove pooling model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
d31a647124

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 17:48:21 +08:00
Yikun Jiang ba7e934b21
Remove redundant empty lines in commit msg (#1814)
### What this PR does / why we need it?
Remove redundant empty lines in commit msg

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
test locally: https://github.com/Yikun/vllm-ascend/pull/48

- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-16 16:50:44 +08:00
Shanshan Shen 06655002c5
[Misc][V0 Deprecation] Remove V0 Worker (#1821)
### What this PR does / why we need it?
Remove V0 worker.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
6cbc4d4bea

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:07:17 +08:00
Shanshan Shen b005def0a5
[Misc][V0 Deprecation] Remove Multi-Step Model Runner (#1820)
### What this PR does / why we need it?
Remove multi-step model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.



- vLLM version: v0.9.2
- vLLM main:
34cda778a0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:06:49 +08:00
Shanshan Shen f9e2e9bb31
[Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen f96100fad5
[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
Shanshan Shen a929699e98
[Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
wangxiyuan bf2549856f
[CI] Fix changes CI to recover codecov (#1799)
Add a `checkout` action before `dorny/paths-filter` to make it work with the
`push` case.
This is a known issue: `dorny/paths-filter` works without `checkout` in the
`pull_request` case but fails in the `push` case. More detail is here:
https://github.com/dorny/paths-filter/issues/60#issuecomment-1464281021

The push CI works after this PR. The test result is here:

https://github.com/wangxiyuan/vllm-ascend/actions/runs/16285606468/job/45983607539
- vLLM version: v0.9.2
- vLLM main:
d4d309409f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 15:01:13 +08:00
wangxiyuan 787010a637
[Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, so there is no need to set it by hand now. This PR
removes the useless setting from examples and tests.

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
wangxiyuan eb921d2b6f
[Doc] Fix 404 error (#1797)
Fix URL 404 errors in the docs
- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:38 +08:00
wangxiyuan 7bdada58eb
[Misc] Remove VLLM_USE_V1 usage in code (#1764)
We plan to remove V0 code in this version. The first step is to delete
V0 usage.

Related: https://github.com/vllm-project/vllm-ascend/issues/1620

- vLLM version: v0.9.2
- vLLM main:
61e20828da

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:16 +08:00
wangxiyuan 494b0f474f
[CI]Fix broken CI (#1773)
This PR fixes the broken CI. It requires
https://github.com/vllm-project/vllm/pull/20900 to be merged first.

- vLLM version: v0.9.2
- vLLM main:
e8cc53af5e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 00:54:20 +08:00
Li Wang afcfe91dfa
[Doc] Fix multi node doc (#1783)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Pin docker image to latest release
### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
1e9438e0b0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-14 17:56:57 +08:00
zhangxinyuehfad cabfb2bc31
[Test] Resolve vllm-ascend version accuracy test (#1769)
### What this PR does / why we need it?
Resolve vllm-ascend version for accuracy test

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-14 15:43:37 +08:00
Shanshan Shen d3c6dd985a
[Misc] Add `include` dir to `.gitignore` (#1771)
### What this PR does / why we need it?
Add `include` dir to `.gitignore`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-14 12:05:29 +08:00
Li Wang 9cd4ac76a1
[CI] Remove benchmark patch and increase the scheduler frequency (#1762)
### What this PR does / why we need it?
This PR does the following:
1. Remove the `benchmark_datasets.py` patch
2. Increase the scheduler frequency to 2 times per day; due to the
recent large number of daily submissions, we need to increase the
default test time (6h)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
247102f07f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-13 20:00:35 +08:00
Yikun Jiang d118bf8a26
Update README.zh.md to fix typo (#1758)
### What this PR does / why we need it?


Update README.zh.md to fix typo

### Does this PR introduce _any_ user-facing change?


No

### How was this patch tested?


CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 14:01:34 +08:00
Yikun Jiang eff4b5791c
Recover offline_inference_npu.py to make doctest passed (#1756)
### What this PR does / why we need it?
Rename offline_inference_npu_v1.py to offline_inference_npu.py to
recover doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
a8593237c0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:36:35 +08:00
Yikun Jiang 8b3a483269
Add recommend version and refresh readme / contribution.md (#1757)
### What this PR does / why we need it?
Add recommended version and refresh contribution.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
890323dc1b

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:35:40 +08:00
wangxiyuan 3c404de1b1
[Release]Update release note (#1753)
There are still issues with PP in some cases, such as aclgraph and ray. Remove
the related doc from the release note.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:58:26 +08:00
wangxiyuan b5b7e0ecc7
[Doc] Add qwen3 embedding 8b guide (#1734)
1. Add the tutorial for qwen3-embedding-8b
2. Remove VLLM_USE_V1=1 from the docs; it's no longer needed since 0.9.2


- vLLM version: v0.9.2
- vLLM main:
5923ab9524

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:40:17 +08:00
wangxiyuan 9c560b009a
[Release] Add 0.9.2rc1 release note (#1725)
Add release note for 0.9.2rc1, we'll release soon

- vLLM version: v0.9.2
- vLLM main:
7bd4c37ae7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:36:05 +08:00
zhangxinyuehfad 1b4a2f3817
[CI] Add accuracy ci for DP and EP and TP and ETP (#1140)
### What this PR does / why we need it?

Add accuracy CI for DP, EP and TP

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
35514b682a

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 17:25:17 +08:00
Pr0Wh1teGivee d13fb0766e
[Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Set `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` to enable this feature; a usage sketch follows below.
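
A minimal usage sketch (the env var name comes from this PR; the model and
sampling values are illustrative assumptions):

```python
import os

# Enable the optimized top-k/top-p path before vLLM is initialized.
os.environ["VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model
params = SamplingParams(temperature=0.8, top_k=50, top_p=0.95)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```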
### How was this patch tested?
e2e & ut

- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
weiguihua2 aa4240c67f
Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1
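
A minimal usage sketch (assumed invocation, not taken from this PR; the model
and parallel sizes are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",      # placeholder model
    tensor_parallel_size=2,      # split each layer across 2 NPUs
    pipeline_parallel_size=2,    # split the layer stack across 2 more NPUs
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```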

### How was this patch tested?
Manually tested.

- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
zhangxinyuehfad 1cd27da5fb
[Test] Remove VLLM_USE_V1 in accuracy test (#1739)
### What this PR does / why we need it?
Remove VLLM_USE_V1 in accuracy test

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 15:29:11 +08:00
ttanzhiqiang ee40d3d850
use npu_moe_gating_top_k_softmax (#1355)
### What this PR does / why we need it?
The optimization for non-DeepSeek select_experts is to replace the
softmax+topk+to sequence with the fused npu_moe_gating_top_k_softmax op,
which is optimized from 37us to 14us on bf16/fp16 of Qwen3-235B; see the
sketch below.
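
A sketch of the unfused path that is being replaced (plain PyTorch, shown for
clarity); the fused call is only indicated in a comment because its exact
signature is an assumption here:

```python
import torch


def select_experts_unfused(router_logits: torch.Tensor, top_k: int):
    # Three separate kernels: softmax over experts, top-k, then dtype cast.
    scores = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights.to(router_logits.dtype), topk_ids


# Fused replacement (assumed call shape, single kernel on NPU):
# topk_weights, topk_ids, row_idx = torch_npu.npu_moe_gating_top_k_softmax(
#     router_logits, None, k=top_k)
```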

- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:55:06 +08:00
ttanzhiqiang 9d16c9982e
rm router logits Improve TTOP 3ms (#1407)
### What this PR does / why we need it?

The previous code was:
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)
This PR changes the two all_gathers into one, saving one all_gather
communication:
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

gsm8k accuracy verification
<img width="1809" alt="截屏2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>



- vLLM version: v0.9.2
- vLLM main:
77f77a951e

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:53:17 +08:00
ApsarasX 0fc9b56d40
[Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="截屏2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="截屏2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
Mengqing Cao cc210f46e6
[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718)
### What this PR does / why we need it?

Now there is no need to calculate `num_draft_tokens` when allocating
slots.

This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test

- vLLM version: v0.9.2
- vLLM main:
cc876d0f29

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 18:47:45 +08:00
wangxiyuan 011fd73a48
[CI] Make CI tracker more clear (#1720)
1. enable lint check for all changes
2. only run ut and e2e if the change touches code
3. only run ut and disable e2e if the change is ut-only
4. disable wheel build for the push case
5. run unit tests when a PR is merged
6. remove useless pytest.ini




- vLLM version: v0.9.2
- vLLM main:
fdfd409f8f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 16:03:23 +08:00
wangxiyuan 3d1e6a5929
[Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer.
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 14:26:59 +08:00
Li Wang c7446438a9
[1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid some low-level errors as much as possible.

This PR is one step of #1241. The purpose is to make the linting system
clearer and more convenient. This step mainly adds the following hooks:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO:
- clang-format (check csrc with Google style): needs code cleanup, disabled for now
- pymarkdown: needs code cleanup, disabled for now
- shellcheck: needs code cleanup, disabled for now

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
ApsarasX 643e6f5486
[Bugfix] Fix accuracy problem caused by mask pollution (#1678)
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.

The root cause of this problem is the use of in-place multiplication.
Switching to out-of-place multiplication resolves the accuracy problem, as
illustrated in the sketch below.
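
A minimal illustration of the failure mode (names are illustrative, not the
exact vllm-ascend code):

```python
import torch

attn_mask_cache = torch.ones(128, 128)  # shared across requests


def build_mask_inplace(scale: float) -> torch.Tensor:
    # In-place multiply mutates the shared cache and pollutes later requests.
    return attn_mask_cache.mul_(scale)


def build_mask_out_of_place(scale: float) -> torch.Tensor:
    # Out-of-place multiply returns a new tensor; the cache stays intact.
    return attn_mask_cache * scale
```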


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 14:06:49 +08:00
ttanzhiqiang 60519c71bd
shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
With all_reduce_merge enabled, shared_experts no longer performs an all_reduce
inside the MLP; instead, the all_reduce is deferred until both shared_experts
and router_experts have finished.
In both prefill and decode, merging the shared_experts and router_experts
all_reduce into one brings benefits; see the sketch below.
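
A sketch of the idea (illustrative only, assuming a tensor-parallel process
group; this is not the vllm-ascend implementation):

```python
import torch
import torch.distributed as dist


def moe_forward(hidden_states, shared_experts, routed_experts, tp_group):
    shared_out = shared_experts(hidden_states)   # no all_reduce inside the MLP
    routed_out = routed_experts(hidden_states)
    out = shared_out + routed_out
    dist.all_reduce(out, group=tp_group)         # one merged all_reduce at the end
    return out
```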
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
Yikun Jiang 997f156a51
Use ci_vllm_version when recording vLLM commit (#1689)
### What this PR does / why we need it?
Use ci_vllm_version when recording vllm commit

Followup on https://github.com/vllm-project/vllm-ascend/pull/1623

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test manually.
$ python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"'
v0.9.2
- Test on my local repo: https://github.com/Yikun/vllm-ascend/pull/35

- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-10 11:07:27 +08:00
ApsarasX 89c1a0f006
[Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380)
### What this PR does / why we need it?
In some cases, `dist._functional_collectives.reduce_scatter_tensor` can
cause its input tensor not to be released immediately after the current
layer ends. Instead, it will only be released when the GPU memory usage
of the current process reaches a certain threshold (approximately once every
15 layers).

**Before Fix**

<img width="1441" alt="截屏2025-06-24 01 26 13"
src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0"
/>

**After Fix**
<img width="1475" alt="截屏2025-06-24 01 23 43"
src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.9.1
- vLLM main:
9ff2af6d2b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 10:57:24 +08:00
Mengqing Cao b1c66b211f
[CI] Fix lint in CI (#1712)
### What this PR does / why we need it?
Fix lint in CI
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 10:47:18 +08:00
Li Wang 0c4aa2b4f1
[Doc] Add multi node data parallel doc (#1685)
### What this PR does / why we need it?
Add a multi-node data parallel doc.
### Does this PR introduce _any_ user-facing change?
Add a multi-node data parallel doc.
### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
805d62ca88

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 09:36:37 +08:00
leo-pony b4b19ea588
[Doc] Add multi-npu qwen3-MoE-32B Tutorials (#1419)
Signed-off-by: leo-pony <nengjunma@outlook.com>

### What this PR does / why we need it?
Add multi-npu qwen3-MoE-32B Tutorials
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-10 09:06:51 +08:00
xleoken 3ef45d0cc2
feat: Improve the offline_inference npu v0/v1 scripts (#1669)
### What this PR does / why we need it?

Improve
- Keep the same file name format as v1, `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 clearly in py scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` does not exist on ModelScope or HF.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: xleoken <xleoken@163.com>
2025-07-09 17:03:53 +08:00
Shanshan Shen 6af35f60cc
[Bugfix][CI] Remove V0 Spec Decode CI (#1656)
### What this PR does / why we need it?

To solve the error in the CI of long term test:

```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```

Replace the hf model with modelscope model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-09 15:53:58 +08:00
wangxiyuan b979ee353d
[Misc] Code clean up (#1679)
Make model_runner_v1 more readable

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 14:33:40 +08:00
wangxiyuan 392fd7239b
[Misc] Add attention mask (#1673)
Move the attention mask from V0 to a common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
wangxiyuan cc1588be50
[Misc] Code clean up (#1674)
Remove useless function
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:54:12 +08:00
wangxiyuan 830332ebfc
Clean up v0.9.1 code (#1672)
vLLM has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Icey 0d4bc03946
Fix wheel glibc version incompatibility (#1582)
### What this PR does / why we need it?
- Fixes https://github.com/vllm-project/vllm-ascend/issues/1533

### How was this patch tested?
1. Run the image
```
docker run \
    --name cann_container \
    --device /dev/davinci6 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -it  quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.11 bash
```

2. Install package
torch=2.5.1
torch-npu=2.5.1.post1.dev20250619 
vllm=0.9.1

vllm-ascend=vllm_ascend-0.1.dev1+g02ac443-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Artifact download URL:
https://github.com/vllm-project/vllm-ascend/actions/runs/16039661265/artifacts/3454481370

3. Test offline script

```
from vllm import LLM, SamplingParams

import os
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
]

llm = LLM(model="Qwen3/Qwen3-1.7B")

outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

4. Results

![result](https://github.com/user-attachments/assets/20f9d923-00ce-4a2d-8598-9b216045705d)

- vLLM version: v0.9.2
- vLLM main:
b942c094e3

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-08 18:46:02 +08:00
Yikun Jiang e4e9ea02ab
Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades the vLLM version to v0.9.2; it doesn't remove the
v0.9.1-compatible code, to ease review.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
NeverRaR 71de52d3a9
feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1. Sometimes loading the torchair cache fails because of fluctuations in NPU
memory, so this PR adds a new cache that saves the old KV cache bytes to
avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled and the cache does not exist yet, the first
compilation introduces the overhead of Dynamo guards. In this case, we
compile directly twice to skip them (this brings 3-4 ms of TPOT
optimization).

### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control the KV cache floating tolerance; a usage sketch follows below.
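
A hypothetical usage sketch (the value is an arbitrary example, not a
recommendation):

```python
import os

# Tolerate up to ~512 MB of KV-cache size drift when reusing the torchair cache.
os.environ["VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE"] = "512"
```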

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
NeverRaR df84cceca8
perf: use multicast to avoid padding decode request to prefill size (#1555)
### What this PR does / why we need it?
perf: use multicast to avoid padding decode request to prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
wm901115nwpu f08c4f15a2
fix spell error (#1654)
Fix the spell error in code

- vLLM version: v0.9.1
- vLLM main:
923147b5e8

Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
2025-07-07 20:24:42 +08:00
Mengqing Cao f2a20393a2
[CI] Fix mypy check in CI (#1655)
### What this PR does / why we need it?
Fix mypy check in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654

Mypy failed due to a newer numpy version. We need to pin
`numpy==1.26.4` in vllm-ascend.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 20:19:16 +08:00
Angazenn 18495f44b2
[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug caused by the max_num_tokens_across_dp calculation. In the
earlier version, we computed it as graph_pad_size plus the actual
max_num_tokens, which results in different max_num_tokens_across_dp values
across DP ranks. If padding is required, this can cause wrong padding; see the
sketch below.
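
For illustration only (not necessarily the PR's exact fix), a value that agrees
on every DP rank can be obtained by reducing with MAX over the DP process group:

```python
import torch
import torch.distributed as dist


def max_num_tokens_across_dp(local_num_tokens: int, dp_group) -> int:
    t = torch.tensor([local_num_tokens], dtype=torch.int64)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=dp_group)
    return int(t.item())
```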

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
Zheng Wengang 9c886d0a1f
[EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides DeepSeek-provided mechanism to balance expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
wangyanhui-cmss 4e29c5a808
Add ut for test_pooling_model_runner.py (#1640)
### What this PR does / why we need it?
 Add ut for test_pooling_model_runner.py

### Does this PR introduce _any_ user-facing change? N/A

### How was this patch tested?
 python -m unittest  test_pooling_model_runner.py


- vLLM version: v0.9.1
- vLLM main:
2e610deb72

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-07-07 17:12:11 +08:00
Yikun Jiang 493768eb30
Record vLLM commit in PR description (#1623)
### What this PR does / why we need it?
This patch enables recording the vLLM commit and also cleans up the unused
commit msg note in the PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- Test on https://github.com/Yikun/vllm-ascend/pull/33 and vllm commit
refreshed as expected.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-07 10:20:38 +08:00
Mengqing Cao 7efa4e92fe
[CI] Fix oom in chunk prefill (#1622)
### What this PR does / why we need it?
Add resource cleanup logic to fix the OOM issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 10:14:40 +08:00
ApsarasX c58accc15e
[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
zhangxinyuehfad 14373f65d7
[Test] Remove V0 accuracy test and enable MoE and VL test on V1 (#1574)
### What this PR does / why we need it?
Update accuracy test
1. remove accuracy report on V0
2. add parallel and execution mode
3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-06 11:10:19 +08:00
Yikun Jiang 0c1d239df4
Add unit test local cpu guide and enable base testcase (#1566)
### What this PR does / why we need it?
Use BaseTest and clean up all manual patch code:
- Cleanup EPLB config to avoid a tmp test file
- Use BaseTest with global cache
- Add license
- Add a doc for setting up unit tests in a local env

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 10:42:27 +08:00
Vincent Yuan eb390545ec
[Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333,
this PR disables the JIT compiler for the 310P and changes the data
format to NZ for the weights in the vocabulary embedding and QKV
projection layers, which helps improve performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao dd22ac38b2
[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly disabled because it does not work now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
wangxiyuan 343955c7ac
[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig

This PR does the same to fix the CI error.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00
zhangxinyuehfad 4e910186de
[CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
Angazenn a5f33590d3
[CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM Ascend only allowed DeepSeek to use torchair. Now we
can also use it with PanGu. Besides, we add a supported-model list to
control which types of models can use torchair.

### How was this patch tested?
We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn 9fbd8017c0
[Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This PR supports w8a8 on the 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
Yikun Jiang 6d7cb14a24
Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken e511ddd67d
[Bug] Fix wrong ModelScope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` assignment should be placed
before the module imports.

Otherwise:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
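
A minimal sketch of the corrected ordering (model name and task taken from the
traceback above):

```python
import os

# Must be set before any vllm import so the ModelScope backend is picked up.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM  # imported only after the env var is set

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```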

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
wangxiyuan a45dfde283
[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed the MoE config init way.
2.
48fb076cbc
changed the input batch logic.

This PR adapts vllm-ascend to these changes.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
yupeng d96da1f00c
[DOC] Fix word spelling (#1595)
### What this PR does / why we need it?
Fix word spelling in DOC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 21:42:39 +08:00
zhanghw0354 9fb3d558e5
[Test]Add unit test for platform.py (#1476)
### What this PR does / why we need it?
According to issue #1298, this pull request adds unit test code for
platform.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhuyilin <809721801@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>
2025-07-02 17:46:06 +08:00
Li Wang 30bf7014d0
[Bugfix] Add func `swap_states` to fix MLA attention (#1580)
### What this PR does / why we need it?
MLA attention was still using the gpu_input_batch attribute `swap_states`, which
led to the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`.

This PR fixes the MLA input batch error; a hypothetical sketch of such a helper
follows below.
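
A hypothetical sketch of what such a `swap_states` helper can look like (field
names are illustrative assumptions, not the actual InputBatch attributes):

```python
import numpy as np


class InputBatch:
    def __init__(self, max_reqs: int, max_len: int):
        self.token_ids_cpu = np.zeros((max_reqs, max_len), dtype=np.int64)
        self.num_tokens = np.zeros(max_reqs, dtype=np.int64)

    def swap_states(self, i: int, j: int) -> None:
        # Swap the per-request state stored in slots i and j.
        self.token_ids_cpu[[i, j]] = self.token_ids_cpu[[j, i]]
        self.num_tokens[[i, j]] = self.num_tokens[[j, i]]
```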
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Mengqing Cao 59237ea788
[CI/UT] Add test for chunk prefill and prefix cache on v1/AscendScheduler (#1505)
### What this PR does / why we need it?
Add test for chunked prefill and prefix cache on v1/AscendScheduler

Covered scenarios:
- `Qwen/Qwen3-0.6B-Base` and `deepseek-ai/DeepSeek-V2-Lite-Chat` ---
multicard CI time increased by 19 min
- `V1 + default scheduler` vs `V1 + default scheduler + enable prefix
cache`
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable prefix
cache` vs `V1 + Ascend scheduler + enable prefix cache + enable chunked
prefill`
- `Qwen/Qwen3-0.6B-Base` --- singlecard CI time increased by 8 min
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable chunked
prefill`

should rebase after #1498 and #1446
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:57:03 +08:00
Zhu Yi Lin 6b80c5acba
Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. drop some useless code for w8a8 fusedmoe
2. Add int8 kv cache check
3. Add more ut.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
Agonixiaoxiao 7fc1a98489
add ut for kv tansfer module (#1531)
### What this PR does / why we need it?
test kv data transfer contains connect,pipe,buffer

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: lixudong <lixudong@cmss.chinamobile.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:14:52 +08:00
Yikun Jiang aa5fa07478
Only enable single version for wheel pr build (#1571)
### What this PR does / why we need it?
Only enable a single version for the wheel PR build to speed up PR-triggered CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 14:50:34 +08:00
yupeng c3c8c9317c
[DOC] add LoRA user guide (#1265)
### What this PR does / why we need it?
Add a LoRA user guide to the docs. The content refers to [LoRA
Adapters](https://docs.vllm.ai/en/latest/features/lora.html).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 14:41:31 +08:00
Li Wang f39365d2ea
[Benchmark] Fix error msg upload in performance benchmark (#1559)
### What this PR does / why we need it?

Make sure that None parameters are not passed in for `--error`
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed locally

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 14:06:08 +08:00
wangxiyuan 641a4e6092
[CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vllm change
7f280d69c9
broke vllm-ascend.

This PR fixes the broken CI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone 0e43813120
[ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

Found CI broken by vLLM's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`.
Modified the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing CI will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Li Wang 6db7dc2c85
[Benchmark] Refactor perf script to use benchmark cli (#1524)
### What this PR does / why we need it?

Since the `vllm bench` CLI is now optimized enough for production use
(supports more datasets), we no longer need to copy vLLM code. With
vLLM installed, we can easily use the benchmark CLI.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-30 23:42:04 +08:00
leo-pony 53ec583bbb
[Docs] Update Atlas 300I series doc and fix CI lint (#1537)
### What this PR does / why we need it?
- Update Atlas 300I series doc: clean up unused parameters and enable
optimized ops
- Fix code spell CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 23:34:00 +08:00
wangxiyuan a054f0f4ca
[CI] change to new ds model (#1513)
Previously, the DeepSeek V3 pruning weights were not correct, so the MoE layer
was not tested. We updated to a new pruning model to enable MoE layer
compute.

This PR fixes the CI to use the new weights.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-30 19:02:29 +08:00
Shanshan Shen 8013634e9c
[Structured Output] Remove redundant check for `grammar_bitmask` (#1459)
### What this PR does / why we need it?
Remove redundant check since we have already checked this at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
Shanshan Shen ba577dfc52
[Doc] Add Structured Output guide (#1499)
### What this PR does / why we need it?
Add Structured Output guide.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:21:44 +08:00
whx f286265791
[BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498)
When using AscendScheduler with prefix cache enabled and chunked prefill
disabled, there is an accuracy problem because there is no branch in
mla_v1 to handle this scenario. This PR fixes it.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-30 16:51:20 +08:00
Li Wang 5f8241c25c
[V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add support for v1 pooling
tasks. Note that I moved `vllm.v1.worker.gpu_input_batch` down into
vllm-ascend: considering the frequent changes in upstream interfaces, it
was moved here to decouple from them.
### How was this patch tested?
CI passed with newly added/existing tests. A simple test was also
conducted locally, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown
below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
dependabot[bot] 790c810bf7
Bump actions/github-script from 6 to 7 (#1519)
Bumps [actions/github-script](https://github.com/actions/github-script)
from 6 to 7.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-30 16:04:41 +08:00
Yikun Jiang e4df0a4395
Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 13:37:22 +08:00
Yikun Jiang cad4c693c6
Add Pangu MoE Pro docs (#1512)
### What this PR does / why we need it?
This PR add Pangu MoE Pro 72B docs

[1] https://gitcode.com/ascend-tribe/pangu-pro-moe-model

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 12:15:33 +08:00
yiz-liu 75d05ee200
[Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446)
### What this PR does / why we need it?

This fixes the shape of block_table, which was broken by the hybrid KV
groups change several weeks ago.

An error is raised when prefix cache (eager or not) and Ascend
Scheduler are enabled at the same time; sending two identical requests
reproduces it.

v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test manually

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-30 11:25:19 +08:00
Zhu Yi Lin b308a7a258
support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
support pangu moe w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Angazenn c59d69d9e6
[PERF]support MERRouter (#1421)
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model.
Different from the original grouped topk, it selects only the top experts
that are allocated more tokens. Therefore, we can load fewer experts when
calculating gmm.
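A rough sketch of the idea (a minimal sketch assuming a `num_voted_experts` knob as described above; the PR's actual implementation may differ):

```python
import torch

def keep_top_voted_experts(topk_ids: torch.Tensor,
                           num_experts: int,
                           num_voted_experts: int) -> torch.Tensor:
    """Sketch: keep only the experts that received the most tokens.

    topk_ids: [num_tokens, top_k] expert indices produced by grouped topk.
    Returns a boolean mask of experts that still need to be loaded for gmm.
    """
    votes = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    kept = torch.topk(votes, k=num_voted_experts).indices
    expert_mask = torch.zeros(num_experts, dtype=torch.bool)
    expert_mask[kept] = True
    return expert_mask

mask = keep_top_voted_experts(torch.randint(0, 16, (8, 4)), num_experts=16, num_voted_experts=5)
print(mask.sum())  # 5 experts kept
```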

We have tested this algorithm for PanguProMoE-72B on the 300I Duo and
800I A2 platforms. On the 300I Duo platform, we find that setting
`num_voted_experts` to 5 achieves both good performance and accuracy,
while on 800I A2 we still set it to 8 to use the original Pangu grouped topk.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:14:49 +08:00
Angazenn 8fa188111d
[PERF]support H2P communication optimization for PanguProMoe (#1463)
### What this PR does / why we need it?
In this PR, we support H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
to replace `all_reduce` to improve performance:

original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm
--> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce
now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter -->
post_attention_layernorm --> all_rank all_gather --> moe/mlp -->
all_rank reduce_scatter

Besides, because `reduce_scatter` requires num_tokens to be divisible
by the group size, we need to pad the sequences based on
`max_tokens_across_dp`, as sketched below.
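A minimal sketch of the padding step (variable names are illustrative, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def pad_for_reduce_scatter(hidden_states: torch.Tensor,
                           max_tokens_across_dp: int,
                           group_size: int) -> torch.Tensor:
    """Pad the token dim so reduce_scatter receives a length divisible by group_size."""
    target = ((max_tokens_across_dp + group_size - 1) // group_size) * group_size
    pad_len = target - hidden_states.shape[0]
    if pad_len > 0:
        # Pad rows of zeros at the end of the token dimension.
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad_len))
    return hidden_states

x = pad_for_reduce_scatter(torch.randn(10, 4), max_tokens_across_dp=10, group_size=8)
print(x.shape)  # torch.Size([16, 4])
```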

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:10:27 +08:00
Angazenn 5c53cbaf2a
[BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug when using broadcast with cpu_group while running dp.
The `broadcast310p` patch takes effect for both the cpu_group and the
device group, but we only need it for the device group. Hence a wrapper is
added to let cpu_group use the native torch broadcast, which solves the
bug.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao 2cf9c4c3a2
[CI/Build] Fix version conflict on transformers (#1490)
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with new existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 15:11:04 +08:00
Mengqing Cao 5f4391652f
[PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy
on V1.

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Using lm_eval, which relies on prompt logprobs as output to test accuracy:
```python
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct`
on V1 are:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```
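For reference, a minimal offline sketch of requesting prompt logprobs through vLLM's `SamplingParams` (not taken from the PR's test code):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096)
# prompt_logprobs=1 requests the top-1 logprob for every prompt token.
params = SamplingParams(max_tokens=8, prompt_logprobs=1)
out = llm.generate(["The capital of France is"], params)[0]
print(out.prompt_logprobs)  # per-token logprob dicts; the first entry is None
```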

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
Shanshan Shen 99e685532d
[Doc] Add Qwen2.5-VL eager mode doc (#1394)
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-28 09:08:51 +08:00
Mengqing Cao d59e7fa095
[CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
Shanshan Shen 3687676fa7
[Doc] Add guidance on how to implement and register new models (#1426)
### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-27 16:46:49 +08:00
wangxiyuan 5571fb7118
[Misc] Add release checklist issue template (#1447)
Add the release checklist issue template.

Every release manager should create and follow the checklist to do the
release step by step.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:15:36 +08:00
wangxiyuan 5968dff4e0
[Build] Add build info (#1386)
Add a static build_info.py file to show SoC and sleep mode info. It helps
keep the code clean, and the error info will be friendlier for
users.

This PR also added the unit test for vllm_ascend/utils.py.

This PR also added the base test class for all UTs in tests/ut/base.py.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
Li Wang c563a08f0a
[CI] Fix nightly benchmark (#1453)
### What this PR does / why we need it?
Sometimes the performance benchmark workflow may fail. This adds a
notice when a run fails and avoids uploading the dirty data of the
failed run.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-26 19:39:18 +08:00
Zesheng Zong 192dbbcc6e
Optimize Patch developer guide (#1452)
### What this PR does / why we need it?
Fix some terms in the user guide.


Signed-off-by: zeshengzong <zesheng.zong@outlook.com>
2025-06-26 19:10:16 +08:00
wangyanhui-cmss e5eea64b66
[CI/UT] Add ut for parallel_state.py (#1460)
### What this PR does / why we need it?
 Add ut for parallel_state.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
 python -m unittest  test_parallel_state.py

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-26 19:03:27 +08:00
Shanshan Shen 4e2daf5ab7
[Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony 1025344912
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode.
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

 Signed-off-by: leo-pony <nengjunma@outlook.com>

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
sdmyzlp 53c2d58ae1
Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed in non-compiled mode regardless of
`torchair_graph_config.enabled`, causing multistream MLA to fail, since it
assumes torchair compiled mode for decode when
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this, as sketched below.
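A tiny sketch of the augmented condition (names follow the description above and are illustrative):

```python
def use_torchair_decode(torchair_graph_enabled: bool,
                        with_prefill_across_dp: bool) -> bool:
    """Sketch: only assume the torchair-compiled decode path when graph mode is
    enabled and the current step is not a prefill across DP ranks."""
    return torchair_graph_enabled and not with_prefill_across_dp
```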

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
yiz-liu 2690697caa
[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
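A simplified sketch of the fix (tensor names mirror the description above; the real change lives in `NPUModelRunner`):

```python
import torch

def reset_unused_positions(positions: torch.Tensor,
                           total_num_scheduled_tokens: int) -> torch.Tensor:
    """Zero out padded/stale entries so every index stays inside the attention mask."""
    positions[total_num_scheduled_tokens:] = 0
    return positions

positions = torch.tensor([3, 7, 12, 512, 511])  # last two are stale entries from a previous batch
print(reset_unused_positions(positions, total_num_scheduled_tokens=3))  # tensor([ 3,  7, 12,  0,  0])
```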

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
zhangxinyuehfad 06ccce1ddf
[FOLLOWUP] fix name and format in accuracy test (#1288) (#1435)
### What this PR does / why we need it?
Fix accuracy test:
1. Fix the accuracy report, e.g.:
https://vllm-ascend--1429.org.readthedocs.build/en/1429/developer_guide/evaluation/accuracy_report/Qwen2.5-7B-Instruct-V0.html
2. Fix creating the PR for the report

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-26 00:26:54 +08:00
Pr0Wh1teGivee 2fda60464c
[Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use the fused op torch_npu.npu_top_k_top_p(logits, p, k) when p and k are
not None, otherwise fall back to the original implementation. The replacement
takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires
torch_npu>=2.5.1.post1.dev20250619.
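A hedged sketch of the dispatch logic; only `torch_npu.npu_top_k_top_p(logits, p, k)` comes from the description above, and the fallback helper is a placeholder:

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, p, k) -> torch.Tensor:
    """Use the fused NPU op when both p and k are given, otherwise fall back."""
    if p is not None and k is not None:
        import torch_npu  # requires torch_npu >= 2.5.1.post1.dev20250619
        return torch_npu.npu_top_k_top_p(logits, p, k)
    return _apply_top_k_top_p_reference(logits, p, k)

def _apply_top_k_top_p_reference(logits, p, k):
    # Placeholder for the original (non-fused) top-k/top-p masking path.
    return logits
```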

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested by DeepSeek R1 and UT passed

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW e7efc7e7e7
[BugFix] Remove not using patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
sharonyunyun 941269a6c5
adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in the MLA layer's TP group, to remove
StridedSlice and all_gather in the MoE layer.
When tp > 1, it is enabled during the decode phase of graph mode
when enable_multistream_moe, MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR brings a
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
wangxiyuan 205cb85a1e
[Doc] Fix doc typo (#1424)
1. Fix the typo
2. Fix 404 URL
3. Update graph mode and additional config user guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
wangxiyuan ca884ef86d
[Misc] Clean up useless code for LLM initialization (#1373)
This PR aims to clean up the useless code for LLM setup. It helps to
make the code clearer.
1. remove useless `self.xxx` property
2. change `set_random_seed` to `seed_everything`
3. remove `set_custom_all_reduce`, it's only used for cuda

This is just a code clean. no change for any code logic.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
zhangxinyuehfad 0060886a37
[CI]Update accuracy report test (#1288)
### What this PR does / why we need it?
Update accuracy report test
1. Record commit hashes and GitHub links for both vllm and
vllm-ascend in accuracy reports
2. Add accuracy result verification checks to ensure output correctness
3. Create PR via forked repository workflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dense-accuracy-test:
https://github.com/vllm-project/vllm-ascend/actions/runs/15745619485
create pr via forked repository workflow:
https://github.com/zhangxinyuehfad/vllm-ascend/actions/runs/15747013719/job/44385134080
accuracy report pr:
https://github.com/vllm-project/vllm-ascend/pull/1292

Currently, the accuracy report used is old; this PR needs to be merged,
the test rerun, the new report updated, and then #1292 closed.


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-25 14:10:34 +08:00
Li Wang 15df8be937
[Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan e4e0b7af05
[Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer
2. Add the patch doc for developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao 52317f92cb
[DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp.

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Mengqing Cao c1c5d56255
[Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Li Wang 5f5800ba42
[Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
liziyu 6ed3f00427
[Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
remove unused environment variable VLLM_ENABLE_MC2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
Mengqing Cao 20767a043c
[CI/UT] Fix disaggregated prefill ci (#1313)
### What this PR does / why we need it?
Use eager mode to run disaggregated prefill ci

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-24 17:11:00 +08:00
wangxiyuan 9cbce423ce
[MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vLLM works with non-CUDA platforms now.
Remove this useless patch.

Which was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2)
vLLM upstream merged:
3e472d882a
(v0.8.0)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
lyj-jjj 5177bef87a
support fused_moe_allgather_ep (#1335)
### What this PR does / why we need it?
support fused_moe_allgather_ep

### How was this patch tested?
It was tested by UT.

Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
2025-06-23 22:03:38 +08:00
Yikun Jiang 917c6b71af
[TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- add system package installation
- Add doc for run doctests
- Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change schedule job from 4 ---> 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey 08cfc7cb4b
Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md for adding pip extra index of torch-npu

### How was this patch tested?
No need

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2 e1123172d1
[Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users re-install vllm-ascend with pip, the `build`
folder should be removed first.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
linfeng-yuan 15592c0d48
[bugfix] fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions (#1331)
### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving, and got a score of 83.33 on the AIME2024 dataset with the DP4TP4EP16
setting.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-23 09:52:27 +08:00
zxdukki f04c6763d8
[Bugfix] fix env variable in dbo (#1284)
### What this PR does / why we need it?
Fix the env variable in DBO to enable DBO for the DeepSeek-V3 model. Besides, we
have fixed a known issue in deepseek-dbo.


### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
It can be verified with pytest.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-23 09:07:57 +08:00
Shanshan Shen 21fb68a03a
[CI] Update guided decoding ut (#1312)
### What this PR does / why we need it?
Update guided decoding ut.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-23 09:06:20 +08:00
wemaster 339d6894f6
[CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why wasn't the ngram
problem found when PR1109 was merged? A: The newly introduced
problem only appears when tp>1, and the use cases on CI are all tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including eagle speculative UTs, which left CI
unable to cover the eagle function. I added
it (`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it. It was because vLLM's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed them in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Pleaplusone 7e6efbf2a9
update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests will guarantee the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken 4447e53d7a
[Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang a95afc011e
[CI] Enable merge trigger unit test and accuracy test schedule job (#1345)
### What this PR does / why we need it?
- Enable merge trigger unit test and accuracy test schedule job
- Pin lm-eval==0.4.8 to resolve Qwen3 8B accuracy
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 17:21:57 +08:00
Yikun Jiang 2e5f312530
Cleanup unused doc (#1352)
### What this PR does / why we need it?
Clean up unused doc for the MoGE model; we will add this back when the MoGE
model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang c30ddb8331
Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
Yikun Jiang 097e7149f7
[Platform] Add initial experimental support for Atlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes
the PRs below into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
Users can run vLLM on Atlas 300I Duo series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit test missing because a real 310P image is needed for the test;
will add it in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
Yikun Jiang 2009fdb8da
[Test] Enable code cov for V1 and enable push trigger (#1164)
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-21 00:01:05 +08:00
Angazenn 2f1266d451
Support Pangu Pro MoE model (#1204)
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)

### Does this PR introduce _any_ user-facing change?
Yes, new model supported

### How was this patch tested?
Test locally

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-20 23:59:59 +08:00
yuancaoyaoHW 00ae250f3c
[V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Proposer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- we test accept rate on one Ascend 910B npu, The acceptance rate
results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
wangxiyuan 45be1aac0c
[CI] Add codespell check for doc (#1314)
Add a codespell check test for doc-only PRs

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions 761bd3d9d7
Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
yiz-liu 2c7dd85fd8
[Fix] Fix the token-wise padding mechanism (#1300)
### What this PR does / why we need it?
Fix the token-wise padding mechanism.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-20 14:46:17 +08:00
wangxiyuan b350edae9a
[UT] refactor test_expert_load_balancer and fix broken CI (#1293)
Refactor test_expert_load_balancer to keep the UT code style.

This PR also fixed the breaking change from
https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280

This is just a quick fix. We'll support embedding on V1 later.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1299

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 01:02:52 +08:00
songshanhu07 ebb2a70dbb
static EPLB fix bug, add unit test (#1186)
### What this PR does / why we need it?
1. Add static EPLB unit test
2. Fix bug: a Tensor cannot be directly evaluated in an if statement (see the sketch below)
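A small illustration of the fixed pattern (generic PyTorch behavior, not the PR's exact code):

```python
import torch

mask = torch.tensor([True, False, True])

# Buggy: a multi-element Tensor cannot be evaluated directly in an if statement;
# `if mask:` raises "Boolean value of Tensor with more than one element is ambiguous".

# Fixed: reduce the tensor to a single boolean first.
if mask.any():
    print("at least one entry is set")
```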
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Run the unit test.

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
2025-06-18 19:46:56 +08:00
Shanshan Shen 2cd8ecdc4f
[Bugfix][Spec Decode] Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode (#1258)
### What this PR does / why we need it?

Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.

Find more details at **mengwei805**'s comment in
https://github.com/vllm-project/vllm-ascend/pull/1123.

### Does this PR introduce _any_ user-facing change?

The user will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE`
(`ACL_OP_INIT_MODE`).

### How was this patch tested?

Test scripts:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Results:

```
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 76.70it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s]
Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James'
```

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-18 17:50:20 +08:00
zzzzwwjj db2f630aeb
[bugfix] fix deepseek with mc2 (#1268)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-18 00:58:38 +08:00
whx d7e19ed57a
[BugFix] fix length of sin/cos cache in rope (#1266)
This PR fixes the bug of constructing a sin/cos cache shorter than the model's
max position embedding length.
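A minimal sketch of sizing the cache to the model's maximum position (standard rotary-embedding math; names are illustrative):

```python
import torch

def build_sin_cos_cache(max_position_embeddings: int, rotary_dim: int, base: float = 10000.0):
    """Build cos/sin caches that cover every position up to max_position_embeddings."""
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(max_position_embeddings).float()
    freqs = torch.outer(positions, inv_freq)  # [max_pos, rotary_dim // 2]
    return freqs.cos(), freqs.sin()

cos, sin = build_sin_cos_cache(max_position_embeddings=4096, rotary_dim=64)
print(cos.shape)  # torch.Size([4096, 32])
```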

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-17 23:14:25 +08:00
Jade Zheng afc8edb046
[Bugfix]: Pass scaling args to mc2 (#1202)
Pass `expert_scale` and `expand_scale` args to the dispatch and combine
functions.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-06-17 22:16:44 +08:00
Li Wang f8029945c3
[Bugfix] Remove cuda related lines and add additional pip mirror (#1252)
### What this PR does / why we need it?
- For the NPU environment, we should use `PYTORCH_NPU_ALLOC_CONF` rather
than `PYTORCH_CUDA_ALLOC_CONF`
- Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy


---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-17 21:25:40 +08:00
zzzzwwjj 23ca68d0c8
[refactor] Refactoring AscendFusedMoE (#1229)
### What this PR does / why we need it?
This PR resolves [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147):
1. Move fused_moe code into one file, `fused_moe.py`.
2. Integrate branch conditions into the function `get_fused_moe_state` (see the sketch below).
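A hedged sketch of what such a selector might look like (the state names and conditions are assumptions, not the PR's actual branches):

```python
from enum import Enum

class FusedMoEState(Enum):
    ALL_GATHER = "all_gather"
    ALL2ALL = "all2all"
    MC2 = "mc2"

def get_fused_moe_state(ep_size: int, with_prefill: bool) -> FusedMoEState:
    """Centralize the branch conditions into a single MoE communication state."""
    if ep_size == 1:
        return FusedMoEState.ALL_GATHER
    if with_prefill:
        return FusedMoEState.ALL2ALL
    return FusedMoEState.MC2
```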

### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`. I think this env is useless:
we can make the judgment based on the current scenario without it, and it
only adds complexity.
2. This PR removes the env `USING_LCCL_COM`, because this env is already
obsolete.
3. `additional_config.expert_tensor_parallel_size` is also obsolete; we now
use the parameter `enable_expert_parallel`, consistent with vLLM (see the
sketch after this list).
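
A sketch of the replacement configuration, assuming the offline `LLM` entrypoint forwards `enable_expert_parallel` to vLLM's engine args; the model name and parallel sizes are illustrative.

```python
from vllm import LLM

# Expert parallelism is requested via the standard vLLM argument instead of the
# removed VLLM_ENABLE_MC2 / USING_LCCL_COM envs or expert_tensor_parallel_size.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model
    tensor_parallel_size=4,
    enable_expert_parallel=True,
)
```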

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-17 17:49:03 +08:00
Yikun Jiang 05dec7eda9
[Doc] Refactor and init user story page (#1224)
### What this PR does / why we need it?
This PR refactors the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview locally

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 09:36:35 +08:00
Yikun Jiang 9d3cbc0953
[Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Install doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Co-authored-by: wangli <wangli858794774@gmail.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao 96fa7ff63b
[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix the rank set in the DP scenario. The new POC version of torch-npu
supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
zhuo97 f5404dc650
Fix the device error when using ray as vllm-ascend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
wangxiyuan 69b817ed65
[CI] Add unit test framework (#1201)
This PR added the unit test framework to enable unit tests for vLLM Ascend.
Unit tests run on CPU machines and will run once the lint check passes, the
same as the e2e tests.

For unit tests, this PR created a new folder called `ut` under the `tests`
module. The files in `ut` should mirror the code layout in `vllm-ascend`,
and file names should start with the `test_` prefix. For example, in this
PR, `test_ascend_config.py` is added to test `ascend_config.py`.

A new file `worker/test_worker_v1.py` is also added as a placeholder.
This file should hold the unit tests for `vllm-ascend/worker/worker_v1.py`.

Additionally, a new `fake_weight` folder is added; it contains the
config.json from `facebook/opt-125m`, so that the tests do not always
need to visit Hugging Face.

TODO:
We should add the remaining unit test files one by one in the future.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-16 18:32:28 +08:00
Yikun Jiang 966557a2a3
[Build] Speedup image build (#1216)
### What this PR does / why we need it?
1. Rename the workflow to show OS info
2. Speed up the image build:
- PR: only the arm64 build runs on openEuler arm64, and only the amd64
build runs on Ubuntu amd64
- Push/Tag: keep the original logic, using QEMU on amd64

This PR actually drops the per-PR e2e image build, but I think that's fine
considering it's stable enough; if we still hit problems we can revert
this PR.

43-44 mins ---> about 8-10 mins

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-16 09:02:53 +08:00
Yikun Jiang 4ce860a2be
[CI] Make e2e test to be preemptible and simple (#1217)
### What this PR does / why we need it?
This PR makes the e2e test simple. It brings some repeated code between the
single-card and multi-card jobs, but we no longer struggle with
max-parallel, matrix and concurrency:
1. Make the e2e test preemptible and simple:
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- Any new push to a PR cancels the previous job, whether that job
is lint / e2e / multi-card
2. Use ModelScope rather than hf-mirror
3. Resolve errors like `Canceling since a higher priority waiting
request for pr-XXXX-limit-npu-4 exists`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- e2e test will be canceled when the patch is updated

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-15 22:07:43 +08:00
ttanzhiqiang 4270682383
Waiting for BMM NZ support (improve TPOT by 2ms) (#1131)
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because this position will be
fused into TransposeBatchMatMul, which does not support NZ. The weights
are therefore converted back to ND on each run.
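
For illustration only, a hedged sketch of an NZ/ND format round-trip with torch_npu; it assumes `torch_npu.npu_format_cast` with ACL format ids 2 (ND) and 29 (FRACTAL_NZ), and is not the fused TransposeBatchMatMul path described above.

```python
import torch
import torch_npu  # registers the NPU backend

ACL_FORMAT_ND = 2           # assumed ACL format id for ND
ACL_FORMAT_FRACTAL_NZ = 29  # assumed ACL format id for FRACTAL_NZ

w = torch.randn(128, 512, dtype=torch.float16).npu()
w_nz = torch_npu.npu_format_cast(w, ACL_FORMAT_FRACTAL_NZ)  # pre-convert weight to NZ
# If the consuming fused op only accepts ND, the weight gets cast back on every
# run, which is exactly the per-step overhead discussed above.
w_nd = torch_npu.npu_format_cast(w_nz, ACL_FORMAT_ND)
```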

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, i.e. TPOT improves by about 2ms

### How was this patch tested?
use #1101

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-15 19:57:02 +08:00
22dimensions 0d2074a1ec
[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226)
os.environ["VLLM_USE_V1"] must be assigned with str, not other type.


![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-15 15:41:11 +08:00
fems14 ab5d110fcc
vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend now supports chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00
Mengqing Cao a3b5af8307
[CI/UT][Graph] Add ut for torchair graph mode (#1103)
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-06-14 16:59:00 +08:00
Yikun Jiang 94a52cf577
Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203)
### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with about 30+ solid, high-quality
reviews in the P/D disaggregation and DeepSeek improvement areas, such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason why I nominated
him: helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the important
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience with AI infra. He has very good habits, explains all changes in great
detail (#issue-3023082580), and shares results openly
(#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community involvement: actively involved in community discussions, he is
collaborative, helps users solve problems, and has taken part in 30+ PRs and
issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-13 18:25:50 +08:00
whx 47b507b180
[CI] Recover ut for ascend scheduler only in ci of v1. (#1180)
The previous PR [#943](https://github.com/vllm-project/vllm-ascend/pull/943)
wrongly enabled the AscendScheduler unit tests in the V0 CI; this PR fixes
that and runs them only in the V1 CI.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-13 07:51:23 +08:00
sdmyzlp e72f94e38f
Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we instead keep them in
the main stream and only require that they be computed before `matmul W_UQ`
to avoid hindering later overlapping. The problem may be solved by a later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.
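
An illustrative two-stream sketch of the idea (not the torchair graph implementation), assuming torch_npu mirrors the CUDA stream API via `torch.npu.Stream` / `torch.npu.stream`; the shapes and the manual RMSNorm stand in for the real MLA ops.

```python
import torch
import torch_npu  # assumed to provide the torch.npu stream API

main = torch.npu.current_stream()
vector_stream = torch.npu.Stream()  # secondary stream for vector ops

hidden = torch.randn(16, 4096, dtype=torch.float16).npu()
w_dq = torch.randn(4096, 1536, dtype=torch.float16).npu()
w_uq = torch.randn(1536, 4096, dtype=torch.float16).npu()

q_lowrank = torch.matmul(hidden, w_dq)        # matmul W_DQ stays on the main stream

with torch.npu.stream(vector_stream):
    vector_stream.wait_stream(main)           # wait for the producer matmul
    # q_rmsnorm runs on the secondary stream, overlapping with main-stream matmuls
    q_norm = q_lowrank * torch.rsqrt(q_lowrank.pow(2).mean(-1, keepdim=True) + 1e-6)

other = torch.matmul(hidden, w_dq)            # main-stream work overlapping the norm

main.wait_stream(vector_stream)               # re-sync before consuming q_norm
q = torch.matmul(q_norm, w_uq)
```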

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
Wan_Danfeng 55c0e68883
[Doc] Add Referer header for CANN package download url. (#1192)
### What this PR does / why we need it?
Fix the CANN package download URL.

### Does this PR introduce _any_ user-facing change?
No, this does not have any user-facing change.

### How was this patch tested?
Ran the **wget** command; the CANN package was downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-06-12 21:22:23 +08:00
wangyanhui-cmss c6e2a5fb40
[fix] fix bug in 1p1d disaggregated_prefill example (#1184)
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running `python find_device_ips.py` and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-12 19:40:58 +08:00
Li Wang 37f4469a03
[CI][Benchmark] Add qwen2.5-7b test (#1104)
### What this PR does / why we need it?
- Add a qwen2.5-7b performance benchmark; this is a sub-PR of #1099. The
v1 test needs more verification.
- Fix getting the commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:47:30 +08:00
Li Wang dd207cb261
[CI][Benchmark] Add new model and v1 test to perf benchmarks (#1099)
### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:46:41 +08:00
ttanzhiqiang 2498d297ae
add custom ascendc kernel vocabparallelembedding (#796)
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also included in this PR.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-12 10:44:33 +08:00
whx 3393d53b36
[Scheduler][MTP] Add support for speculative decoding in AscendScheduler. (#943)
This PR adds support for speculative decoding in AscendScheduler.
It also includes part of the support for disaggregated prefill; full support
will be merged in a follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-11 20:55:44 +08:00
wangxiyuan 4f5964420e
[CI] Upgrade vllm to 0.9.1 (#1165)
1. Upgrade vLLM to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable V0 tests for PRs.
3. Move the actionlint check to the lint job.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
chenwaner e46dc142bf
Enable kvcache_nz for the decode process in torchair graph mode (#1098)
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True` (see the sketch below).
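
A hedged sketch of how the switch might be passed through; only the `torchair_graph_config.enable_kv_nz` key comes from this PR, while the entrypoint, the `additional_config` kwarg, the `enabled` prerequisite key and the model name are assumptions.

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,       # assumed prerequisite: torchair graph mode on
            "enable_kv_nz": True,  # NZ layout for the KV cache during decode
        },
    },
)
```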

How was this patch tested?
1. Tested on the DeepSeek model: with batch size 64 and seq_len 1k+3k, the
total FA time across 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang, and the curl result is normal:

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588

---------

Signed-off-by: chenwaner <861645847@qq.com>
2025-06-11 14:09:28 +08:00
yz 4153a5091b
[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159)
Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
2025-06-11 11:03:37 +08:00
ttanzhiqiang 980cd81466
etp best a2 (#1101)
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1 with attention
tp8/dp2 and MoE using ETP.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- https://github.com/vllm-project/vllm-ascend/pull/910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100
- [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
depeng1994 860a5ef7fd
provide an e2e guide for execute duration profiling (#1113)
### What this PR does / why we need it?
Provide an end-to-end guide for execution duration profiling.


Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-11 10:02:11 +08:00
sdmyzlp 7bdc606677
Support multistream of shared experts in FusedMoE (#997)
Contains #1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of the shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of the
shared experts are forced to be replicated across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
Mengqing Cao 04abfd8721
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163)
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: https://github.com/vllm-project/vllm-ascend/issues/1162

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-11 07:31:13 +08:00
22dimensions 8b48daaa44
[CI] rename Qwen2.5-0.5B-Instruct-W8A8 model (#1145)
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-11 06:18:32 +08:00
274 changed files with 21139 additions and 14280 deletions

View File

@ -15,17 +15,16 @@
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.10
FROM quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py${PY_VERSION}
FROM quay.io/ascend/manylinux:8.0.0-910b-manylinux_2_28-py${PY_VERSION}
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
WORKDIR /workspace
@ -41,8 +40,6 @@ RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
cd vllm-ascend && \
python3 setup.py bdist_wheel && \
ls -l dist && \
for f in dist/*.whl; do mv "$f" "$(echo "$f" | sed -e 's/-linux_x86_64\.whl$/-manylinux1_x86_64.whl/' -e 's/-linux_aarch64\.whl$/-manylinux2014_aarch64.whl/')"; done && \
ls -l dist
ls -l dist
CMD ["/bin/bash"]

View File

@ -1,5 +1,5 @@
name: 📚 User Story
description: Apply for an user story to be displayed on https://vllm-ascend.readthedocs.org/user_stories/index.html
description: Apply for an user story to be displayed on https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html
title: "[User Story]: "
labels: ["user-story"]

View File

@ -0,0 +1,100 @@
name: Release Checklist
description: Generate a release checklist issue when prepare a new release.(Used for release team)
title: "[Release]: Release checklist for v"
body:
- type: textarea
attributes:
description: >
Brief info for the new release.
label: Release Checklist
value: >
**Release Version**:
**Release Branch**:
**Release Date**:
**Release Manager**:
- type: textarea
attributes:
description: >
Release notes.
label: Prepare Release Note
value: >
- [ ] Create a new issue for release feedback
- [ ] Write the release note PR.
- [ ] Update the feedback issue link in docs/source/faqs.md
- [ ] Add release note to docs/source/user_guide/release_notes.md
- [ ] Update version info in docs/source/community/versioning_policy.md
- [ ] Update contributor info in docs/source/community/contributors.md
- [ ] Update package version in docs/conf.py
- type: textarea
attributes:
description: >
Make sure the code is merged.
label: PR need Merge
value: >
- [ ] PR link1
- [ ] PR link2
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the new Feature/Function is tested
label: Functional Test
value: >
- [ ] Feature1
- [ ] Bug1
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the doc is updated.
label: Doc Test
value: >
- [ ] Tutorial is updated.
- [ ] User Guide is updated.
- [ ] Developer Guide is updated.
- type: textarea
attributes:
description: >
Make sure the artifacts is ready
label: Prepare Artifacts
value: >
- [ ] Docker image is ready.
- [ ] Wheel package is ready.
- type: textarea
attributes:
description: >
Start to release.
label: Release Step
value: >
- [ ] Release note PR is merged.
- [ ] Post the release on GitHub release page.
- [ ] Generate official doc page on https://app.readthedocs.org/dashboard/
- [ ] Wait for the wheel package to be available on https://pypi.org/project/vllm-ascend
- [ ] Wait for the docker image to be available on https://quay.io/ascend/vllm-ascend
- [ ] Upload 310p wheel to Github release page
- [ ] Broadcast the release news (By message, blog , etc)
- [ ] Close this issue

59
.github/format_pr_body.sh vendored Executable file
View File

@ -0,0 +1,59 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh
#!/bin/bash
set -eux
# ensure 3 arguments are passed
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <pr_number> <vllm_version> <vllm_commit>"
exit 1
fi
PR_NUMBER=$1
VLLM_VERSION=$2
VLLM_COMMIT=$3
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
FINAL=/tmp/final_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove notes in pr description and add vLLM version and commit
sed -i '/<!--/,/-->/d' "${NEW}"
sed -i '/- vLLM .*$/d' "${NEW}"
{
echo ""
echo "- vLLM version: $VLLM_VERSION"
echo "- vLLM main: $VLLM_COMMIT"
} >> "${NEW}"
# Remove redundant empty lines
uniq "${NEW}" > "${FINAL}"
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${FINAL}"; then
echo
echo "Updating PR body:"
echo
cat "${NEW}"
gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
else
echo "No changes needed"
fi

View File

@ -1,202 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: Accuracy Report
on:
workflow_dispatch:
inputs:
vllm-ascend-branch:
description: 'vllm-ascend branch:'
required: true
type: choice
options:
- main
- v0.7.3-dev
models:
description: 'models:'
required: true
type: choice
options:
- all
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-8B-Base
default: 'all'
jobs:
download_reports:
runs-on: ubuntu-latest
strategy:
matrix:
model: ${{ fromJSON(
(github.event.inputs.models == 'all' &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-7B-Instruct' &&
'["Qwen/Qwen2.5-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
'["Qwen/Qwen3-8B-Base"]')
) }}
version: [0, 1]
exclude:
- model: 'Qwen/Qwen2.5-VL-7B-Instruct'
version: 1
fail-fast: false
name: Download ${{ matrix.model }} V${{ matrix.version }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.vllm-ascend-branch }}
- name: Get base model name
id: get_basename
run: |
model_base_name=$(basename "${{ matrix.model }}")
echo "model_base_name=$model_base_name" >> $GITHUB_OUTPUT
shell: bash
- name: Query artifact run id
id: get_run_id
run: |
ARTIFACT_PATTERN="${{ github.event.inputs.vllm-ascend-branch }}-${{ steps.get_basename.outputs.model_base_name }}-V${{ matrix.version }}-report"
echo "Querying artifacts with pattern: $ARTIFACT_PATTERN"
ARTIFACT_JSON=$(gh api --paginate /repos/${{ github.repository }}/actions/artifacts || echo "{}")
RUN_ID=$(echo "$ARTIFACT_JSON" | \
jq -s -r --arg pattern "$ARTIFACT_PATTERN" \
'[.[].artifacts[]] | map(select(.name | test($pattern))) | sort_by(.created_at) | last | .workflow_run.id // empty')
if [ -z "$RUN_ID" ]; then
echo "::warning::No artifact found matching pattern $ARTIFACT_PATTERN. Skipping download."
echo "runid=" >> $GITHUB_OUTPUT
else
echo "Found matching artifact with run ID: $RUN_ID"
echo "runid=$RUN_ID" >> $GITHUB_OUTPUT
fi
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Download Artifact
if: ${{ steps.get_run_id.outputs.runid != '' }}
uses: actions/download-artifact@v4
with:
name: ${{ github.event.inputs.vllm-ascend-branch }}-${{ steps.get_basename.outputs.model_base_name }}-V${{ matrix.version }}-report
path: ./docs/source/developer_guide/evaluation/accuracy_report_bak
github-token: ${{ secrets.GITHUB_TOKEN }}
repository: ${{ github.repository }}
run-id: ${{ steps.get_run_id.outputs.runid }}
- name: Upload reports artifact
if: ${{ steps.get_run_id.outputs.runid != '' }}
uses: actions/upload-artifact@v4
with:
name: report-${{ steps.get_basename.outputs.model_base_name }}-v${{ matrix.version }}
path: ./docs/source/developer_guide/evaluation/accuracy_report_bak/*.md
retention-days: 90
create_pr:
runs-on: ubuntu-latest
needs: download_reports
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.vllm-ascend-branch }}
- name: Setup workspace
run: mkdir -p ./accuracy/accuracy_report
- name: Download only current run reports
uses: actions/download-artifact@v4
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Generate step summary
if: ${{ always() }}
run: |
for report in ./docs/source/developer_guide/evaluation/accuracy_report/*.md; do
filename=$(basename "$report")
# skip index.md
if [ "$filename" = "index.md" ]; then
continue
fi
if [ -f "$report" ]; then
{
echo -e "\n\n---\n"
echo "## 📄 Report File: $(basename $report)"
cat "$report"
} >> "$GITHUB_STEP_SUMMARY"
fi
done
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo "::: {toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: Create Pull Request
uses: peter-evans/create-pull-request@v7
with:
token: ${{ secrets.PR_TOKEN }}
base: ${{ github.event.inputs.vllm-ascend-branch }}
branch: auto-pr/accuracy-report
commit-message: "Update accuracy reports for ${{ github.event.inputs.vllm-ascend-branch }}"
add-paths: ./docs/source/developer_guide/evaluation/accuracy_report/*.md
title: "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-branch }}"
body: |
The accuracy results running on NPU Altlas A2 have changed, updating reports for:
${{
github.event.inputs.models == 'all'
&& 'All models (Qwen2.5-7B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)'
|| github.event.inputs.models
}}
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}

View File

@ -22,6 +22,9 @@
name: Benchmarks / accuracy
on:
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
pull_request:
types: [ labeled ]
workflow_dispatch:
@ -34,8 +37,8 @@ on:
# Current supported vLLM versions
options:
- main
- v0.9.0.1
- v0.9.0
- v0.9.2
- v0.9.1
- v0.7.3
vllm-ascend-version:
description: 'vllm-ascend version:'
@ -43,6 +46,7 @@ on:
type: choice
options:
- main
- v0.9.1-dev
- v0.7.3-dev
models:
description: 'model:'
@ -50,9 +54,9 @@ on:
type: choice
options:
- all
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-8B-Base
- Qwen/Qwen3-30B-A3B
default: 'all'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
@ -74,56 +78,56 @@ jobs:
${{
(contains(github.event.pull_request.labels.*.name, 'accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test')) &&
contains(github.event.pull_request.labels.*.name, 'ready-for-test') ||
github.event_name == 'workflow_dispatch'
github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
}}
runs-on: >-
${{
(matrix.model_name == 'Qwen/Qwen2.5-VL-7B-Instruct' && 'linux-arm64-npu-4') ||
(matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-arm64-npu-4') ||
'linux-arm64-npu-2'
}}
strategy:
matrix:
vllm_use_version: [0, 1]
# the accuracy test will run:
# 1. workflow_dispatch with models input
# - all: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - specified but not all: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - specified but not all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# 2. PR labeled with "*-accuracy-test"
# - accuracy-test: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
# - dense-accuracy-test: Qwen/Qwen2.5-7B-Instruct
# - accuracy-test: Qwen/Qwen3-8B-Base, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-30B-A3B
# - dense-accuracy-test: Qwen/Qwen3-8B-Base
# - vl-accuracy-test: Qwen/Qwen2.5-VL-7B-Instruct
# - moe-accuracy-test: Qwen/Qwen3-30B-A3B
model_name: ${{ fromJSON(
(github.event_name == 'schedule' &&
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'all' &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-7B-Instruct' &&
'["Qwen/Qwen2.5-7B-Instruct"]') ||
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-30B-A3B' &&
'["Qwen/Qwen3-30B-A3B"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
'["Qwen/Qwen3-8B-Base"]') ||
contains(github.event.pull_request.labels.*.name, 'accuracy-test') &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct"]' ||
'["Qwen/Qwen3-8B-Base","Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen3-30B-A3B"]' ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test') &&
'["Qwen/Qwen2.5-7B-Instruct"]' ||
'["Qwen/Qwen3-8B-Base"]' ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]'
'["Qwen/Qwen2.5-VL-7B-Instruct"]' ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') &&
'["Qwen/Qwen3-30B-A3B"]'
) }}
# Remove exclude after https://github.com/vllm-project/vllm-ascend/issues/1044 resolved
exclude:
- model_name: Qwen/Qwen2.5-VL-7B-Instruct
vllm_use_version: 1
fail-fast: false
name: ${{ matrix.model_name }} accuracy V${{ matrix.vllm_use_version }}
name: ${{ matrix.model_name }} accuracy
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
DATASET_SOURCE: ModelScope
VLLM_USE_MODELSCOPE: True
USE_MODELSCOPE_HUB: 1
# 1. If version specified (work_dispatch), do specified branch accuracy test
# 2. If no version (labeled PR), do accuracy test by default ref:
# The branch, tag or SHA to checkout. When checking out the repository that
@ -142,11 +146,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Install system dependencies
run: |
@ -159,12 +163,29 @@ jobs:
repository: vllm-project/vllm
path: ./vllm-empty
# Please also update this when bump matched version
ref: ${{ github.event.inputs.vllm-version || 'v0.9.0' }}
ref: ${{ github.event.inputs.vllm-version || 'v0.9.2' }}
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: VLLM_TARGET_DEVICE=empty pip install -e .
- name: Resolve vllm-ascend version
run: |
VERSION_INPUT="${{ github.event.inputs.vllm-ascend-version }}"
if [[ "$VERSION_INPUT" == "main" ]]; then
TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
LATEST_TAG=$(echo "$TAGS" | head -n1)
if [[ -z "$LATEST_TAG" ]]; then
RESOLVED_VERSION="main"
else
RESOLVED_VERSION="$LATEST_TAG"
fi
else
RESOLVED_VERSION="$VERSION_INPUT"
fi
echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
with:
@ -174,13 +195,32 @@ jobs:
- name: Install vllm-project/vllm-ascend
working-directory: ./vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -e .
pip install -v -e .
- name: Get vLLM commit hash and URL
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
- name: Get vLLM-Ascend commit hash and URL
working-directory: ./vllm-ascend
run: |
VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
- name: Print resolved hashes
run: |
echo "vLLM : ${{ env.VLLM_COMMIT }}"
echo "vLLM-Ascend: ${{ env.VLLM_ASCEND_COMMIT }}"
- name: Install lm-eval, ray, and datasets
run: |
pip install lm-eval
pip install lm-eval==0.4.8
- name: Collect version info
run: |
@ -201,7 +241,6 @@ jobs:
pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
echo "GHA_VLLM_ASCEND_VERSION=${{ github.event.inputs.vllm-ascend-version || github.ref }}"
} >> "$GITHUB_ENV"
- name: Print versions
@ -212,15 +251,14 @@ jobs:
echo "vLLM: ${{ env.GHA_VLLM_VERSION }}"
echo "vLLM Ascend: ${{ env.GHA_VLLM_ASCEND_VERSION }}"
- name: Run Accuracy Test for V${{ matrix.vllm_use_version }}
- name: Run Accuracy Test
id: report
working-directory: ./benchmarks
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
VLLM_USE_V1: ${{ matrix.vllm_use_version }}
run: |
model_base_name=$(basename ${{ matrix.model_name }})
markdown_name="${model_base_name}-V${{ matrix.vllm_use_version }}"
markdown_name="${model_base_name}"
echo "markdown_name=$markdown_name"
echo "markdown_name=$markdown_name" >> $GITHUB_OUTPUT
mkdir -p ./accuracy
@ -232,7 +270,9 @@ jobs:
--cann_version "${{ env.GHA_CANN_VERSION }}" \
--torch_npu_version "${{ env.GHA_TORCH_NPU_VERSION }}" \
--torch_version "${{ env.GHA_TORCH_VERSION }}" \
--vllm_version "${{ env.GHA_VLLM_VERSION }}"
--vllm_version "${{ env.GHA_VLLM_VERSION }}" \
--vllm_commit "${{ env.VLLM_COMMIT }}" \
--vllm_ascend_commit "${{ env.VLLM_ASCEND_COMMIT }}" \
- name: Generate step summary
if: ${{ always() }}
@ -244,12 +284,122 @@ jobs:
SAFE_VLLM_ASCEND_VERSION="${GHA_VLLM_ASCEND_VERSION//\//-}"
echo "SAFE_VLLM_ASCEND_VERSION=$SAFE_VLLM_ASCEND_VERSION" >> "$GITHUB_ENV"
- name: Upload Report for V${{ matrix.vllm_use_version }}
if: ${{ github.event_name == 'workflow_dispatch' }}
- name: Check report first line for failure
id: check_report
run: |
REPORT_PATH="./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md"
echo "Scanning $REPORT_PATH for ❌ …"
if grep -q '❌' "$REPORT_PATH"; then
echo "contains_fail=true" >> $GITHUB_OUTPUT
else
echo "contains_fail=false" >> $GITHUB_OUTPUT
fi
- name: Upload Report
if: ${{ github.event_name == 'workflow_dispatch' && steps.check_report.outputs.contains_fail == 'false' }}
uses: actions/upload-artifact@v4
with:
name: "${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}-report"
name: "report-${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}"
path: ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md
if-no-files-found: warn
retention-days: 90
overwrite: true
create_pr:
runs-on: ubuntu-latest
needs: accuracy_tests
if: ${{ github.event_name == 'workflow_dispatch' }}
env:
UPSTREAM_REPO: vllm-project/vllm-ascend
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: vllm-ascend-ci/vllm-ascend
token: ${{ secrets.PAT_TOKEN }}
ref: main
- name: Add upstream remote
run: |
git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
git fetch upstream
git remote -v
- name: Set Git user info dynamically
run: |
git config user.name "${{ github.actor }}"
git config user.email "${{ github.actor }}@users.noreply.github.com"
- name: Create or switch to branch
run: |
TIMESTAMP=$(date +%Y%m%d%H%M%S)
BRANCH_NAME="auto-pr/accuracy-report-${TIMESTAMP}"
echo "BRANCH_NAME=${BRANCH_NAME}" >> $GITHUB_ENV
git checkout -B "${BRANCH_NAME}" upstream/${{ github.event.inputs.vllm-ascend-version }}
- name: Download only current run reports
uses: actions/download-artifact@v4
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo ":::{toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: push accuracy report
env:
GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
run: |
git add ./docs/source/developer_guide/evaluation/accuracy_report/*.md
git commit -s -m "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}"
git push -f origin "${{ env.BRANCH_NAME }}"
- name: Create PR in upstream via API
uses: actions/github-script@v7
with:
github-token: ${{ secrets.PAT_TOKEN }}
script: |
const pr = await github.rest.pulls.create({
owner: 'vllm-project',
repo: 'vllm-ascend',
head: `vllm-ascend-ci:${{ env.BRANCH_NAME }}`,
base: '${{ github.event.inputs.vllm-ascend-version }}',
title: `[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}`,
body: `The accuracy results running on NPU Altlas A2 have changed, updating reports for:
${{
github.event.inputs.models == 'all'
&& 'All models (Qwen/Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)'
|| github.event.inputs.models
}}
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`
});
core.info(`Created PR #${pr.data.number}`);

View File

@ -1,53 +0,0 @@
#
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Adapted from vllm-project/vllm/blob/main/.github
#
name: Lint GitHub Actions workflows
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Run actionlint"
env:
SHELLCHECK_OPTS: --exclude=SC2046,SC2006,SC2086
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color

63
.github/workflows/format_pr_body.yaml vendored Normal file
View File

@ -0,0 +1,63 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: format / pr body
on:
# The PR updated when PR opened and push new commits
pull_request_target:
types: [opened, synchronize]
branches:
- 'main'
permissions:
pull-requests: write
jobs:
update-description:
name: update vLLM version
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Get vLLM version
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse HEAD)
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV
- name: Checkout repository
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python
uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
- name: Get vLLM release version
run: |
VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
echo "VLLM_VERSION=$VLLM_VERSION" >> $GITHUB_ENV
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
bash .github/format_pr_body.sh "${{ github.event.number }}" "${{ env.VLLM_VERSION }}" "${{ env.VLLM_COMMIT }}"

View File

@ -0,0 +1,117 @@
name: 'image / openEuler / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p-openeuler / vllm-ascend:*-dev-310p-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p-openeuler / vllm-ascend:v1.2.3rc1-310p-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-310p-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p-openeuler
# - pre/post/dev: v0.7.1rc1-310p-openeuler/v0.7.1rc1-310p-openeuler/v0.7.1rc1.dev1-310p-openeuler/v0.7.1.post1-310p-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p-openeuler
type=ref,event=pr,suffix=-310p-openeuler
type=pep440,pattern={{raw}},suffix=-310p-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.310p.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_310p_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p / vllm-ascend:*-dev-310p
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p / vllm-ascend:v1.2.3rc1-310p
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p
# - pre/post/dev: v0.7.1rc1-310p/v0.7.1rc1-310p/v0.7.1rc1.dev1-310p/v0.7.1.post1-310p, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p
type=ref,event=pr,suffix=-310p
type=pep440,pattern={{raw}},suffix=-310p
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.310p
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

117
.github/workflows/image_a3_openeuler.yml vendored Normal file
View File

@ -0,0 +1,117 @@
name: 'image / openEuler / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3-openeuler / vllm-ascend:v1.2.3rc1-a3-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3-openeuler
# - pre/post/dev: v0.7.1rc1-a3-openeuler/v0.7.1rc1-a3-openeuler/v0.7.1rc1.dev1-a3-openeuler/v0.7.1.post1-a3-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3-openeuler
type=ref,event=pr,suffix=-a3-openeuler
type=pep440,pattern={{raw}},suffix=-a3-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.a3.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_a3_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3|vllm-ascend:v1.2.3rc1-a3
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3 is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3
# - pre/post/dev: v0.7.1rc1-a3/v0.7.1rc1-a3/v0.7.1rc1.dev1-a3/v0.7.1.post1-a3, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3
type=ref,event=pr,suffix=-a3
type=pep440,pattern={{raw}},suffix=-a3
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.a3
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

.github/workflows/image_openeuler.yml

@ -1,4 +1,4 @@
name: 'image'
name: 'image / openEuler'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
@ -6,10 +6,9 @@ name: 'image'
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - commits are merge into main/*-dev ==> vllm-ascend:main-openeuler / vllm-ascend:*-dev-openeuler
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler|latest / vllm-ascend:v1.2.3rc1-openeuler
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler / vllm-ascend:v1.2.3rc1-openeuler
on:
pull_request:
branches:
@ -19,6 +18,12 @@ on:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
@ -33,9 +38,13 @@ on:
jobs:
build:
name: vllm-ascend openEuler image
runs-on: ubuntu-latest
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
@ -55,7 +64,7 @@ jobs:
# 1. branch job publishes per main/*-dev branch commit
# 2. main and dev pull_request is build only, so the tag pr-N-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-openeuler, latest
# - v0.7.1 --> v0.7.1-openeuler
# - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
@ -63,6 +72,8 @@ jobs:
type=ref,event=branch,suffix=-openeuler
type=ref,event=pr,suffix=-openeuler
type=pep440,pattern={{raw}},suffix=-openeuler
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -84,10 +95,15 @@ jobs:
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: linux/amd64,linux/arm64
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
@ -97,3 +113,4 @@ jobs:
file: Dockerfile.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

.github/workflows/image_ubuntu.yml

@ -1,4 +1,4 @@
name: 'image'
name: 'image / Ubuntu'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
@ -9,7 +9,7 @@ name: 'image'
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3|latest / vllm-ascend:v1.2.3rc1
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
on:
pull_request:
branches:
@ -19,6 +19,12 @@ on:
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
@ -33,7 +39,7 @@ on:
jobs:
build:
name: vllm-ascend Ubuntu image
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
@ -63,6 +69,8 @@ jobs:
type=ref,event=branch
type=ref,event=pr
type=pep440,pattern={{raw}}
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -84,15 +92,22 @@ jobs:
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: linux/amd64,linux/arm64
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@ -20,9 +20,10 @@ name: 'Benchmarks / Performance'
on:
schedule:
# Run at 02:00 everyday
- cron: '00 18 * * *'
# Run benchmarks at 20:00 and 03:00 Beijing time (UTC+8)
- cron: "0 12 * * *"
- cron: "0 19 * * *"
workflow_dispatch:
# Allow manual triggering of the workflow
@ -45,13 +46,15 @@ jobs:
test:
if: ${{ contains(github.event.pull_request.labels.*.name, 'performance-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}, use_v1=${{ matrix.vllm_use_v1 }}
runs-on: 'linux-arm64-npu-static-8'
strategy:
matrix:
include:
- vllm_branch: v0.9.0
- vllm_branch: v0.9.2
vllm_ascend_branch: main
vllm_use_v1: 1
max-parallel: 1
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
volumes:
@ -67,10 +70,10 @@ jobs:
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_USE_MODELSCOPE: True
ES_OM_DOMAIN: ${{ secrets.ES_OM_DOMAIN }}
ES_OM_AUTHORIZATION: ${{ secrets.ES_OM_AUTHORIZATION }}
VLLM_USE_V1: ${{ matrix.vllm_use_v1 }}
steps:
- name: Check npu and CANN info
run: |
@ -79,6 +82,8 @@ jobs:
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
- name: Install system dependencies
@ -109,7 +114,10 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install "transformers<=4.52.4"
pip install -e .
pip install -r benchmarks/requirements-bench.txt
@ -140,8 +148,8 @@ jobs:
- name: Install elastic_tool
if: github.event_name != 'pull_request'
run: |
pip install escli-tool==0.2.1
pip install escli-tool==0.2.3
- name: Collect pr info from vllm-project/vllm-ascend
if: github.event_name != 'pull_request'
run: |
@ -159,17 +167,19 @@ jobs:
cp -r benchmarks/* /github/home/benchmarks/
- name: Run benchmark iteration
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
if: github.event_name != 'pull_request'
run: |
while IFS= read -r line || [[ -n "$line" ]]; do
commit_id=${line%% *}
commit_title=${line#* }
commit_time=$(git show -s --format=%cd $commit_hash --date=iso-strict)
commit_time_no_tz=${commit_time::19}
git checkout $commit_id
commit_time=$(git show -s --format=%cd $commit_hash --date=iso-strict)
commit_time_no_tz=${commit_time::19}
pip install -e .
echo "------------------------"
echo "commit_id: $commit_id"
echo "commit_title: $commit_title"
@ -177,17 +187,21 @@ jobs:
echo "vllm branch: ${{ matrix.vllm_branch }}"
echo "vllm-ascend branch: ${{ matrix.vllm_ascend_branch }}"
echo "------------------------"
cd /github/home
bash benchmarks/scripts/run-performance-benchmarks.sh
# send the result to es
if [[ "${{ github.event_name }}" != "pull request" ]]; then
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results
rm -rf ./benchmarks/results
ERROR_MSG=""
if ! bash benchmarks/scripts/run-performance-benchmarks.sh; then
ERROR_MSG="Benchmark failed to run"
fi
# send the result to es
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results \
--error "$ERROR_MSG" \
--extra_feat '{"VLLM_USE_V1": "${{ matrix.vllm_use_v1 }}"}'
rm -rf ./benchmarks/results
cd -
done < commit_log.txt

.github/workflows/pre-commit.yml

@ -0,0 +1,37 @@
name: pre-commit
on:
workflow_call:
permissions:
contents: read
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: "3.10"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Install vllm
working-directory: vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
- name: Install vllm-ascend dev
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
env:
SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
with:
extra_args: --all-files --hook-stage manual

.github/workflows/release_code.yml

@ -32,20 +32,8 @@ on:
- 'CMakeLists.txt'
- 'csrc/**'
push:
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/release_code.yml'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
jobs:
build:

.github/workflows/release_whl.yml

@ -18,6 +18,9 @@
name: build / wheel
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
branches:
- 'main'
@ -33,21 +36,8 @@ on:
- 'CMakeLists.txt'
- 'csrc/**'
push:
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/release_whl.yml'
- '.github/Dockerfile.buildwheel'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
jobs:
build:
@ -55,7 +45,11 @@ jobs:
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
python-version: ['3.9', '3.10', '3.11']
# PR only trigger latest version
python-version: ${{ fromJSON(
(github.event_name == 'pull_request' && '["3.11"]') ||
'["3.9", "3.10", "3.11"]'
) }}
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -71,22 +65,51 @@ jobs:
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel:v1 .
docker run --rm \
-u $(id -u):$(id -g) \
-v $(pwd):/outpwd \
wheel:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Set up Python ${{ matrix.python-version }}
if: startsWith(github.ref, 'refs/tags/')
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')

.github/workflows/shellcheck.yml

@ -1,49 +0,0 @@
#
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Adapted from vllm-project/vllm/blob/main/.github
#
name: Lint shell scripts
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Check shell scripts"
run: |
tools/shellcheck.sh


@ -30,8 +30,8 @@ on:
- 'tests/e2e/common.sh'
- 'tests/e2e/run_doctests.sh'
schedule:
# Runs every 4 hours
- cron: '0 */4 * * *'
# Runs every 12 hours
- cron: '0 */12 * * *'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -46,7 +46,7 @@ jobs:
# Each version should be tested
fail-fast: false
matrix:
vllm_verison: [main, v0.7.3-dev, main-openeuler, v0.7.3-dev-openeuler]
vllm_verison: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
name: vLLM Ascend test
runs-on: linux-arm64-npu-1
container:
@ -65,34 +65,19 @@ jobs:
cd /vllm-workspace/vllm
git --no-pager log -1 || true
- name: Config OS mirrors - Ubuntu
if: ${{ !endsWith(matrix.vllm_verison, '-openeuler') }}
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y
apt install git curl -y
- name: Config OS mirrors - openEuler
if: ${{ endsWith(matrix.vllm_verison, '-openeuler') }}
run: |
yum update -y
yum install git curl -y
- name: Config pip mirrors
run: |
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Run vllm-ascend/tests/e2e/run_doctests.sh
run: |
# PWD: /__w/vllm-ascend/vllm-ascend
# Address old branch like v0.7.3:
if [ ! -d /vllm-workspace/vllm-ascend/tests/e2e ]; then
echo "Warning: the doctest path doesn't exists, copy now"
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
fi
# Make sure e2e tests are latest
echo "Replacing /vllm-workspace/vllm-ascend/tests/e2e ..."
rm -rf /vllm-workspace/vllm-ascend/tests/e2e
mkdir -p /vllm-workspace/vllm-ascend/tests
# Overwrite e2e and examples
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
cp -r examples /vllm-workspace/vllm-ascend/
# Simulate container to enter directory
cd /workspace

.github/workflows/vllm_ascend_test.yaml

@ -18,21 +18,13 @@
name: 'test'
on:
schedule:
- cron: '0 23 * * *'
push:
branches:
- 'main'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '*.txt'
- '**/*.py'
- '.github/workflows/vllm_ascend_test.yaml'
- '!docs/**'
- 'pytest.ini'
- '!benchmarks/**'
- 'tools/mypy.sh'
- 'mypy.ini'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -41,89 +33,119 @@ defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint:
uses: ./.github/workflows/pre-commit.yml
changes:
runs-on: ubuntu-latest
outputs:
e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
e2e_tracker:
- '.github/workflows/vllm_ascend_test.yaml'
- 'vllm_ascend/**'
- 'csrc/**'
- 'cmake/**'
- 'tests/e2e/**'
- 'CMakeLists.txt'
- 'setup.py'
- 'requirements.txt'
- 'requirements-dev.txt'
- 'requirements-lint.txt'
- 'packages.txt'
ut_tracker:
- 'tests/ut/**'
ut:
needs: [lint, changes]
name: unit test
# only trigger unit test after lint passed and the change is e2e and ut related.
if: ${{ needs.lint.result == 'success' && (needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.ut_tracker == 'true') }}
runs-on: ubuntu-latest
container:
image: quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
strategy:
matrix:
python-version: ["3.10"]
vllm_version: [main, v0.9.2]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
- name: Install packages
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Run codespell check
run: |
CODESPELL_EXCLUDES=('--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**')
CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn')
codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}" "${CODESPELL_IGNORE_WORDS[@]}"
- name: Analysing the code with ruff
run: |
echo "::add-matcher::.github/workflows/matchers/ruff.json"
ruff check --output-format github .
- name: Run isort
run: |
isort . --check-only
- name: Running yapf
run: |
python -m pip install --upgrade pip
pip install toml
pip install yapf==0.32.0
yapf --diff --recursive .
- name: Install dependencies
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: vllm-empty
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: vllm-empty
working-directory: ./vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
VLLM_TARGET_DEVICE=empty python3 -m pip install . --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip uninstall -y triton
- name: Mypy Check
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install vllm-project/vllm-ascend
run: |
echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1 ${{ matrix.python-version }}
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
python3 -m pip install -r requirements-dev.txt --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip install -v . --extra-index https://download.pytorch.org/whl/cpu/
- name: Run unit test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
TORCH_DEVICE_BACKEND_AUTOLOAD: 0
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut
- name: Upload coverage to Codecov
if: ${{ matrix.vllm_version == 'main' }}
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
flags: unittests
name: vllm-ascend
verbose: true
e2e:
needs: [lint]
if: ${{ needs.lint.result == 'success' }}
needs: [lint, changes]
# only trigger e2e test after lint passed and the change is e2e related with pull request.
if: ${{ github.event_name == 'pull_request' && needs.lint.result == 'success' && needs.changes.outputs.e2e_tracker == 'true' }}
strategy:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1, linux-arm64-npu-4]
vllm_version: [main, v0.9.0]
concurrency:
group: >
${{
matrix.os == 'linux-arm64-npu-4'
&& github.event.pull_request.number
&& format('pr-{0}-limit-npu-4', github.event.pull_request.number)
|| format('job-{0}-{1}-{2}', matrix.os, matrix.vllm_version, github.event.pull_request.number)
}}
cancel-in-progress: false
name: vLLM Ascend test
os: [linux-arm64-npu-1]
vllm_version: [main, v0.9.2]
name: singlecard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -132,11 +154,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
@ -159,64 +181,106 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test for V1 Engine
- name: Run e2e test
env:
VLLM_USE_V1: 1
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
VLLM_USE_MODELSCOPE=True pytest -sv tests/singlecard/test_offline_inference.py
pytest -sv tests/singlecard/test_scheduler.py
# guided decoding doesn't work, fix it later
# pytest -sv tests/singlecard/test_guided_decoding.py.py
# test_ascend_config.py should be ran separately because it will regenerate the global config many times.
pytest -sv tests/singlecard/test_ascend_config.py
pytest -sv tests/singlecard/test_camem.py
pytest -sv tests/singlecard/ \
--ignore=tests/singlecard/test_offline_inference.py \
--ignore=tests/singlecard/test_scheduler.py \
--ignore=tests/singlecard/test_guided_decoding.py \
--ignore=tests/singlecard/test_ascend_config.py \
--ignore=tests/singlecard/test_camem.py
else
pytest -sv tests/multicard/test_ilama_lora_tp2.py
# To avoid oom, we need to run the test in a single process.
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
fi
pytest -sv tests/e2e/singlecard/test_offline_inference.py
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
pytest -sv tests/e2e/singlecard/test_camem.py
pytest -sv tests/e2e/singlecard/test_embedding.py
pytest -sv tests/e2e/singlecard/ \
--ignore=tests/e2e/singlecard/test_offline_inference.py \
--ignore=tests/e2e/singlecard/test_ilama_lora.py \
--ignore=tests/e2e/singlecard/test_guided_decoding.py \
--ignore=tests/e2e/singlecard/test_camem.py \
--ignore=tests/e2e/singlecard/test_embedding.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
# ------------------------------------ v1 spec decode test ------------------------------------ #
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
# TODO: revert me when test_v1_spec_decode.py::test_ngram_correctness is fixed
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
- name: Run vllm-project/vllm-ascend test on V0 engine
env:
VLLM_USE_V1: 0
e2e-4-cards:
needs: [e2e]
if: ${{ needs.e2e.result == 'success' }}
strategy:
max-parallel: 1
matrix:
os: [linux-arm64-npu-4]
vllm_version: [main, v0.9.2]
name: multicard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
VLLM_USE_MODELSCOPE=True pytest -sv tests/singlecard/test_offline_inference.py
pytest -sv tests/singlecard/test_scheduler.py
# guided decoding doesn't work, fix it later
# pytest -sv tests/singlecard/test_guided_decoding.py.py
pytest -sv tests/singlecard/test_camem.py
# test_ascend_config.py should be ran separately because it will regenerate the global config many times.
pytest -sv tests/singlecard/test_ascend_config.py
pytest -sv tests/singlecard/test_prompt_embedding.py
pytest -sv tests/singlecard/ \
--ignore=tests/singlecard/test_offline_inference.py \
--ignore=tests/singlecard/test_scheduler.py \
--ignore=tests/singlecard/test_guided_decoding.py \
--ignore=tests/singlecard/test_camem.py \
--ignore=tests/singlecard/test_ascend_config.py \
--ignore=tests/singlecard/test_prompt_embedding.py
else
pytest -sv tests/multicard/test_ilama_lora_tp2.py
# Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py will raise error.
# To avoid oom, we need to run the test in a single process.
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
fi
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
# Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py will raise error.
# To avoid oom, we need to run the test in a single process.
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W8A8
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_dbo
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeekV3_dbo
pytest -sv tests/e2e/multicard/test_data_parallel.py
pytest -sv tests/e2e/multicard/ --ignore=tests/e2e/multicard/test_ilama_lora_tp2.py \
--ignore=tests/e2e/multicard/test_offline_inference_distributed.py \
--ignore=tests/e2e/multicard/test_data_parallel.py


@ -43,16 +43,15 @@ jobs:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1, linux-arm64-npu-4]
vllm_version: [main, v0.9.0]
vllm_version: [main, v0.9.2]
name: vLLM Ascend long term test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -61,11 +60,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
@ -88,6 +87,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
@ -95,12 +96,8 @@ jobs:
- name: Run vllm-project/vllm-ascend long term test
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
# spec decode test
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_v1_spec_decode.py
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_mtp_correctness.py # it needs a clean process
pytest -sv tests/long_term/spec_decode --ignore=tests/long_term/spec_decode/e2e/test_mtp_correctness.py --ignore=tests/long_term/spec_decode/e2e/test_v1_spec_decode.py --ignore=tests/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
pytest -sv tests/long_term/test_accuracy.py
pytest -sv tests/e2e/long_term/accuracy/accuracy_singlecard.py
else
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/test_deepseek_v2_lite_tp2_accuracy.py
# accuracy test multi card
pytest -sv tests/e2e/long_term/accuracy/accuracy_multicard.py
fi


@ -41,7 +41,11 @@ jobs:
if: ${{ contains(github.event.pull_request.labels.*.name, 'pd-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
strategy:
matrix:
vllm_verison: [main, v0.9.0]
vllm_verison: [
# revert me when V1 disaggregation prefill is merged in main
# main,
v0.9.1
]
name: vLLM Ascend prefilling decoding disaggregation test
runs-on: linux-arm64-npu-static-8
@ -60,8 +64,7 @@ jobs:
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -70,6 +73,7 @@ jobs:
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
@ -97,6 +101,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .

.gitignore

@ -196,3 +196,9 @@ kernel_meta/
# version file generated by setuptools-scm
/vllm_ascend/_version.py
# build info file generated by setup.py
/vllm_ascend/_build_info.py
/vllm_ascend/include/
# generated by CANN
fusion_result.json

.pre-commit-config.yaml

@ -0,0 +1,141 @@
default_install_hook_types:
- pre-commit
- commit-msg
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'examples/.*' # Exclude examples from all hooks by default
repos:
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
- id: codespell
args: [
--toml, pyproject.toml,
'--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
'-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn'
]
additional_dependencies:
- tomli
- repo: https://github.com/google/yapf
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
# Keep the same list from yapfignore here to avoid yapf failing without any inputs
exclude: '(.github|benchmarks|examples|docs)/.*'
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.7
hooks:
- id: ruff
args: [--output-format, github, --fix]
- id: ruff-format
files: ^(benchmarks|examples)/.*
- repo: https://github.com/crate-ci/typos
rev: v1.32.0
hooks:
- id: typos
- repo: https://github.com/PyCQA/isort
rev: 6.0.1
hooks:
- id: isort
# - repo: https://github.com/pre-commit/mirrors-clang-format
# rev: v20.1.3
# hooks:
# - id: clang-format
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
# - repo: https://github.com/jackdewinter/pymarkdown
# rev: v0.9.29
# hooks:
# - id: pymarkdown
# args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
- id: actionlint
- repo: local
hooks:
# For local development, you can run mypy using tools/mypy.sh script if needed.
# - id: mypy-local
# name: Run mypy for local Python installation
# entry: tools/mypy.sh 0 "local"
# language: system
# types: [python]
# stages: [pre-commit] # Don't run in CI
- id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.9
entry: tools/mypy.sh 1 "3.9"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.10
entry: tools/mypy.sh 1 "3.10"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.11
entry: tools/mypy.sh 1 "3.11"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.12
entry: tools/mypy.sh 1 "3.12"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
# FIXME: enable shellcheck
# - id: shellcheck
# name: Lint shell scripts
# entry: tools/shellcheck.sh
# language: script
# types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" "$(git rev-parse --git-path COMMIT_EDITMSG)"; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> "$(git rev-parse --git-path COMMIT_EDITMSG)"
fi
language: system
verbose: true
stages: [commit-msg]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
args:
- -c
- 'git ls-files | grep " " && echo "Filenames should not contain spaces!" && exit 1 || exit 0'
language: system
always_run: true
pass_filenames: false
- id: enforce-import-regex-instead-of-re
name: Enforce import regex as re
entry: python tools/enforce_regex_import.py
language: python
types: [python]
pass_filenames: false
additional_dependencies: [regex]
# Keep `suggestion` last
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
# Insert new entries above the `suggestion` entry

CMakeLists.txt

@ -96,5 +96,3 @@ target_link_libraries(
target_link_options(vllm_ascend_C PRIVATE "-Wl,-rpath,$ORIGIN:$ORIGIN/lib")
install(TARGETS vllm_ascend_C vllm_ascend_kernels DESTINATION ${VLLM_ASCEND_INSTALL_PATH})

CONTRIBUTING.md

@ -0,0 +1,3 @@
# Contributing to vLLM Ascend
You may find information about contributing to vLLM Ascend on [Developer Guide - Contributing](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html), including a step-by-step guide to help you set up a development environment, contribute your first PR, and test locally.

Dockerfile

@ -37,7 +37,7 @@ RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.0
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
@ -46,7 +46,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \

Dockerfile.310p

@ -0,0 +1,61 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.310p.openEuler

@ -0,0 +1,58 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3

@ -0,0 +1,60 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3.openEuler

@ -0,0 +1,57 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.openEuler

@ -34,7 +34,7 @@ COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.0
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
@ -43,7 +43,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ -
python3 -m pip cache purge
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \

README.md

@ -19,6 +19,10 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl//TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded, thanks for all contributors.
- [2025/05] We've released first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] vLLM community officially created [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
@ -38,15 +42,20 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
- Software:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (the same version as vllm-ascend)
## Getting Started
Please refer to [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
Please use the following recommended versions to get started quickly:
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1|Latest release candidate|[QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1|Latest stable version|[QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
## Contributing
See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/main/developer_guide/contributing.html) for more details, which is a step-by-step guide to help you set up development environment, build and test.
See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) for more details, which is a step-by-step guide to help you set up development environment, build and test.
We welcome and value any contributions and collaborations:
- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues)
@ -65,9 +74,10 @@ Below is maintained branches:
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.x branch |
| v0.7.1-dev | Unmaintained | Only doc fixed is allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version; only bug fixes are allowed and no new release tags will be published. |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/main/developer_guide/versioning_policy.html) for more details.
Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Weekly Meeting

README.zh.md

@ -20,6 +20,9 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] The [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It showcases cases such as LLaMA-Factory/verl/TRL/GPUStack and demonstrates how vLLM Ascend helps Ascend users improve their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] The [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! Every contribution deserves to be recorded; thanks to all contributors.
- [2025/05] We released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community on a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/CGDuMoB301Uytnrkc2oyjg) with the vLLM team! The slides are available [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo so that vLLM runs seamlessly on Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
@ -39,15 +42,20 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
- Software:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (same version as vllm-ascend)
## Getting Started
See the [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and the [Installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
The following recommended versions will get you started quickly:
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1| Latest release candidate |See the [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1| Latest stable version |See the [QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation guide](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
## Contributing
See the [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/main/developer_guide/contributing.html) document for more information on setting up a development environment, testing features, and PR submission conventions.
See the [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) document for more information on setting up a development environment, testing features, and PR submission conventions.
We welcome and value any form of contribution and collaboration:
- Please let us know about any bugs you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
@ -65,9 +73,10 @@ vllm-ascend has a main branch and dev branches.
|------------|------------|---------------------|
| main | Maintained | CI commitment for vLLM main branch |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version; only bug fixes are allowed and no new release tags will be published |
| v0.9.1-dev | Maintained | CI commitment for vLLM v0.9.1 version |
Please refer to the [Versioning policy](https://vllm-ascend.readthedocs.io/en/main/developer_guide/versioning_policy.html) for more details.
Please refer to the [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Weekly Meeting

View File

@ -1,5 +1,5 @@
# Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see the [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
@ -7,21 +7,21 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
- Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99). A short sketch of how these percentile metrics are computed follows this list.
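For illustration, the percentile metrics above can be derived from raw per-request measurements as in the following standalone sketch (the latency values are made up; this is not part of the benchmark scripts):
```python
import numpy as np

# Hypothetical per-request TTFT measurements in milliseconds.
ttft_ms = np.array([31.2, 28.7, 45.9, 30.1, 120.4, 29.8, 33.0, 27.5])

print("mean  :", ttft_ms.mean())
print("median:", np.median(ttft_ms))
print("p99   :", np.percentile(ttft_ms, 99))
```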
**Benchmarking Duration**: about 800 seconds per model.
@ -38,20 +38,129 @@ Before running the benchmarks, ensure the following:
pip install -r benchmarks/requirements-bench.txt
```
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it constructs random weights for the given model without downloading them from the internet, which can greatly reduce the benchmark time. Feel free to add your own models and parameters in the JSON to run your customized benchmarks.
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it constructs random weights for the given model without downloading them from the internet, which can greatly reduce the benchmark time.
- If you want to run customized benchmarks, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests). Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"trust_remote_code": "",
"max_model_len": 16384
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
}
]
```
This JSON is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A minimal parsing sketch is shown after the list below.
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- disable_log_stats: disables logging of performance statistics.
- disable_log_requests: disables logging of individual requests.
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Dataset Source: Hugging Face (hf)
- Dataset Split: train
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Number of Prompts: 200 (the total number of prompts used during the test)
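For illustration only, here is a minimal sketch of how such a test entry could be rendered into server and client flags. The file name `serving-tests.json` and the flag-conversion rule are assumptions made for this example, not the actual implementation used by the benchmark script:
```python
import json


def json_to_args(params: dict) -> str:
    """Render a parameter dict as CLI-style flags; empty values become bare switches."""
    parts = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        parts.append(flag if value == "" else f"{flag} {value}")
    return " ".join(parts)


# Hypothetical path; point this at the JSON file you want to inspect.
with open("benchmarks/tests/serving-tests.json") as f:
    tests = json.load(f)

for test in tests:
    print(test["test_name"], "qps:", test["qps_list"])
    print("  server:", json_to_args(test["server_parameters"]))
    print("  client:", json_to_args(test["client_parameters"]))
```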
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
```
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_sharegpt_qps_1.json
|-- serving_llama8B_tp1_sharegpt_qps_16.json
|-- serving_llama8B_tp1_sharegpt_qps_4.json
|-- serving_llama8B_tp1_sharegpt_qps_inf.json
|-- throughput_llama8B_tp1.json
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
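As a quick post-processing sketch (assuming the result layout above; the field names `request_throughput` and `mean_ttft_ms` are illustrative and should be checked against your own files), the serving results can be summarized like this:
```python
import json
from pathlib import Path

results_dir = Path("benchmarks/results")

for result_file in sorted(results_dir.glob("serving_*.json")):
    data = json.loads(result_file.read_text())
    # Field names are illustrative; inspect the JSON to confirm the exact keys.
    throughput = data.get("request_throughput")
    ttft = data.get("mean_ttft_ms")
    print(f"{result_file.name}: {throughput} req/s, mean TTFT {ttft} ms")
```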
### Use benchmark cli
For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:
```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```
2. Run the performance tests using the CLI:
```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
--hf-split train --endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 200 \
--request-rate 16
```
#### Offline
- **Throughput**
```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```
- **Latency**
```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
--load-format dummy --num-iters-warmup 5 --num-iters 15
```

View File

@ -0,0 +1,158 @@
from typing import Tuple
import numpy as np
import pytest
import torch
import torch_npu # noqa: F401
import vllm # noqa: F401
import vllm_ascend.platform # noqa: F401
def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
"""
Benchmark function for NPU operations
Args:
fn: Function to benchmark
num_iterations: Number of timing iterations
num_warmup_iterations: Number of warmup iterations
Returns:
float: Minimum elapsed time in seconds
"""
start = torch.npu.Event(enable_timing=True)
end = torch.npu.Event(enable_timing=True)
times = np.zeros(num_iterations + num_warmup_iterations)
# Run iterations
for i in range(num_warmup_iterations + num_iterations):
with torch.no_grad():
start.record()
fn() # Execute the function
end.record()
torch.npu.synchronize()
times[i] = start.elapsed_time(end)
# Remove warmup iterations and convert to seconds
times = times[num_warmup_iterations:]
elapsed_time = np.amin(times) / 1000
return elapsed_time
def get_masked_input_and_mask_ref(
input_: torch.Tensor,
org_vocab_start_index: int,
org_vocab_end_index: int,
num_org_vocab_padding: int,
added_vocab_start_index: int,
added_vocab_end_index: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Reference implementation for verification"""
org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
added_vocab_mask = (input_ >= added_vocab_start_index) & (
input_ < added_vocab_end_index
)
added_offset = (
added_vocab_start_index
- (org_vocab_end_index - org_vocab_start_index)
- num_org_vocab_padding
)
valid_offset = (org_vocab_start_index * org_vocab_mask) + (
added_offset * added_vocab_mask
)
vocab_mask = org_vocab_mask | added_vocab_mask
masked_input = vocab_mask * (input_ - valid_offset)
return masked_input, ~vocab_mask
DTYPES = [torch.int32]
SHAPES = [(3, 4, 5)]
DEVICES = [f"npu:{0}"]
SEEDS = [0]
@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("device", DEVICES)
@pytest.mark.parametrize("seed", SEEDS)
@torch.inference_mode()
def test_get_masked_input_and_mask(
shape: Tuple[int, ...],
dtype: torch.dtype,
device: str,
seed: int,
) -> None:
# Set random seed and device
torch.manual_seed(seed)
torch.set_default_device(device)
# Generate random input tensor
input_tensor = torch.randint(0, 1000, shape, dtype=dtype)
# Test parameters
test_case = {
"org_start": 100,
"org_end": 200,
"padding": 0,
"added_start": 300,
"added_end": 400,
}
# Define reference function
def ref_fn():
return get_masked_input_and_mask_ref(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Define custom function
def custom_fn():
return torch.ops._C.get_masked_input_and_mask(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Get results for correctness testing
ref_masked_input, ref_mask = ref_fn()
custom_masked_input, custom_mask = custom_fn()
# Benchmark both implementations
ref_time = benchmark_npu(ref_fn)
custom_time = benchmark_npu(custom_fn)
# Print performance results
print("\nPerformance Results:")
print(f"Reference implementation: {ref_time * 1000:.3f} ms")
print(f"Custom implementation: {custom_time * 1000:.3f} ms")
print(f"Speedup: {ref_time / custom_time:.2f}x")
# Compare results for correctness
ref_masked_input = ref_masked_input.to(dtype)
print("\nResults comparison:")
print("custom_masked_input:", custom_masked_input)
print("ref_masked_input:", ref_masked_input)
print("custom_mask:", custom_mask)
print("ref_mask:", ref_mask)
torch.testing.assert_close(
custom_masked_input,
ref_masked_input,
rtol=1e-5,
atol=1e-5,
msg=f"Masked input mismatch for case: {test_case}",
)
torch.testing.assert_close(
custom_mask,
ref_mask,
rtol=1e-5,
atol=1e-5,
msg=f"Mask mismatch for case: {test_case}",
)

View File

@ -1,5 +1,4 @@
pandas
datasets
modelscope
libcst
tabulate

View File

@ -49,36 +49,43 @@ def read_markdown(file):
def results_to_json(latency, throughput, serving):
return json.dumps({
'latency': latency.to_dict(),
'throughput': throughput.to_dict(),
'serving': serving.to_dict()
})
return json.dumps(
{
"latency": latency.to_dict(),
"throughput": throughput.to_dict(),
"serving": serving.to_dict(),
}
)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process the results of the benchmark tests.")
description="Process the results of the benchmark tests."
)
parser.add_argument(
"--results_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.")
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--output_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.")
parser.add_argument("--markdown_template",
type=str,
default="./perf_result_template.md",
help="The template file for the markdown report.")
parser.add_argument("--tag",
default="main",
help="Tag to be used for release message.")
parser.add_argument("--commit_id",
default="",
help="Commit ID to be used for release message.")
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--markdown_template",
type=str,
default="./perf_result_template.md",
help="The template file for the markdown report.",
)
parser.add_argument(
"--tag", default="main", help="Tag to be used for release message."
)
parser.add_argument(
"--commit_id", default="", help="Commit ID to be used for release message."
)
args = parser.parse_args()
results_folder = (CUR_PATH / args.results_folder).resolve()
@ -87,7 +94,6 @@ if __name__ == "__main__":
# collect results
for test_file in results_folder.glob("*.json"):
with open(test_file) as f:
raw_result = json.loads(f.read())
@ -111,7 +117,8 @@ if __name__ == "__main__":
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000
# add the result to raw_result
@ -129,55 +136,53 @@ if __name__ == "__main__":
continue
print(f"Skipping {test_file}")
serving_results.sort(key=lambda x: (len(x['test_name']), x['test_name']))
serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))
latency_results = pd.DataFrame.from_dict(latency_results)
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)
raw_results_json = results_to_json(latency_results, throughput_results,
serving_results)
raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(
latency_column_mapping.keys())].rename(
columns=latency_column_mapping)
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
if not serving_results.empty:
serving_results = serving_results[list(
serving_column_mapping.keys())].rename(
columns=serving_column_mapping)
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
if not throughput_results.empty:
throughput_results = throughput_results[list(
throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping)
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)
processed_results_json = results_to_json(latency_results,
throughput_results,
serving_results)
processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
tablefmt='pipe',
showindex=False)
serving_md_table = tabulate(serving_results,
headers='keys',
tablefmt='pipe',
showindex=False)
throughput_md_table = tabulate(throughput_results,
headers='keys',
tablefmt='pipe',
showindex=False)
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)
# document the result
print(output_folder)
with open(output_folder / "benchmark_results.md", "w") as f:
results = read_markdown(markdown_template)
results = results.format(
latency_tests_markdown_table=latency_md_table,
throughput_tests_markdown_table=throughput_md_table,
serving_tests_markdown_table=serving_md_table,
benchmarking_results_in_json_string=processed_results_json)
benchmarking_results_in_json_string=processed_results_json,
)
f.write(results)

View File

@ -1,68 +0,0 @@
from argparse import ArgumentParser
import libcst as cst
import libcst.matchers as m
# Patch the benchmark_dataset.py file to set streaming=False in load_dataset calls
# TODO(Potabk): Remove this patch when the issue is fixed in the upstream
class StreamingFalseTransformer(cst.CSTTransformer):
def __init__(self):
self.in_target_class = False
self.in_target_func = False
def visit_ClassDef(self, node):
if node.name.value == "HuggingFaceDataset":
self.in_target_class = True
def leave_ClassDef(self, original_node, updated_node):
self.in_target_class = False
return updated_node
def visit_FunctionDef(self, node):
if self.in_target_class and node.name.value == "load_data":
self.in_target_func = True
def leave_FunctionDef(self, original_node, updated_node):
self.in_target_func = False
return updated_node
def leave_Call(self, original_node, updated_node):
if self.in_target_class and self.in_target_func:
if m.matches(updated_node.func, m.Name("load_dataset")):
new_args = []
for arg in updated_node.args:
if arg.keyword and arg.keyword.value == "streaming":
new_arg = arg.with_changes(value=cst.Name("False"))
new_args.append(new_arg)
else:
new_args.append(arg)
return updated_node.with_changes(args=new_args)
return updated_node
def patch_file(path):
with open(path, "r", encoding="utf-8") as f:
source = f.read()
module = cst.parse_module(source)
modified = module.visit(StreamingFalseTransformer())
with open(path, "w", encoding="utf-8") as f:
f.write(modified.code)
print(f"Patched: {path}")
if __name__ == '__main__':
parser = ArgumentParser(
description=
"Patch benchmark_dataset.py to set streaming=False in load_dataset calls"
)
parser.add_argument("--path",
type=str,
help="Path to the benchmark_dataset.py file")
args = parser.parse_args()
patch_file(args.path)

View File

@ -1,5 +1,5 @@
#!/bin/bash
set -e
check_npus() {
# shellcheck disable=SC2155
@ -54,13 +54,20 @@ json2args() {
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
timeout 1200 bash -c '
until curl -s -X GET localhost:8000/health; do
echo "Waiting for vllm server to start..."
sleep 1
done' && return 0 || return 1
local waited=0
local timeout_sec=1200
while (( waited < timeout_sec )); do
if curl -s -X GET localhost:8000/health > /dev/null; then
return 0
fi
echo "Waiting for vllm server to start..."
sleep 1
((waited++))
done
echo "Timeout waiting for server"
return 1
}
get_cur_npu_id() {
@ -114,7 +121,7 @@ run_latency_tests() {
latency_params=$(echo "$params" | jq -r '.parameters')
latency_args=$(json2args "$latency_params")
latency_command="python3 vllm_benchmarks/benchmark_latency.py \
latency_command="vllm bench latency \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"
@ -157,7 +164,7 @@ run_throughput_tests() {
throughput_params=$(echo "$params" | jq -r '.parameters')
throughput_args=$(json2args "$throughput_params")
throughput_command="python3 vllm_benchmarks/benchmark_throughput.py \
throughput_command="vllm bench throughput \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"
@ -243,7 +250,7 @@ run_serving_tests() {
new_test_name=$test_name"_qps_"$qps
client_command="python3 vllm_benchmarks/benchmark_serving.py \
client_command="vllm bench serve \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
@ -271,17 +278,10 @@ cleanup_on_error() {
rm -rf $RESULTS_FOLDER
}
get_benchmarks_scripts() {
git clone -b main --depth=1 https://github.com/vllm-project/vllm.git && \
mv vllm/benchmarks vllm_benchmarks
rm -rf ./vllm
}
main() {
START_TIME=$(date +%s)
check_npus
# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
@ -294,12 +294,10 @@ main() {
export VLLM_LOG_LEVEL="WARNING"
# set env
export HF_ENDPOINT="https://hf-mirror.com"
export VLLM_USE_MODELSCOPE=True
# prepare for benchmarking
cd benchmarks || exit 1
get_benchmarks_scripts
python3 scripts/patch_benchmark_dataset.py --path vllm_benchmarks/benchmark_dataset.py
trap cleanup EXIT
QUICK_BENCHMARK_ROOT=./

View File

@ -21,108 +21,159 @@ import gc
import json
import multiprocessing
import sys
import time
from multiprocessing import Queue
import lm_eval
import torch
UNIMODAL_MODEL_NAME = ["Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen3-8B-Base"]
# URLs for version information in Markdown report
VLLM_URL = "https://github.com/vllm-project/vllm/commit/"
VLLM_ASCEND_URL = "https://github.com/vllm-project/vllm-ascend/commit/"
# Model and task configurations
UNIMODAL_MODEL_NAME = ["Qwen/Qwen3-8B-Base", "Qwen/Qwen3-30B-A3B"]
UNIMODAL_TASK = ["ceval-valid", "gsm8k"]
MULTIMODAL_NAME = ["Qwen/Qwen2.5-VL-7B-Instruct"]
MULTIMODAL_TASK = ["mmmu_val"]
batch_size_dict = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
# Batch size configurations per task
BATCH_SIZE = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
MODEL_RUN_INFO = {
"Qwen/Qwen2.5-7B-Instruct":
("export MODEL_ARGS='pretrained={model}, max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen3-8B-Base":
("export MODEL_ARGS='pretrained={model}, max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen2.5-VL-7B-Instruct":
("export MODEL_ARGS='pretrained={model}, max_model_len=8192,dtype=auto,tensor_parallel_size=4,max_images=2'\n"
"lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --batch_size 1"),
# Model type mapping (vllm for text, vllm-vlm for vision-language)
MODEL_TYPE = {
"Qwen/Qwen3-8B-Base": "vllm",
"Qwen/Qwen3-30B-A3B": "vllm",
"Qwen/Qwen2.5-VL-7B-Instruct": "vllm-vlm",
}
# Command templates for running evaluations
MODEL_RUN_INFO = {
"Qwen/Qwen3-30B-A3B": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen3-8B-Base": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen2.5-VL-7B-Instruct": (
"export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2'\n"
"lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --batch_size 1"
),
}
def run_accuracy_unimodal(queue, model, dataset):
# Evaluation metric filters per task
FILTER = {
"gsm8k": "exact_match,flexible-extract",
"ceval-valid": "acc,none",
"mmmu_val": "acc,none",
}
# Expected accuracy values for models
EXPECTED_VALUE = {
"Qwen/Qwen3-30B-A3B": {"ceval-valid": 0.83, "gsm8k": 0.85},
"Qwen/Qwen3-8B-Base": {"ceval-valid": 0.82, "gsm8k": 0.83},
"Qwen/Qwen2.5-VL-7B-Instruct": {"mmmu_val": 0.51},
}
PARALLEL_MODE = {
"Qwen/Qwen3-8B-Base": "TP",
"Qwen/Qwen2.5-VL-7B-Instruct": "TP",
"Qwen/Qwen3-30B-A3B": "EP",
}
# Execution backend configuration
EXECUTION_MODE = {
"Qwen/Qwen3-8B-Base": "ACLGraph",
"Qwen/Qwen2.5-VL-7B-Instruct": "ACLGraph",
"Qwen/Qwen3-30B-A3B": "ACLGraph",
}
# Model arguments for evaluation
MODEL_ARGS = {
"Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6",
"Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2",
"Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True",
}
# Whether to apply chat template formatting
APPLY_CHAT_TEMPLATE = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Whether to handle few-shot examples as multi-turn dialogues.
FEWSHOT_AS_MULTITURN = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Relative tolerance for accuracy checks
RTOL = 0.03
ACCURACY_FLAG = {}
def run_accuracy_test(queue, model, dataset):
"""Run accuracy evaluation for a model on a dataset in separate process"""
try:
model_args = f"pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6"
results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks=dataset,
apply_chat_template=True,
fewshot_as_multiturn=True,
batch_size=batch_size_dict[dataset],
num_fewshot=5,
)
print(f"Success: {model} on {dataset}")
eval_params = {
"model": MODEL_TYPE[model],
"model_args": MODEL_ARGS[model],
"tasks": dataset,
"apply_chat_template": APPLY_CHAT_TEMPLATE[model],
"fewshot_as_multiturn": FEWSHOT_AS_MULTITURN[model],
"batch_size": BATCH_SIZE[dataset],
}
if MODEL_TYPE[model] == "vllm":
eval_params["num_fewshot"] = 5
results = lm_eval.simple_evaluate(**eval_params)
print(f"Success: {model} on {dataset} ")
measured_value = results["results"]
queue.put(measured_value)
except Exception as e:
print(f"Error in run_accuracy_unimodal: {e}")
print(f"Error in run_accuracy_test: {e}")
queue.put(e)
sys.exit(1)
finally:
torch.npu.empty_cache()
if "results" in locals():
del results
gc.collect()
def run_accuracy_multimodal(queue, model, dataset):
try:
model_args = f"pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=4,max_images=2"
results = lm_eval.simple_evaluate(
model="vllm-vlm",
model_args=model_args,
tasks=dataset,
apply_chat_template=True,
fewshot_as_multiturn=True,
batch_size=batch_size_dict[dataset],
)
print(f"Success: {model} on {dataset}")
measured_value = results["results"]
queue.put(measured_value)
except Exception as e:
print(f"Error in run_accuracy_multimodal: {e}")
queue.put(e)
sys.exit(1)
finally:
torch.npu.empty_cache()
gc.collect()
time.sleep(5)
def generate_md(model_name, tasks_list, args, datasets):
run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name,
datasets=datasets)
"""Generate Markdown report with evaluation results"""
# Format the run command
run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name, datasets=datasets)
model = model_name.split("/")[1]
preamble = f"""# 🎯 {model} Accuracy Test
<div>
<strong>vLLM version:</strong> vLLM: {args.vllm_version}, vLLM Ascend: {args.vllm_ascend_version} <br>
</div>
<div>
<strong>Software Environment:</strong> CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version} <br>
</div>
<div>
<strong>Hardware Environment</strong>: Atlas A2 Series <br>
</div>
<div>
<strong>Datasets</strong>: {datasets} <br>
</div>
<div>
<strong>Command</strong>:
```bash
{run_cmd}
```
</div>
<div>&nbsp;</div>
# Version information section
version_info = (
f"**vLLM Version**: vLLM: {args.vllm_version} "
f"([{args.vllm_commit}]({VLLM_URL + args.vllm_commit})), "
f"vLLM Ascend: {args.vllm_ascend_version} "
f"([{args.vllm_ascend_commit}]({VLLM_ASCEND_URL + args.vllm_ascend_commit})) "
)
# Report header with system info
preamble = f"""# {model}
{version_info}
**Software Environment**: CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version}
**Hardware Environment**: Atlas A2 Series
**Datasets**: {datasets}
**Parallel Mode**: {PARALLEL_MODE[model_name]}
**Execution Mode**: {EXECUTION_MODE[model_name]}
**Command**:
```bash
{run_cmd}
```
"""
header = (
@ -131,6 +182,7 @@ def generate_md(model_name, tasks_list, args, datasets):
)
rows = []
rows_sub = []
# Process results for each task
for task_dict in tasks_list:
for key, stats in task_dict.items():
alias = stats.get("alias", key)
@ -153,25 +205,48 @@ def generate_md(model_name, tasks_list, args, datasets):
n_shot = "5"
else:
n_shot = "0"
row = (f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| ↑ {value:>5.4f} "
f"| ± {stderr:>5.4f} |")
flag = ACCURACY_FLAG.get(task_name, "")
row = (
f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| {flag}{value:>5.4f} "
f"| ± {stderr:>5.4f} |"
)
if not task_name.startswith("-"):
rows.append(row)
rows_sub.append("<details>" + "\n" + "<summary>" + task_name +
" details" + "</summary>" + "\n" * 2 + header)
rows_sub.append(
"<details>"
+ "\n"
+ "<summary>"
+ task_name
+ " details"
+ "</summary>"
+ "\n" * 2
+ header
)
rows_sub.append(row)
rows_sub.append("</details>")
md = preamble + "\n" + header + "\n" + "\n".join(rows) + "\n" + "\n".join(
rows_sub) + "\n"
# Combine all Markdown sections
md = (
preamble
+ "\n"
+ header
+ "\n"
+ "\n".join(rows)
+ "\n"
+ "\n".join(rows_sub)
+ "\n"
)
print(md)
return md
def safe_md(args, accuracy, datasets):
"""
Safely generate and save Markdown report from accuracy results.
"""
data = json.loads(json.dumps(accuracy))
for model_key, tasks_list in data.items():
md_content = generate_md(model_key, tasks_list, args, datasets)
@ -181,37 +256,50 @@ def safe_md(args, accuracy, datasets):
def main(args):
"""Main evaluation workflow"""
accuracy = {}
accuracy[args.model] = []
result_queue: Queue[float] = multiprocessing.Queue()
if args.model in UNIMODAL_MODEL_NAME:
datasets = ",".join(UNIMODAL_TASK)
for dataset in UNIMODAL_TASK:
p = multiprocessing.Process(target=run_accuracy_unimodal,
args=(result_queue, args.model,
dataset))
p.start()
datasets = UNIMODAL_TASK
else:
datasets = MULTIMODAL_TASK
datasets_str = ",".join(datasets)
# Evaluate model on each dataset
for dataset in datasets:
accuracy_expected = EXPECTED_VALUE[args.model][dataset]
p = multiprocessing.Process(
target=run_accuracy_test, args=(result_queue, args.model, dataset)
)
p.start()
p.join()
if p.is_alive():
p.terminate()
p.join()
result = result_queue.get()
print(result)
accuracy[args.model].append(result)
if args.model in MULTIMODAL_NAME:
datasets = ",".join(MULTIMODAL_TASK)
for dataset in MULTIMODAL_TASK:
p = multiprocessing.Process(target=run_accuracy_multimodal,
args=(result_queue, args.model,
dataset))
p.start()
p.join()
result = result_queue.get()
print(result)
accuracy[args.model].append(result)
gc.collect()
torch.npu.empty_cache()
time.sleep(10)
result = result_queue.get()
print(result)
if (
accuracy_expected - RTOL
< result[dataset][FILTER[dataset]]
< accuracy_expected + RTOL
):
ACCURACY_FLAG[dataset] = ""
else:
ACCURACY_FLAG[dataset] = ""
accuracy[args.model].append(result)
print(accuracy)
safe_md(args, accuracy, datasets)
safe_md(args, accuracy, datasets_str)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
multiprocessing.set_start_method("spawn", force=True)
# Initialize argument parser
parser = argparse.ArgumentParser(
description="Run model accuracy evaluation and generate report"
)
parser.add_argument("--output", type=str, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--vllm_ascend_version", type=str, required=False)
@ -219,8 +307,7 @@ if __name__ == "__main__":
parser.add_argument("--torch_npu_version", type=str, required=False)
parser.add_argument("--vllm_version", type=str, required=False)
parser.add_argument("--cann_version", type=str, required=False)
parser.add_argument("--vllm_commit", type=str, required=False)
parser.add_argument("--vllm_ascend_commit", type=str, required=False)
args = parser.parse_args()
# TODO(yikun):
# 1. add a exit 1 if accuracy is not as expected
# 2. Add ✅, ❌ to markdown if accuracy is not as expected
main(args)

View File

@ -9,5 +9,15 @@
"num_iters_warmup": 5,
"num_iters": 15
}
},
{
"test_name": "latency_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]

View File

@ -18,7 +18,7 @@
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"endpoint_type": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
@ -44,7 +44,31 @@
},
"client_parameters": {
"model": "Qwen/Qwen3-8B",
"backend": "vllm",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen2_5_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200

View File

@ -22,6 +22,17 @@
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
},
{
"test_name": "throughput_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]

View File

@ -1,6 +1,5 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -13,5 +12,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
import vllm_ascend.patch.platform.patch_0_9_0.patch_distributed # noqa
coverage:
status:
# non-voting, new code must be fully tested
patch:
default:
target: 100%
# non-voting
informational: true
# non-voting
project:
default:
# non-voting
informational: true

View File

@ -1,241 +0,0 @@
/*
* Copyright (c) China Merchants Bank Co., Ltd. 2025. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "kernel_operator.h"
constexpr int32_t BUFFER_NUM = 1;
class KernelAdvanceStep{
public:
__aicore__ inline KernelAdvanceStep() {}
__aicore__ inline void Init(int32_t tasks_per_core,
int32_t num_queries,
__gm__ int64_t* input_tokens_ptr,
__gm__ int64_t* sampled_token_ids_ptr,
__gm__ int64_t* input_positions_ptr,
__gm__ int32_t* seq_lens_ptr,
__gm__ int32_t* slot_mapping_ptr)
{
this->tasks_per_core = tasks_per_core;
this->start_id = this->tasks_per_core * AscendC::GetBlockIdx();
this->end_id = this->tasks_per_core * (AscendC::GetBlockIdx() + 1) - 1;
// actual task nums of each core
this->actual_task_per_core = tasks_per_core;
if(this->end_id >= num_queries) {
this->actual_task_per_core = num_queries - this->start_id;
this->end_id = num_queries - 1;
}
int32_t offset_this_core = this->tasks_per_core * AscendC::GetBlockIdx();
// init outQues
pipe.InitBuffer(outQueInputTokens, BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
pipe.InitBuffer(outQueInputPos, BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
pipe.InitBuffer(outQueSeqLen, BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
pipe.InitBuffer(outQueSlotMapping, BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
// init inQues
pipe.InitBuffer(inQueSeqLen,BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
pipe.InitBuffer(inQueSampledTokenIds,BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
// init GlobalMemory
inputTokensGm.SetGlobalBuffer((__gm__ int64_t *)input_tokens_ptr + offset_this_core, this->actual_task_per_core);
sampledTokenIdsGm.SetGlobalBuffer((__gm__ int64_t *)sampled_token_ids_ptr + offset_this_core, this->actual_task_per_core);
inputPositionsGm.SetGlobalBuffer((__gm__ int64_t *)input_positions_ptr + offset_this_core, this->actual_task_per_core);
seqLensGm.SetGlobalBuffer((__gm__ int32_t *)seq_lens_ptr + offset_this_core, this->actual_task_per_core);
slotMappingGm.SetGlobalBuffer((__gm__ int32_t *)slot_mapping_ptr + offset_this_core, this->actual_task_per_core);
}
__aicore__ inline void Process(int64_t block_size, __gm__ int32_t* block_tables_ptr, int64_t block_tables_stride)
{
// no need for tiling or pipeline parallelism within each core, as the amount of data processed is very small
CopyIn();
Update(block_size, block_tables_ptr, block_tables_stride);
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<int32_t> seqLenLocalIn = inQueSeqLen.AllocTensor<int32_t>();
AscendC::LocalTensor<int64_t> sampledTokenIdsLocal = inQueSampledTokenIds.AllocTensor<int64_t>();
AscendC::DataCopyExtParams copyParams32{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int32_t)), 0, 0, 0}; // blockLen = tasks_per_core * 32 / 8 bytes (int32 is 4 bytes)
AscendC::DataCopyExtParams copyParams64{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int64_t)), 0, 0, 0}; // blockLen = tasks_per_core * 64 / 8 bytes (int64 is 8 bytes)
// calculate the nums that need padded
// so that the total length becomes a multiple of 32 bytes which is a requirement of DataCopy Function.
uint8_t remainNum32 =this->actual_task_per_core * sizeof(int32_t) % 32;
uint8_t needPadElements32 = remainNum32 == 0 ? remainNum32 : (32 - remainNum32) / sizeof(int32_t);
AscendC::DataCopyPadExtParams<int32_t> padParams32{true, 0, needPadElements32, 0};
// calculate the nums that need padded
// so that the total length becomes a multiple of 32 bytes which is a requirement of DataCopy Function.
uint8_t remainNum64 =this->actual_task_per_core * sizeof(int64_t) % 32;
uint8_t needPadElements64 = remainNum64 == 0 ? remainNum64 : (32 - remainNum64) / sizeof(int64_t);
AscendC::DataCopyPadExtParams<int64_t> padParams64{true, 0, needPadElements64, 0};
AscendC::DataCopyPad(seqLenLocalIn, seqLensGm, copyParams32, padParams32);
AscendC::DataCopyPad(sampledTokenIdsLocal, sampledTokenIdsGm, copyParams64, padParams64);
inQueSeqLen.EnQue(seqLenLocalIn);
inQueSampledTokenIds.EnQue(sampledTokenIdsLocal);
}
__aicore__ inline void Update(int64_t block_size, __gm__ int32_t* block_tables_ptr, int64_t block_tables_stride)
{
// input
AscendC::LocalTensor<int32_t> seqLenLocalIn = inQueSeqLen.DeQue<int32_t>();
AscendC::LocalTensor<int64_t> sampledTokenIdsLocal = inQueSampledTokenIds.DeQue<int64_t>();
// output
AscendC::LocalTensor<int64_t> inputTokensLocal = outQueInputTokens.AllocTensor<int64_t>();
AscendC::LocalTensor<int64_t> inputPosLocal = outQueInputPos.AllocTensor<int64_t>();
AscendC::LocalTensor<int32_t> seqLenLocalOut = outQueSeqLen.AllocTensor<int32_t>();
AscendC::LocalTensor<int32_t> slotMappingLocal = outQueSlotMapping.AllocTensor<int32_t>();
auto unary_params = AscendC::UnaryRepeatParams(1, 1, 8, 8);
//Use "for" instead of AscendC::Adds function because AscendC::Adds does not work
//when srcLocalMemory has different datatype from dstLocalMemory
for(int i=0; i < this->actual_task_per_core; i++) {
inputTokensLocal.SetValue(i, sampledTokenIdsLocal.GetValue(i));
inputPosLocal.SetValue(i, seqLenLocalIn.GetValue(i));
}
AscendC::Adds<int32_t, false>(seqLenLocalOut, seqLenLocalIn, 1, (uint64_t)0, 1, unary_params);
// Gather blockTables with dim=1, block_index. No Ascend Function available, use "for" instead.
for(int cur_query_id = this->start_id, i = 0; i < this->actual_task_per_core; cur_query_id++, i++) {
__gm__ int32_t const* seq_block_tables_ptr = block_tables_ptr + block_tables_stride * cur_query_id;
int block_index = inputPosLocal.GetValue(i) / block_size;
int block_offset = inputPosLocal.GetValue(i) % block_size;
int slot_num = seq_block_tables_ptr[block_index] * block_size + block_offset;
// Update slot_mapping
slotMappingLocal.SetValue(i,slot_num);
}
outQueInputTokens.EnQue(inputTokensLocal);
outQueInputPos.EnQue(inputPosLocal);
outQueSeqLen.EnQue(seqLenLocalOut);
outQueSlotMapping.EnQue(slotMappingLocal);
inQueSampledTokenIds.FreeTensor(sampledTokenIdsLocal);
inQueSeqLen.FreeTensor(seqLenLocalIn);
}
__aicore__ inline void CopyOut()
{
AscendC::DataCopyExtParams copyParams32{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int32_t)),0,0,0};
AscendC::DataCopyExtParams copyParams64{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int64_t)),0,0,0};
AscendC::LocalTensor<int64_t> inputTokensLocal = outQueInputTokens.DeQue<int64_t>();
AscendC::DataCopyPad(inputTokensGm, inputTokensLocal, copyParams64);
outQueInputTokens.FreeTensor(inputTokensLocal);
AscendC::LocalTensor<int64_t> inputPosLocal = outQueInputPos.DeQue<int64_t>();
AscendC::DataCopyPad(inputPositionsGm, inputPosLocal, copyParams64);
outQueInputPos.FreeTensor(inputPosLocal);
AscendC::LocalTensor<int32_t> seqLenLocalOut = outQueSeqLen.DeQue<int32_t>();
AscendC::DataCopyPad(seqLensGm, seqLenLocalOut, copyParams32);
outQueSeqLen.FreeTensor(seqLenLocalOut);
AscendC::LocalTensor<int32_t> slotMappingLocal = outQueSlotMapping.DeQue<int32_t>();
AscendC::DataCopyPad(slotMappingGm, slotMappingLocal, copyParams32);
outQueSlotMapping.FreeTensor(slotMappingLocal);
}
private:
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueInputTokens, outQueInputPos,
outQueSeqLen, outQueSlotMapping;
AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueSeqLen,
inQueSampledTokenIds,
inQueBlockTables;
AscendC::GlobalTensor<int64_t> inputTokensGm, sampledTokenIdsGm, inputPositionsGm ;
AscendC::GlobalTensor<int32_t> seqLensGm, slotMappingGm, blockTablesGm;
int32_t tasks_per_core, start_id, end_id, actual_task_per_core;
};
extern "C" __global__ __aicore__ void AdvanceStepFlashAttnKernel(
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
__gm__ int64_t* input_tokens_ptr,
__gm__ int64_t* sampled_token_ids_ptr,
__gm__ int64_t* input_positions_ptr,
__gm__ int32_t* seq_lens_ptr,
__gm__ int32_t* slot_mapping_ptr,
__gm__ int32_t* block_tables_ptr,
int64_t block_tables_stride,
int32_t tasks_per_core
)
{
int start_id = tasks_per_core * AscendC::GetBlockIdx();
// no task for this core.
if(start_id >= num_queries) {
return;
}
KernelAdvanceStep advanceStep;
advanceStep.Init(tasks_per_core, num_queries, input_tokens_ptr, sampled_token_ids_ptr, input_positions_ptr, seq_lens_ptr, slot_mapping_ptr);
advanceStep.Process(block_size,block_tables_ptr,block_tables_stride);
}
namespace vllm_ascend
{
extern void launch_advance_step_flashattn(
void* stream,
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
int64_t* input_tokens_ptr,
int64_t* sampled_token_ids_ptr,
int64_t* input_positions_ptr,
int32_t* seq_lens_ptr,
int32_t* slot_mapping_ptr,
int32_t* block_tables_ptr,
int64_t block_tables_stride)
{
int32_t num_cores = 20;
if(num_cores > num_queries) {
num_cores = num_queries;
}
// task num processed of each core
int32_t tasks_per_core = (num_queries + num_cores - 1) / num_cores;
AdvanceStepFlashAttnKernel<<<num_cores, nullptr, stream>>>(
num_seqs,
num_queries,
block_size,
input_tokens_ptr,
sampled_token_ids_ptr,
input_positions_ptr,
seq_lens_ptr,
slot_mapping_ptr,
block_tables_ptr,
block_tables_stride,
tasks_per_core);
}
}

View File

@ -0,0 +1,378 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*/
#include "kernel_operator.h"
#include "kernel_tensor_impl.h"
#include "kernel_type.h"
#include "types.h"
#include "utils.h"
using vllm_ascend::AccType;
template<typename scalar_t>
class GetMaskedInputAndMask {
public:
__aicore__ inline GetMaskedInputAndMask() {}
__aicore__ inline ~GetMaskedInputAndMask() {
pipe.Reset();
}
__aicore__ inline void Init(
__gm__ scalar_t* input,
__gm__ scalar_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size)
{
// Initialize basic parameters
input_ = input;
masked_input_ = masked_input;
mask_out_ = mask_out;
org_vocab_start_index_ = org_vocab_start_index;
org_vocab_end_index_ = org_vocab_end_index;
size_ = ((size + 31) / 32) * 32;
added_offset_ = added_vocab_start_index -
(org_vocab_end_index - org_vocab_start_index) -
num_org_vocab_padding;
added_vocab_start_index_ = added_vocab_start_index;
added_vocab_end_index_ = added_vocab_end_index;
// Initialize global tensors
inputGlobal.SetGlobalBuffer(input);
maskedOutputGlobal.SetGlobalBuffer(masked_input);
maskOutGlobal.SetGlobalBuffer(mask_out);
// Initialize queues
pipe.InitBuffer(inQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(outQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(maskQueue, 1, size_ * sizeof(bool));
// Initialize calculation buffers
// NOTE: calc_buf_1 and calc_buf_2 are also used for int16 casting on older archs.
pipe.InitBuffer(calc_buf_1, size_ * sizeof(float));
pipe.InitBuffer(calc_buf_2, size_ * sizeof(float));
// Initialize result queues
pipe.InitBuffer(result_ge_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_le_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_org_mask_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_add_mask_que, BUFFER_NUM, size_ * sizeof(float));
// Initialize temporary buffers
pipe.InitBuffer(start_buf, size_ * sizeof(float));
pipe.InitBuffer(end_buf, size_ * sizeof(float));
pipe.InitBuffer(inputFloat_buf, size_ * sizeof(float)); // Also used for half intermediate in casting
pipe.InitBuffer(validOffset_buf, size_ * sizeof(float));
pipe.InitBuffer(vocabMask_buf_, size_ * sizeof(int8_t));
pipe.InitBuffer(ones_buf_, size_ * sizeof(float));
}
__aicore__ inline void Process()
{
CopyIn();
Compute();
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.AllocTensor<scalar_t>();
AscendC::DataCopy(inputLocal, inputGlobal, size_);
inQueue.EnQue(inputLocal);
}
__aicore__ inline void CompareWithValue(
AscendC::LocalTensor<int8_t>& result,
const AscendC::LocalTensor<float>& input,
const AscendC::LocalTensor<float>& compare_value,
bool is_greater_equal) {
AscendC::LocalTensor<float> compute_buf = calc_buf_1.Get<float>();
if (is_greater_equal) {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compare_value, compute_buf, size_);
} else {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compute_buf, compare_value, size_);
}
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::Mins(compute_buf, compute_buf, MIN_ACCURACY_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_2_FP32, size_);
AscendC::Adds(compute_buf, compute_buf, NEGATIVE_ONE_FP32, size_);
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::LocalTensor<half> compute_buf_fp16 = calc_buf_2.Get<half>();
AscendC::Cast(compute_buf_fp16, compute_buf, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(result, compute_buf_fp16, AscendC::RoundMode::CAST_NONE, size_);
}
__aicore__ inline void ComputeRangeMask(
AscendC::LocalTensor<int8_t>& range_mask,
const AscendC::LocalTensor<float>& input,
const float start_value,
const float end_value) {
AscendC::LocalTensor<float> start_value_tensor = start_buf.Get<float>();
AscendC::LocalTensor<float> end_value_tensor = end_buf.Get<float>();
AscendC::Duplicate(start_value_tensor, start_value, size_);
AscendC::Duplicate(end_value_tensor, end_value, size_);
AscendC::LocalTensor<int8_t> ge_result = result_ge_que.AllocTensor<int8_t>();
AscendC::LocalTensor<int8_t> lt_result = result_le_que.AllocTensor<int8_t>();
CompareWithValue(ge_result, start_value_tensor, input, true);
CompareWithValue(lt_result, input, end_value_tensor, false);
#if (__CCE_AICORE__ >= 220)
AscendC::And(range_mask, ge_result, lt_result, size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 And. Use int16 And.
AscendC::LocalTensor<int16_t> ge_result_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> lt_result_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> range_mask_i16 = ge_result_i16;
// Use a temporary buffer for half type
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, ge_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(ge_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, lt_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(lt_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform And on int16_t tensors
AscendC::And(range_mask_i16, ge_result_i16, lt_result_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, range_mask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(range_mask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
}
__aicore__ inline void Compute() {
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.DeQue<scalar_t>();
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.AllocTensor<scalar_t>();
AscendC::LocalTensor<int8_t> maskLocal = maskQueue.AllocTensor<int8_t>();
AscendC::LocalTensor<float> inputFloat = inputFloat_buf.Get<float>();
AscendC::Cast(inputFloat, inputLocal, AscendC::RoundMode::CAST_NONE, size_);
AscendC::LocalTensor<int8_t> orgVocabMask = result_org_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(orgVocabMask,
inputFloat,
static_cast<float>(org_vocab_start_index_),
static_cast<float>(org_vocab_end_index_));
AscendC::LocalTensor<int8_t> addedVocabMask = result_add_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(addedVocabMask,
inputFloat,
static_cast<float>(added_vocab_start_index_),
static_cast<float>(added_vocab_end_index_));
AscendC::LocalTensor<float> validOffset = validOffset_buf.Get<float>();
AscendC::LocalTensor<float> constOrgStartIndex = start_buf.Get<float>();
AscendC::Duplicate(constOrgStartIndex, float(org_vocab_start_index_), size_);
AscendC::LocalTensor<half> orgVocabMask_fp16;
AscendC::LocalTensor<float> orgVocabMask_fp32;
AscendC::Cast(orgVocabMask_fp16, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_fp32, orgVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(validOffset, constOrgStartIndex, orgVocabMask_fp32, size_);
AscendC::LocalTensor<float> addedOffset;
AscendC::LocalTensor<float> addedOffsetTensor = end_buf.Get<float>();
AscendC::Duplicate(addedOffsetTensor, float(added_offset_), size_);
AscendC::LocalTensor<half> addedVocabMask_fp16;
AscendC::LocalTensor<float> addedVocabMask_fp32;
AscendC::Cast(addedVocabMask_fp16, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_fp32, addedVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(addedOffset, addedOffsetTensor, addedVocabMask_fp32, size_);
AscendC::Add(validOffset, validOffset, addedOffset, size_);
AscendC::LocalTensor<int8_t> vocabMask = vocabMask_buf_.Get<int8_t>();
#if (__CCE_AICORE__ >= 220)
AscendC::Or(vocabMask,
orgVocabMask,
addedVocabMask,
size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 Or. Use int16 Or.
AscendC::LocalTensor<int16_t> orgVocabMask_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> addedVocabMask_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> vocabMask_i16 = orgVocabMask_i16;
// Use a temporary buffer for half type. inputFloat_buf is free now.
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform Or on int16_t tensors
AscendC::Or(vocabMask_i16, orgVocabMask_i16, addedVocabMask_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, vocabMask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
AscendC::Sub(inputFloat, inputFloat, validOffset, size_);
AscendC::LocalTensor<half> vocabMask_fp16;
AscendC::LocalTensor<float> vocabMask_fp32;
AscendC::Cast(vocabMask_fp16, vocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask_fp32, vocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(inputFloat, inputFloat, vocabMask_fp32, size_);
AscendC::Cast(maskedLocal, inputFloat, AscendC::RoundMode::CAST_CEIL, size_);
outQueue.EnQue(maskedLocal);
AscendC::LocalTensor<float> ones_tensor = ones_buf_.Get<float>();
AscendC::Duplicate(ones_tensor, (float)1, size_);
AscendC::LocalTensor<float> maskLocal_fp32;
AscendC::Sub(maskLocal_fp32, ones_tensor, vocabMask_fp32, size_);
AscendC::LocalTensor<half> maskLocal_fp16;
AscendC::Cast(maskLocal_fp16, maskLocal_fp32, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(maskLocal, maskLocal_fp16, AscendC::RoundMode::CAST_NONE, size_);
maskQueue.EnQue(maskLocal);
inQueue.FreeTensor(inputLocal);
}
__aicore__ inline void CopyOut()
{
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.DeQue<scalar_t>();
AscendC::LocalTensor<bool> maskLocal = maskQueue.DeQue<bool>();
AscendC::DataCopy(maskedOutputGlobal, maskedLocal, size_);
AscendC::DataCopy(maskOutGlobal, maskLocal, size_);
outQueue.FreeTensor(maskedLocal);
maskQueue.FreeTensor(maskLocal);
}
private:
static constexpr int32_t BUFFER_NUM = 2;
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueue;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue, maskQueue;
AscendC::GlobalTensor<scalar_t> inputGlobal, maskedOutputGlobal;
AscendC::GlobalTensor<bool> maskOutGlobal;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_1;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_2;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_ge_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_le_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_org_mask_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_add_mask_que;
// Temporary buffers
AscendC::TBuf<AscendC::TPosition::VECCALC> start_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> end_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> inputFloat_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> validOffset_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> vocabMask_buf_;
AscendC::TBuf<AscendC::TPosition::VECCALC> ones_buf_;
__gm__ scalar_t *input_, *masked_input_;
__gm__ bool *mask_out_;
int64_t size_;
int64_t org_vocab_start_index_, org_vocab_end_index_;
int64_t added_vocab_start_index_, added_vocab_end_index_;
int64_t added_offset_;
static constexpr float MIN_ACCURACY_FP32 = 1.1754943508222875e-38;
static constexpr float MAX_MUL_1_FP32 = 1125899906842624;
static constexpr float MAX_MUL_2_FP32 = 67108864;
static constexpr float NEGATIVE_ONE_FP32 = -1.0f;
};
extern "C" __global__ __aicore__ void get_masked_input_and_mask_kernel(
__gm__ int32_t* input,
__gm__ int32_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
{
GetMaskedInputAndMask<int32_t> op{};
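// Grid-stride split across AIV cores: core GetBlockIdx() handles chunks i, i + aiv_num, i + 2*aiv_num, ..., each chunk covering size/loop_cnt elements.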
for (int64_t i = AscendC::GetBlockIdx(); i < loop_cnt; i += aiv_num) {
op.Init(input + i * size/loop_cnt,
masked_input + i * size/loop_cnt,
mask_out + i * size/loop_cnt,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index, size/loop_cnt);
op.Process();
}
} // op destructor called here
}
namespace vllm_ascend {
void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
get_masked_input_and_mask_kernel<<<aiv_num, nullptr, stream>>>(
static_cast<int32_t*>(input),
static_cast<int32_t*>(masked_input),
static_cast<bool*>(mask_out),
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aiv_num);
}
} // namespace vllm_ascend

View File

@ -30,7 +30,11 @@ using vllm_ascend::local_mem_copy;
template <typename scalar_t, bool isNeox> class RotaryEmbedding {
// NOTE(ganyi): we use 512B as load stride for pipe, need to find another way to
// retrieve this size from runtime for more Soc support
#if (__CCE_AICORE__ >= 220)
static int constexpr loadSize = 512;
#else
static int constexpr loadSize = 1024 * 4;
#endif
using dst_t = scalar_t;
using acc_t = typename AccType<scalar_t>::type;
// only half tensor have cast instruct to int8, hardcode acc_dst_t as half
@ -326,7 +330,9 @@ private:
// Declare all the kernel entry here
ROPE_CUSTOM_KERNEL_DECLARE(half)
#if (__CCE_AICORE__ >= 220)
ROPE_CUSTOM_KERNEL_DECLARE(bfloat16_t)
#endif
namespace vllm_ascend {
@ -342,7 +348,7 @@ namespace vllm_ascend {
reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim);
// maximum number for runtime to launch a ascendc kernel.
// we use this to constrain the maximum number of block size
static const int64_t maxParallelSize = 65535;
@ -357,9 +363,13 @@ extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, in
int blockDim = maxParallelSize > numTokens ? numTokens : maxParallelSize;
if (type == AscendType::FP16) {
ROTARY_EMBEDDING_KERNEL_CALL(half);
}
#if (__CCE_AICORE__ >= 220)
else if (type == AscendType::BF16) {
ROTARY_EMBEDDING_KERNEL_CALL(bfloat16_t);
}
#endif
else {
return;
}
}

View File

@ -20,9 +20,11 @@ namespace vllm_ascend {
template <typename scalar_t> struct AccType;
#if (__CCE_AICORE__ >= 220)
template <> struct AccType<bfloat16_t> {
using type = float;
};
#endif
template <> struct AccType<half> {
using type = half;

View File

@ -31,6 +31,20 @@ namespace vllm_ascend {
const int headSize, const int64_t numTokens, const uint32_t loopCnt,
uint32_t aivNum);
extern void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num);
torch::Tensor weak_ref_tensor(torch::Tensor& tensor) {
if (!tensor.is_privateuseone()) {
throw std::runtime_error("Tensor must be on NPU device");
@ -46,16 +60,4 @@ namespace vllm_ascend {
auto new_tensor = at_npu::native::from_blob(data_ptr, sizes, strides, options);
return new_tensor;
}
extern void launch_advance_step_flashattn(
void* stream,
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
int64_t* input_tokens_ptr,
int64_t* sampled_token_ids_ptr,
int64_t* input_positions_ptr,
int32_t* seq_lens_ptr,
int32_t* slot_mapping_ptr,
int32_t* block_tables_ptr,
int64_t block_tables_stride);
}

View File

@ -99,85 +99,110 @@ std::tuple<at::Tensor, at::Tensor> rotary_embedding(at::Tensor &positions, at::T
return {query_dst, key_dst};
}
void verify_tensor(std::string const& name, at::Tensor const& t,
int64_t const size_0, int64_t const size_1,
c10::ScalarType const type) {
bool size_0_cond = true;
if (size_0 != -1) {
size_0_cond = t.size(0) == size_0;
}
std::tuple<at::Tensor, at::Tensor> get_masked_input_and_mask(
at::Tensor &input,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index)
/*
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L161-L198
Embedding parallelized in the vocabulary dimension.
bool size_1_cond = true;
if (size_1 != -1) {
size_1_cond = t.size(1) == size_1;
}
Adapted from torch.nn.Embedding, note that we pad the vocabulary size to
make sure it is divisible by the number of model parallel GPUs.
bool is_contiguous = t.is_contiguous();
bool same_type = t.dtype() == type;
In order to support various loading methods, we ensure that LoRA-added
embeddings are always at the end of TP-sharded tensors. In other words,
we shard base embeddings and LoRA embeddings separately (both padded),
and place them in the same tensor.
In this example, we will have the original vocab size = 1010,
added vocab size = 16 and padding to 64. Therefore, the total
vocab size with padding will be 1088 (because we first pad 1010 to
1024, add 16, and then pad to 1088).
Therefore, the tensor format looks like the following:
TP1, rank 0 (no sharding):
|< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
corresponding token_id: | 0 | 1 | ... | 1009 | -1 | ... | -1 | 1010 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |
bool pass = size_0_cond && size_1_cond && is_contiguous && same_type;
if (!pass) {
TORCH_CHECK(false, "tensor: name = ", name, ", shape = ", t.sizes(),
" is_cont = ", t.is_contiguous(), ", type = ", t.dtype(),
" is not as expected: shape = [", size_0, ", ", size_1,
"], type = ", type);
}
}
TP2, rank 0:
|< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
corresponding token_id: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 1000 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 527 | 520 | ... | 543 |
TP2, rank 1:
|< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1 | ... | -1 | -1 | ... | -1 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 519 | 520 | ... | 543 |
Parameters:
org_vocab_start_index //base embeddings start
org_vocab_end_index //base embeddings end
num_org_vocab_padding //base embeddings padding
added_vocab_start_index //LoRA embeddings start
added_vocab_end_index //LoRA embeddings end
*/
{
// Input validation
TORCH_CHECK(input.dim() >= 1, "input must have at least 1 dimension");
TORCH_CHECK(org_vocab_start_index >= 0, "org_vocab_start_index must be non-negative");
TORCH_CHECK(org_vocab_end_index >= org_vocab_start_index, "org_vocab_end_index must be greater than org_vocab_start_index");
TORCH_CHECK(num_org_vocab_padding >= 0, "num_org_vocab_padding must be non-negative");
TORCH_CHECK(added_vocab_start_index >= org_vocab_end_index, "added_vocab_start_index must be greater than org_vocab_end_index");
TORCH_CHECK(added_vocab_end_index >= added_vocab_start_index, "added_vocab_end_index must be greater than added_vocab_start_index");
// Get total number of elements
int64_t size = input.numel();
void advance_step_flashattn_ascendc(
int64_t num_seqs, int64_t num_queries, int64_t block_size,
at::Tensor& input_tokens,
at::Tensor& sampled_token_ids,
at::Tensor& input_positions,
at::Tensor& seq_lens,
at::Tensor& slot_mapping,
at::Tensor& block_tables
){
// Verify all tensors
verify_tensor("input_tokens", input_tokens, num_seqs, -1, at::kLong);
verify_tensor("sampled_token_ids", sampled_token_ids, num_queries, 1,at::kLong);
verify_tensor("input_positions", input_positions, num_seqs, -1, at::kLong);
verify_tensor("seq_lens", seq_lens, num_seqs, -1, at::kInt);
verify_tensor("slot_mapping", slot_mapping, num_seqs, -1, at::kInt);
verify_tensor("block_tables", block_tables, num_seqs, -1, at::kInt);
int64_t* input_tokens_ptr = input_tokens.data_ptr<int64_t>();
int64_t* sampled_token_ids_ptr = sampled_token_ids.data_ptr<int64_t>();
int64_t* input_positions_ptr = input_positions.data_ptr<int64_t>();
int32_t* seq_lens_ptr = seq_lens.data_ptr<int32_t>();
int32_t* slot_mapping_ptr = slot_mapping.data_ptr<int32_t>();
int32_t* block_tables_ptr = block_tables.data_ptr<int32_t>();
int32_t device_id;
aclrtGetDevice(&device_id);
auto npu_stream = c10_npu::getCurrentNPUStream(device_id);
aclrtStream stream = npu_stream.stream();
// aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
// Create output tensors
at::Tensor masked_input = at::empty_like(input);
at::Tensor mask = at::empty_like(input).to(at::kBool);
// Get data pointers
void *input_ptr = input.data_ptr();
void *masked_input_ptr = masked_input.data_ptr();
void *mask_ptr = mask.data_ptr();
// Get current stream
aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
// Get scalar type
at::ScalarType scalar_type = input.scalar_type();
// Create and configure OpCommand
at_npu::native::OpCommand cmd;
cmd.Name("advance_step_flashattn_ascendc");
cmd.SetCustomHandler([stream, num_seqs, num_queries,
block_size, input_tokens_ptr, sampled_token_ids_ptr,
input_positions_ptr, seq_lens_ptr, slot_mapping_ptr,
block_tables_ptr, block_tables]() -> int {
launch_advance_step_flashattn(stream,
num_seqs,
num_queries,
block_size,
input_tokens_ptr,
sampled_token_ids_ptr,
input_positions_ptr,
seq_lens_ptr,
slot_mapping_ptr,
block_tables_ptr,
block_tables.stride(0));
cmd.Name("get_masked_input_and_mask");
cmd.SetCustomHandler([scalar_type, size, stream,
input_ptr, masked_input_ptr, mask_ptr,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index]() -> int {
// Get platform info
fe::PlatFormInfos platform_infos;
int device_id = 0;
fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
uint32_t loop_cnt = (size + aivNum - 1) / aivNum;
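// loop_cnt is ceil(size / aivNum): the flattened input is split into this many chunks, which the kernel's grid-stride loop spreads across the AIV cores.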
// Call implementation
get_masked_input_and_mask_impl(
stream,
input_ptr,
masked_input_ptr,
mask_ptr,
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aivNum);
return 0;
});
cmd.Run();
return ;
return {masked_input, mask};
}
} // namespace vllm_ascend
@ -194,11 +219,15 @@ TORCH_LIBRARY_EXPAND(_C, ops)
" Tensor! key, int head_size,"
" Tensor cos_sin_cache, bool is_neox) -> (Tensor query, Tensor key)");
ops.impl("rotary_embedding", torch::kPrivateUse1, &vllm_ascend::rotary_embedding);
ops.def(
"advance_step_flashattn_ascendc(int num_seqs, int num_queries, int block_size,"
" Tensor! input_tokens, Tensor! sampled_token_ids, Tensor! input_positions,"
" Tensor! seq_lens, Tensor! slot_mapping, Tensor! block_tables) -> ()");
ops.impl("advance_step_flashattn_ascendc", torch::kPrivateUse1, &vllm_ascend::advance_step_flashattn_ascendc);
"get_masked_input_and_mask(Tensor input, "
" int org_vocab_start_index, "
" int org_vocab_end_index, "
" int num_org_vocab_padding, "
" int added_vocab_start_index, "
" int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)");
ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask);
}
REGISTER_EXTENSION(_C)
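For reference, the semantics implemented by `get_masked_input_and_mask` can be sketched in plain PyTorch, mirroring the Python reference linked in the comment above. This is an illustrative sketch of the math only, not the Ascend kernel itself, and the toy parameters in the usage example are made up for illustration; how the registered op is actually invoked (e.g. through `torch.ops._C`) should be verified against how the extension is loaded.

```python
import torch


def get_masked_input_and_mask_ref(input_ids: torch.Tensor,
                                  org_vocab_start_index: int,
                                  org_vocab_end_index: int,
                                  num_org_vocab_padding: int,
                                  added_vocab_start_index: int,
                                  added_vocab_end_index: int):
    # Tokens owned by this shard's base (org) range and LoRA-added range.
    org_mask = (input_ids >= org_vocab_start_index) & (input_ids < org_vocab_end_index)
    added_mask = (input_ids >= added_vocab_start_index) & (input_ids < added_vocab_end_index)
    # Added-vocab tokens land right after the padded base shard in the local table.
    added_offset = (added_vocab_start_index
                    - (org_vocab_end_index - org_vocab_start_index)
                    - num_org_vocab_padding)
    valid_offset = org_vocab_start_index * org_mask + added_offset * added_mask
    vocab_mask = org_mask | added_mask
    # Out-of-shard tokens are remapped to 0; the returned mask marks them as invalid.
    return vocab_mask * (input_ids - valid_offset), ~vocab_mask


# Toy parameters: base shard covers ids [0, 512), added shard covers [1010, 1018).
ids = torch.tensor([3, 500, 700, 1012])
masked, invalid = get_masked_input_and_mask_ref(
    ids, org_vocab_start_index=0, org_vocab_end_index=512,
    num_org_vocab_padding=0, added_vocab_start_index=1010,
    added_vocab_end_index=1018)
# masked -> [3, 500, 0, 514], invalid -> [False, False, True, False]
```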

View File

@ -1,2 +1,2 @@
pytest-asyncio
pytest-mock

Binary file not shown (image, 115 KiB).

View File

@ -7,6 +7,7 @@
| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
| Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
| Shoujian Zheng| [@jianzs](https://github.com/jianzs) | 2025/06 |
## Contributors
@ -16,6 +17,23 @@ Updated on 2025-06-10:
| Number | Contributor | Date | Commit ID |
|:------:|:-----------:|:-----:|:---------:|
| 83 | [@ZhengWG](https://github.com/) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
| 82 | [@wm901115nwpu](https://github.com/) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
| 81 | [@Agonixiaoxiao](https://github.com/) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
| 80 | [@zhanghw0354](https://github.com/zhanghw0354) | 2025/7/2 | [d3df9a5](https://github.com/vllm-project/vllm-ascend/commit/9fb3d558e5b57a3c97ee5e11b9f5dba6ad3df9a5) |
| 79 | [@GDzhu01](https://github.com/GDzhu01) | 2025/6/28 | [de256ac](https://github.com/vllm-project/vllm-ascend/commit/b308a7a25897b88d4a23a9e3d583f4ec6de256ac) |
| 78 | [@leo-pony](https://github.com/leo-pony) | 2025/6/26 | [3f2a5f2](https://github.com/vllm-project/vllm-ascend/commit/10253449120307e3b45f99d82218ba53e3f2a5f2) |
| 77 | [@zeshengzong](https://github.com/zeshengzong) | 2025/6/26 | [3ee25aa](https://github.com/vllm-project/vllm-ascend/commit/192dbbcc6e244a8471d3c00033dc637233ee25aa) |
| 76 | [@sharonyunyun](https://github.com/sharonyunyun) | 2025/6/25 | [2dd8666](https://github.com/vllm-project/vllm-ascend/commit/941269a6c5bbc79f6c1b6abd4680dc5802dd8666) |
| 75 | [@Pr0Wh1teGivee](https://github.com/Pr0Wh1teGivee) | 2025/6/25 | [c65dd40](https://github.com/vllm-project/vllm-ascend/commit/2fda60464c287fe456b4a2f27e63996edc65dd40) |
| 74 | [@xleoken](https://github.com/xleoken) | 2025/6/23 | [c604de0](https://github.com/vllm-project/vllm-ascend/commit/4447e53d7ad5edcda978ca6b0a3a26a73c604de0) |
| 73 | [@lyj-jjj](https://github.com/lyj-jjj) | 2025/6/23 | [5cbd74e](https://github.com/vllm-project/vllm-ascend/commit/5177bef87a21331dcca11159d3d1438075cbd74e) |
| 72 | [@farawayboat](https://github.com/farawayboat) | 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392) |
| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94) |
| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f) |
| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12| [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
| 68 | [@chenwaner](https://github.com/chenwaner) | 2025/6/11 | [c696169](https://github.com/vllm-project/vllm-ascend/commit/e46dc142bf1180453c64226d76854fc1ec696169) |
| 67 | [@yzim](https://github.com/yzim) | 2025/6/11 | [aaf701b](https://github.com/vllm-project/vllm-ascend/commit/4153a5091b698c2270d160409e7fee73baaf701b) |
| 66 | [@Yuxiao-Xu](https://github.com/Yuxiao-Xu) | 2025/6/9 | [6b853f1](https://github.com/vllm-project/vllm-ascend/commit/6b853f15fe69ba335d2745ebcf14a164d0bcc505) |
| 65 | [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU) | 2025/6/7 | [20dedba](https://github.com/vllm-project/vllm-ascend/commit/20dedba5d1fc84b7ae8b49f9ce3e3649389e2193) |
| 64 | [@zxdukki](https://github.com/zxdukki) | 2025/6/7 | [87ebaef](https://github.com/vllm-project/vllm-ascend/commit/87ebaef4e4e519988f27a6aa378f614642202ecf) |

View File

@ -0,0 +1,19 @@
# User Stories
Read case studies on how users and developers solve real, everyday problems with vLLM Ascend.
- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models. It has supported vLLM Ascend to speed up inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gaining a 2x inference performance improvement.
- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO, it uses vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPU.
- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plug-in library developed by Huawei on Ascend hardware, which includes self-developed large language model optimization algorithms and optimizations related to the inference engine framework. It supports vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2), see more GPUStack performance evaluation info on [link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient and production-ready RL training library for large language models (LLMs), uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0), see more info on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
:::{toctree}
:caption: More details
:maxdepth: 1
llamafactory
:::

View File

@ -0,0 +1,19 @@
# LLaMA-Factory
**About / Introduction**
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.
**The Business Challenge**
LLaMA-Factory used transformers to perform inference on Ascend NPU, but the speed was slow.
**Solving Challenges and Benefits with vLLM Ascend**
With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the performance of LLaMA-Factory in the model inference stage has been significantly improved. According to the test results, the inference speed of LLaMA-Factory increased to 2x that of the Transformers version.
**Learn more**
See more about LLaMA-Factory and how it uses vLLM Ascend for inference on the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).

View File

@ -22,6 +22,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | MindIE Turbo |
|-------------|--------------|------------------|-------------|--------------------|--------------|
| v0.9.2rc1 | v0.9.2 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250619 | |
| v0.9.1rc1 | v0.9.1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250528 | |
| v0.9.0rc2 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.9.0rc1 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.8.5rc1 | v0.8.5.post1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
@ -35,6 +37,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| Date | Event |
|------------|-------------------------------------------|
| 2025.07.11 | Release candidates, v0.9.2rc1 |
| 2025.06.22 | Release candidates, v0.9.1rc1 |
| 2025.06.10 | Release candidates, v0.9.0rc2 |
| 2025.06.09 | Release candidates, v0.9.0rc1 |
| 2025.05.29 | v0.7.x post release, v0.7.3.post1 |
@ -72,8 +76,8 @@ Usually, each minor version of vLLM (such as 0.7) will correspond to a vLLM Asce
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.2 branch |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |

View File

@ -23,6 +23,7 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import json
import os
# import sys
@ -64,17 +65,19 @@ myst_substitutions = {
# the branch of vllm, used in vllm clone
# - main branch: 'main'
# - vX.Y.Z branch: 'vX.Y.Z'
'vllm_version': 'v0.9.2',
# the branch of vllm-ascend, used in vllm-ascend clone and image tag
# - main branch: 'main'
# - vX.Y.Z branch: latest vllm-ascend release tag
'vllm_ascend_version': 'v0.9.2rc1',
# the newest release version of vllm-ascend and matched vLLM, used in pip install.
# This value should be updated when cut down release.
'pip_vllm_ascend_version': "0.9.2rc1",
'pip_vllm_version': "0.9.2",
# CANN image tag
'cann_image_tag': "8.1.rc1-910b-ubuntu22.04-py3.10",
# vllm version in ci
'ci_vllm_version': 'v0.9.2',
}
# Add any paths that contain templates here, relative to this directory.
@ -133,3 +136,7 @@ if READTHEDOCS_VERSION_TYPE == "tag":
def setup(app):
pass
if __name__ == "__main__":
print(json.dumps(myst_substitutions))

View File

@ -4,7 +4,7 @@
It's recommended to set up a local development environment to build and test
before you submit a PR.
### Setup development environment
Theoretically, the vllm-ascend build is only supported on Linux because the `vllm-ascend` dependency `torch_npu` only supports Linux.
@ -12,73 +12,64 @@ Theoretically, the vllm-ascend build is only supported on Linux because
But you can still set up a dev env on Linux/Windows/macOS for linting and basic tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
python3 -m venv .venv
source ./.venv/bin/activate
# Clone vllm code and install
git clone https://github.com/vllm-project/vllm.git
# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
# Install lint requirement and enable pre-commit hook
pip install -r requirements-lint.txt
# Run lint (the first time, you need to install the pre-commit deps, via a proxy network if necessary)
bash format.sh
```
#### Run CI locally
After completing the "Run lint" setup, you can run the CI locally:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# Running the CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..
# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
# Install requirements
cd vllm-ascend
# install system requirement
apt install -y gcc g++ cmake libnuma-dev
# install project requirement
# For Linux:
pip install -r requirements-dev.txt
# For non Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
# Then you can run lint and mypy test
bash format.sh
# Run ci:
bash format.sh ci
```
# Build:
# - only supported on Linux (torch_npu available)
# pip install -e .
# - build without deps for debugging in other OS
# pip install -e . --no-deps
# - build without custom ops
# COMPILE_CUSTOM_KERNELS=0 pip install -e .
#### Submit the commit
```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```
🎉 Congratulations! You have completed the development environment setup.
### Test locally
Although the vllm-ascend CI provides integration tests on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can also run them locally. The simplest way to run these integration tests locally is through a container:
```bash
# Under Ascend NPU environment
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export IMAGE=vllm-ascend-dev-image
export CONTAINER_NAME=vllm-ascend-dev
export DEVICE=/dev/davinci1
# The first build will take about 10 mins (10MB/s) to download the base image and packages
docker build -t $IMAGE -f ./Dockerfile .
# You can also specify the mirror repo via setting VLLM_REPO to speedup
# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
docker run --rm --name $CONTAINER_NAME --network host --device $DEVICE \
--device /dev/davinci_manager --device /dev/devmm_svm \
--device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-ti $IMAGE bash
cd vllm-ascend
pip install -r requirements-dev.txt
pytest tests/
```
You can also refer to the [Testing](./testing.md) doc for help setting up the testing environment and running tests locally.
## DCO and Signed-off-by
@ -111,3 +102,10 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc and help other developers.
:::{toctree}
:caption: Index
:maxdepth: 1
testing
:::

View File

@ -0,0 +1,280 @@
# Testing
This section explains how to write e2e tests and unit tests to verify the implementation of your feature.
## Setup test environment
The fastest way to set up the test environment is to use the main branch container image:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
You can run the unit tests on CPU with the following steps:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# ls
# vllm vllm-ascend
# Use mirror to speedup download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
-v $(pwd):/vllm-project \
-v ~/.cache:/root/.cache \
-ti $IMAGE bash
# (Optional) Configure mirror to speedup download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
# For torch-npu dev version or x86 machine
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
# Install vllm
cd /vllm-project/vllm
VLLM_TARGET_DEVICE=empty python3 -m pip -v install .
# Install vllm-ascend
cd /vllm-project/vllm-ascend
# [IMPORTANT] Export LD_LIBRARY_PATH to enumerate the CANN environment under CPU
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -v .
```
::::
::::{tab-item} Single card
:sync: single
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
::::{tab-item} Multi cards
:sync: multi
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
cd /vllm-workspace/vllm-ascend/
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
:::::
## Running tests
### Unit test
There are several principles to follow when writing unit tests:
- The test file path should mirror the source file path and use the `test_` prefix, for example: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework; see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests must be runnable on CPU, so you must mock device-related functions on the host (a minimal sketch follows the tab-set below).
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
```bash
# Run unit tests
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
```
::::
::::{tab-item} Single card
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
:::::
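As referenced above, here is a minimal sketch of a CPU-only unit test that mocks out a device-dependent dependency. `my_worker` and `get_npu_device_count` are hypothetical names used purely to illustrate the mocking pattern, not real vllm-ascend APIs.

```python
import unittest
from unittest import mock


def choose_parallel_size() -> int:
    # Imagine this helper lives in vllm_ascend and normally queries torch_npu.
    from my_worker import get_npu_device_count  # hypothetical device-side import
    return max(1, get_npu_device_count())


class TestChooseParallelSize(unittest.TestCase):

    def test_uses_all_visible_npus(self):
        # Replace the device-side module so the test runs on a CPU-only host.
        fake_worker = mock.MagicMock(get_npu_device_count=lambda: 4)
        with mock.patch.dict("sys.modules", {"my_worker": fake_worker}):
            self.assertEqual(choose_parallel_size(), 4)


if __name__ == "__main__":
    unittest.main()
```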
### E2E test
Although the vllm-ascend CI provides [e2e tests](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can also run them locally.
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:sync: cpu
You can't run e2e test on CPU.
::::
::::{tab-item} Single card
:selected:
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all multi-card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_batchsize.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```
::::
:::::
This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
#### E2E test example:
- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
The CI resources are limited, so you might need to reduce the number of layers of the model. Below is an example of how to generate a reduced-layer model:
1. Fork the original model repo on ModelScope; we need all the files in the repo except the weights.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`
3. Copy the following python script as `generate_random_weight.py`. Set the relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
```python
import torch
from transformers import AutoTokenizer, AutoConfig
from modeling_deepseek import DeepseekV3ForCausalLM
from modelscope import snapshot_download
MODEL_LOCAL_PATH = "~/.cache/modelscope/models/vllm-ascend/DeepSeek-V3-Pruning"
DIST_DTYPE = torch.bfloat16
DIST_MODEL_PATH = "./random_deepseek_v3_with_2_hidden_layer"
config = AutoConfig.from_pretrained(MODEL_LOCAL_PATH, trust_remote_code=True)
model = DeepseekV3ForCausalLM(config)
model = model.to(DIST_DTYPE)
model.save_pretrained(DIST_MODEL_PATH)
```
### Run doctest
vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to run all doctests in the doc files.
The doctests are a good way to make sure the docs are up to date and the examples are executable; you can run them locally as follows:
```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```
This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).

View File

@ -1,17 +1,10 @@
# Evaluation
# Accuracy
:::{toctree}
:caption: Accuracy
:maxdepth: 1
using_lm_eval
using_opencompass
using_evalscope
accuracy_report/index
:::
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::

View File

@ -0,0 +1,9 @@
# Feature Guide
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
patch
:::

View File

@ -0,0 +1,82 @@
# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to address the change for vLLM.
## Principle
We should keep in mind that patching is not the best way to make vLLM Ascend compatible. It's just a temporary solution. The best way is to contribute the change to vLLM itself so that it is compatible with vLLM Ascend natively. In vLLM Ascend, we have the following basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's currently the only way.
2. Once a patch is added, a future plan for removing it must be described.
3. Cleaning up patch code is welcome at any time.
## How it works
In `vllm_ascend/patch`, you can see the code structure as follows:
```
vllm_ascend
├── patch
│ ├── platform
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
│ ├── worker
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
└───────────
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
In both the **platform** and **worker** folders, there are several patch modules. They are used for patching different versions of vLLM (see the sketch after this list):
- `patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version is always the latest released version of vLLM. Once a new vLLM version is released, we will drop this patch module and bump to the new version.
- `patch_main`: This module is used for patching the code in the vLLM main branch.
- `patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM main branch.
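For illustration, version-based selection of these modules could look like the following sketch. The `vllm_version_is` helper name is an assumption used only to show the idea; check the actual `__init__.py` files for how the dispatch is really done.

```python
# Hypothetical sketch of how a platform __init__.py might pick the right patch set.
from vllm_ascend.utils import vllm_version_is  # assumed helper; adjust to the real one

if vllm_version_is("0.9.2"):
    import vllm_ascend.patch.platform.patch_0_9_2  # noqa: F401
else:
    import vllm_ascend.patch.platform.patch_main  # noqa: F401

# patch_common applies to both the released version and the main branch.
import vllm_ascend.patch.platform.patch_common  # noqa: F401
```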
## How to write a patch
Before writing a patch, following the principles above, we should patch as little code as possible. If necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.2 and main of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm
def patch_destroy_model_parallel():
# your patch code
...
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
# How
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
# Future Plan:
# <Describe the future plan to remove the patch>
```
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)
## Limitation
1. In the V1 Engine, vLLM starts three kinds of processes: the main process, the EngineCore process and the Worker process. Currently vLLM Ascend only supports patching code in the main process and the Worker process by default. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running an edited vLLM code base, the version of vLLM may change automatically. For example, if you run an edited vLLM based on v0.9.n, the version may change to v0.9.nxxx; in this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish which vLLM version you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, so that the patch for that version works.

View File

@ -0,0 +1,258 @@
# Adding a New Model
This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to
[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementation as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please checkout vllm's implementations for other kinds of model):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
- `*DecoderLayer` (transformer block)
- `*Attention` and `*MLP` (specific computation unit)
:::{note}
`*` denotes your model's unique identifier.
:::
3. Critical Implementation Details:
All modules must include a `prefix` argument in `__init__()`.
**Required interfaces:**
| Module Type | Required Methods |
| :------------------- | :---------------------------------------- |
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention Backend Integration:
Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor Parallelism:
Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
```python
from collections.abc import Iterable
from typing import Optional, Union
import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata
class CustomAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement attention logic
...
class CustomDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement decoder layer
...
class CustomModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList([
CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
])
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
class CustomModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def compute_logits(self,
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata) -> torch.Tensor:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
```
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. Below we demonstrate an example that inherits from the base class and implements a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional, Union
import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors
class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
# Define merged weights for quantization/efficiency
packed_modules_mapping = {
"gate_up_proj": ["gate_proj", "up_proj"],
"experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: Optional[List[torch.Tensor]] = None,
attn_metadata: Optional[AttentionMetadata] = None,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
# Custom forward logic
hidden_states = self.model(
input_ids,
positions,
kv_caches,
attn_metadata,
intermediate_tensors,
inputs_embeds
)
return hidden_states
```
:::{note}
For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
To integrate your implemented model from `vllm_ascend/models/` directory:
1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
2. Register the model wrapper class via `vllm.ModelRegistry.register_model()` function.
**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):
```python
from vllm import ModelRegistry
def register_model():
from .custom_model import CustomModelForCausalLM # New custom model
from .deepseek_v2 import CustomDeepseekV2ForCausalLM # Customized DeepSeek
# For NEW architectures: Register with unique name
ModelRegistry.register_model(
"CustomModelForCausalLM", # Must match config.json's 'architectures'
"vllm_ascend.models.custom_model:CustomModelForCausalLM"
)
# For MODIFIED architectures: Use original name
ModelRegistry.register_model(
"DeepseekV2ForCausalLM", # Original architecture identifier in vLLM
"vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM "
)
```
:::{note}
The first argument of `vllm.ModelRegistry.register_model()` indicates the unique architecture identifier which must match `architectures` in `config.json` of the model.
```json
{
"architectures": [
"CustomModelForCausalLM"
],
}
```
:::
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architecture
If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), when executing vllm offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/model_executor/models/registry.py`.
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architecture
If you're registering a novel model architecture not present in vllm (creating a completely new class), the current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vllm.
## Step 4: Testing
After adding a new model, we should do basic functional test (offline/online inference), accuracy test and performance benchmark for the model.
Find more details at:
- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)
## Step 5: Updating Supported Models Doc
At last, if all the steps above are completed, you should add the new model into our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.

View File

@ -0,0 +1,3 @@
# Adding a New Multi-Modal Model
**_Coming soon ..._**

View File

@ -0,0 +1,10 @@
# Modeling
This section provides tutorials on how to implement and register a new model in vllm-ascend.
:::{toctree}
:caption: Modeling
:maxdepth: 1
adding_a_new_model
adding_a_new_multimodal_model
:::

View File

@ -0,0 +1,8 @@
# Performance
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::

View File

@ -9,6 +9,11 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages (a usage sketch follows the command below).
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
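For custom observation points, usage of the two APIs above could look roughly like the sketch below. It assumes `capture_async` acts as a context manager keyed by a stage tag and that `pop_captured_sync` returns the collected durations; the import path, the stage functions and the exact signatures are assumptions to verify against the `ProfileExecuteDuration` implementation.

```python
# Rough sketch only; the exact API surface may differ.
import time

from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path


def prepare_inputs():       # hypothetical stage stand-in
    time.sleep(0.01)


def run_model_forward():    # hypothetical stage stand-in
    time.sleep(0.02)


profiler = ProfileExecuteDuration()

with profiler.capture_async("prepare input"):
    prepare_inputs()
with profiler.capture_async("model forward"):
    run_model_forward()

# Later, at a convenient point (e.g. once per decoding step), drain and print.
durations = profiler.pop_captured_sync()
for tag, cost in durations.items():
    print(f"{tag}: {cost}")
```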
## Example Output
```

View File

@ -3,19 +3,19 @@
## Version Specific FAQs
- [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- [[v0.9.0rc2] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1115)
- [[v0.9.2rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1742)
## General FAQs
### 1. What devices are currently supported?
Currently, **ONLY** Atlas A2 series (Ascend-cann-kernels-910b) and Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
- Atlas 300I Inference series (Atlas 300I Duo)
Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
@ -35,7 +35,7 @@ docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
### 3. What models does vllm-ascend support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
### 4. How to get in touch with our community?
@ -48,7 +48,7 @@ There are many channels that you can communicate with our community developers /
### 5. What features does vllm-ascend V1 support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
@ -69,7 +69,7 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 7. How does vllm-ascend perform?
Currently, only some models are well optimized, such as `Qwen2.5 VL`, `Qwen3` and `Deepseek V3`; others are not good enough yet. From 0.9.0rc2, Qwen and Deepseek work with graph mode to deliver good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up inference as well.
### 8. How does vllm-ascend work with vllm?
vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
@ -84,9 +84,9 @@ Currently, w8a8 quantization is already supported by vllm-ascend originally on v
### 11. How to run w8a8 DeepSeek model?
Please follow the [inferencing tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 12. There is no output in the log when loading models using vllm-ascend. How to solve it?
If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally yourself. Otherwise, please file an issue.
@ -94,9 +94,9 @@ If you're using vllm 0.7.3 version, this is a known progress bar display issue i
vllm-ascend is tested by functional test, performance test and accuracy test.
- **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit testson vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/suppoted_features.html) via e2e test
- **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit testson vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmark which can easily to re-route locally, we'll publish a perf website like [vllm](https://simon-mo-workspace.observablehq.cloud/vllm-dashboard-v0/perf) does to show the performance test results for each pull request
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarks which can easily be reproduced locally; we'll publish a perf website to show the performance test results for each pull request
- **Accuracy test**: we're working on adding accuracy test to CI as well.
@ -114,7 +114,7 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
### 15. Failed to enable NPU graph mode when running DeepSeek?
### 16. Failed to enable NPU graph mode when running DeepSeek?
You may encounter the following error if running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the allowed number of queries per KV head is limited to {32, 64, 128}. **Thus DeepSeek-V2-Lite is not supported**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, num_heads / num_kv_heads is in {32, 64, 128}; a quick check is sketched after the error example below.
@ -123,3 +123,47 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
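As a rough pre-check before enabling graph mode, you can verify the ratio yourself. The head counts below are placeholders, not authoritative values; read them from your model's `config.json` (a sketch, not an official tool):
```bash
num_attention_heads=128   # e.g. read from your model's config.json
num_kv_heads=1            # commonly 1 for MLA-style attention; verify for your model
tp_size=4                 # your --tensor-parallel-size
ratio=$(( num_attention_heads / tp_size / num_kv_heads ))
case "$ratio" in
  32|64|128) echo "OK: $ratio queries per KV head is supported by NPU graph mode" ;;
  *)         echo "Unsupported: $ratio is not in {32, 64, 128}" ;;
esac
```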
### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
You may encounter a C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to install with `python setup.py install`, or to run `python setup.py clean` first to clear the build cache.
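A minimal sketch, assuming you are inside your local clone of the vllm-ascend source tree:
```bash
python setup.py clean     # clear cached build artifacts left by the previous install
python setup.py install   # rebuild and install from source
```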
### 18. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output determinism:
1. Sampling method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following environment variables; a quick verification sketch follows the code block below:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=1
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
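To verify the setup, one simple (unofficial) check is to save the greedy-sampling example above as `example.py`, run it twice with the variables exported, and compare the outputs:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=1
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
python example.py > run1.txt
python example.py > run2.txt
diff run1.txt run2.txt && echo "Outputs are identical across runs"
```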
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package to make sure all dependencies are met: `pip install qwen-omni-utils`.
This package pulls in `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing works correctly.
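For example (a minimal sketch; the import is only a sanity check that the audio dependency is now available):
```bash
pip install qwen-omni-utils
python -c "import librosa; print(librosa.__version__)"
```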

View File

@ -43,11 +43,9 @@ faqs
:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/suppoted_features
user_guide/supported_models
user_guide/env_vars
user_guide/additional_config
user_guide/graph_mode.md
user_guide/support_matrix/index
user_guide/configuration/index
user_guide/feature_guide/index
user_guide/release_notes
:::
@ -55,9 +53,11 @@ user_guide/release_notes
:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/contributing
developer_guide/versioning_policy
developer_guide/contribution/index
developer_guide/feature_guide/index
developer_guide/evaluation/index
developer_guide/performance/index
developer_guide/modeling/index
:::
% How to involve vLLM Ascend
@ -66,11 +66,6 @@ developer_guide/evaluation/index
:maxdepth: 1
community/governance
community/contributors
:::
% User stories about vLLM Ascend project
:::{toctree}
:caption: User Story
:maxdepth: 1
user_stories/index
community/versioning_policy
community/user_stories/index
:::

View File

@ -9,11 +9,11 @@ This document describes how to install vllm-ascend manually.
- A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
- Software:
| Software | Supported version | Note |
|-----------|-------------------|----------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1 | Required for vllm-ascend |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1.post1.dev20250619 | Required for vllm-ascend. No need to install manually; it will be installed automatically in the steps below |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
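As a quick, informal check of the Python-side requirements above (a sketch; the attribute names are the usual ones for these packages and may differ per build):
```bash
python -c "import torch, torch_npu; print('torch:', torch.__version__); print('torch-npu:', torch_npu.__version__)"
```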
You have two ways to install:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
@ -78,17 +78,17 @@ source vllm-ascend-env/bin/activate
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
# Download and install the CANN package.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run --full
source /usr/local/Ascend/ascend-toolkit/set_env.sh
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run --install
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run --install
@ -116,17 +116,23 @@ Once it's done, you can start to set up `vllm` and `vllm-ascend`.
:selected:
:sync: pip
First install system dependencies:
First install the system dependencies and configure the pip mirror:
```bash
apt update -y
apt install -y gcc g++ cmake libnuma-dev wget git
# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
**[Optional]** Config the extra-index of `pip` if you are working on a **x86** machine, so that the torch with cpu could be found:
**[Optional]** Then configure the extra index of `pip` if you are working on an x86 machine or using the torch-npu dev version:
```bash
pip config set global.extra-index-url https://download.pytorch.org/whl/cpu/
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
```
Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:
@ -245,7 +251,8 @@ for output in outputs:
Then run:
```bash
# export VLLM_USE_MODELSCOPE=true to speed up download if huggingface is not reachable.
# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
# to speed up download if huggingface is not reachable.
python example.py
```

View File

@ -32,6 +32,8 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
apt-get update -y && apt-get install -y curl
```
::::
@ -58,6 +60,8 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
yum update -y && yum install -y curl
```
::::
:::::

View File

@ -5,7 +5,12 @@
:maxdepth: 1
single_npu
single_npu_multimodal
single_npu_audio
single_npu_qwen3_embedding
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
multi_node
:::

View File

@ -1,11 +1,19 @@
# Multi-Node (DeepSeek)
# Multi-Node-DP (DeepSeek)
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on multinode**
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division lets attention layers be replicated across Data Parallel (DP) ranks so that they process different batches independently, while expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism (a group of size DP x TP), maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent: forward passes must be aligned, and the expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
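As a quick illustration of this sizing, the deployment below uses 2 nodes with 8 NPUs each; the DP/TP values mirror the serve commands later in this guide and are this guide's example, not requirements:
```bash
DP=4   # --data-parallel-size (2 local DP ranks per node)
TP=4   # --tensor-parallel-size (4 NPUs per DP rank)
echo "NPUs required in total: $((DP * TP))"            # 16, i.e. 8 per node
echo "EP/TP group size per forward pass: $((DP * TP))"
```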
## Verify Multi-Node Communication Environment
@ -45,24 +53,20 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
## Run with docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across both nodes.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
Below is the example container setup command, which should be executed on **all nodes** :
```shell
# Define the image and container name
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
@ -75,121 +79,120 @@ docker run --rm \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
Below are the commands for the head and worker nodes:
**Head node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
Before launching the inference server, ensure the environment variables required for multi-node communication are set
:::
Run the following scripts on the two nodes respectively.
**node0**
```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head --num-gpus=8
```
**Worker node**:
#!/bin/sh
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::
# local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
:::{tip}
Before starting the Ray cluster, set the `export ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
:::
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
## Start the Online Inference Service on multinode
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--enforce-eager \
--distributed_executor_backend "ray" \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--disable-frontend-multiprocessing \
--port {port_num}
```
:::{note}
Pipeline parallelism currently requires AsyncLLMEngine, hence the `--disable-frontend-multiprocessing` is set.
:::
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--distributed_executor_backend "ray" \
--enforce-eager \
--tensor-parallel-size 16 \
--port {port_num}
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
# If you want to run the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--quantization ascend \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
:::{note}
If you're running DeepSeek V3/R1, please remove the `quantization_config` section from the `config.json` file, since it's not supported by vllm-ascend currently; one way to do this is shown after this note.
:::
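A hedged sketch of that edit, assuming `jq` is available in the container and using the model path from this guide:
```bash
# Back up config.json, then drop the quantization_config section.
cp /root/.cache/ds_v3/config.json /root/.cache/ds_v3/config.json.bak
jq 'del(.quantization_config)' /root/.cache/ds_v3/config.json.bak > /root/.cache/ds_v3/config.json
```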
**node1**
```shell
#!/bin/sh
nic_name="xxx"
local_ip="xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address { node0 ip } \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
![alt text](../assets/multi_node_dp.png)
Once your server is started, you can query the model with input prompts:
```shell
curl -X POST http://127.0.0.1:{prot_num}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
"prompt": "The future of AI is",
"max_tokens": 24
}'
curl http://{ node0 ip:8004 }/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/ds_v3",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-19 17:37:35 metrics.py:453 Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```

View File

@ -0,0 +1,235 @@
# Multi-NPU (Pangu Pro MoE)
## Run vllm-ascend on Multi-NPU
Run container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve /path/to/pangu-pro-moe-model \
--tensor-parallel-size 4 \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts:
:::::{tab-set}
::::{tab-item} v1/completions
```{code-block} bash
:substitutions:
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} v1/chat/completions
```{code-block} bash
:substitutions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": ""},
{"role": "user", "content": "你是谁?"}
],
"max_tokens": "64",
"top_p": "0.95",
"top_k": "50",
"temperature": "0.6",
"add_special_tokens" : true
}'
```
::::
:::::
If you run this successfully, you can see the info shown below:
```json
{"id":"cmpl-2cd4223228ab4be9a91f65b882e65b32","object":"text_completion","created":1751255067,"model":"/root/.cache/pangu-pro-moe-model","choices":[{"index":0,"text":" [unused16] 好的用户问我是谁我需要根据之前的设定来回答。用户提到我是华为开发的“盘古Reasoner”属于盘古大模型系列作为智能助手帮助解答问题和提供 信息支持。现在用户再次询问,可能是在确认我的身份或者测试我的回答是否一致。\n\n首先我要确保","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":15,"total_tokens":79,"completion_tokens":64,"prompt_tokens_details":null},"kv_transfer_params":null}
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
additional_config={
'torchair_graph_config': {
'enabled': True,
},
'ascend_scheduler_config':{
'enabled': True,
'enable_chunked_prefill' : False,
'chunked_prefill_enabled': False
},
})
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

View File

@ -1,6 +1,6 @@
# Multi-NPU (QwQ 32B W8A8)
## Run docker container:
## Run docker container
:::{note}
w8a8 quantization feature is supported by v0.8.4rc2 or higher
:::

View File

@ -0,0 +1,109 @@
# Multi-NPU (Qwen3-30B-A3B)
## Run vllm-ascend on Multi-NPU with Qwen3 MoE
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
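If your cards have 64GB of memory, a two-card launch should also work; the following is a sketch derived from the command above, not a separately validated configuration:
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 2 --enable_expert_parallel
```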
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-30B-A3B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 4096
}'
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/Qwen3-30B-A3B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096,
enable_expert_parallel=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lucy. I'm from the UK and I'm 11 years old."
Prompt: 'The future of AI is', Generated text: ' a topic that has captured the imagination of scientists, philosophers, and the general public'
```

View File

@ -0,0 +1,330 @@
# Single Node (Atlas 300I series)
```{note}
Support for the Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage and performance.
```
## Run vLLM on Atlas 300I series
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-310p
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on NPU
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen3-0.6B \
--tensor-parallel-size 1 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen/Qwen2.5-7B-Instruct
:sync: qwen7b
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 2 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve /home/pangu-pro-moe-mode/ \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--dtype "float16" \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts
```bash
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
:::::
If you run this script successfully, you can see the results.
### Offline Inference
Run the following script (`example.py`) to execute offline inference on NPU:
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen3-0.6B",
tensor_parallel_size=1,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Qwen2.5-7B-Instruct
:sync: qwen7b
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=2,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/home/pangu-pro-moe-mode/", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # Using the official chat template is recommended
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/pangu-pro-moe-mode/",
tensor_parallel_size=8,
distributed_executor_backend="mp",
enable_expert_parallel=True,
dtype="float16",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
Run script:
```bash
python example.py
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lina. I'm a 22-year-old student from China. I'm interested in studying in the US. I'm looking for a job in the US. I want to know if there are any opportunities in the US for me to work. I'm also interested in the culture and lifestyle in the US. I want to know if there are any opportunities for me to work in the US. I'm also interested in the culture and lifestyle in the US. I'm interested in the culture"
Prompt: 'The future of AI is', Generated text: " not just about the technology itself, but about how we use it to solve real-world problems. As AI continues to evolve, it's important to consider the ethical implications of its use. AI has the potential to bring about significant changes in society, but it also has the power to create new challenges. Therefore, it's crucial to develop a comprehensive approach to AI that takes into account both the benefits and the risks associated with its use. This includes addressing issues such as bias, privacy, and accountability."
```

View File

@ -42,7 +42,12 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Run the following script to execute offline inference on a single NPU:
```python
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
@ -50,7 +55,10 @@ prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen/Qwen3-8B", max_model_len=26240)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
@ -58,6 +66,34 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240,
enforce_eager=True
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
If you run this script successfully, you can see the info shown below:
@ -70,9 +106,11 @@ Prompt: 'The future of AI is', Generated text: ' following you. As the technolog
Run docker container to start the vLLM server on a single NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
@ -93,6 +131,33 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::
:::{note}
Add the `--max_model_len` option to avoid a ValueError caused by the model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This limit differs across NPU series based on the HBM size; please modify the value to one suitable for your NPU series.

View File

@ -0,0 +1,122 @@
# Single NPU (Qwen2-Audio 7B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Install packages required for audio processing:
```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```
Run the following script to execute offline inference on a single NPU:
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.utils import FlexibleArgumentParser
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
1: "What is recited in the audio?",
2: "What sport and what nursery rhyme are referenced?"
}
def prepare_inputs(audio_count: int):
audio_in_prompt = "".join([
f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
for idx in range(audio_count)
])
question = question_per_audio_count[audio_count]
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n")
mm_data = {
"audio":
[asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
}
# Merge text prompt and audio data into inputs
inputs = {"prompt": prompt, "multi_modal_data": mm_data}
return inputs
def main(audio_count: int):
# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
# lower-end GPUs.
# Unless specified, these settings have been tested to work on a single L4.
# `limit_mm_per_prompt`: the max num items for each modality per prompt.
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count},
enforce_eager=True)
inputs = prepare_inputs(audio_count)
sampling_params = SamplingParams(temperature=0.2,
max_tokens=64,
stop_token_ids=None)
outputs = llm.generate(inputs, sampling_params=sampling_params)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
if __name__ == "__main__":
audio_count = 2
main(audio_count)
```
If you run this script successfully, you can see the info shown below:
```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```
### Online Serving on Single NPU
Currently, vllm's OpenAI-compatible server doesn't support audio inputs, find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

View File

@ -57,6 +57,7 @@ llm = LLM(
model=MODEL_PATH,
max_model_len=16384,
limit_mm_per_prompt={"image": 10},
enforce_eager=True,
)
sampling_params = SamplingParams(
@ -103,13 +104,11 @@ outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
If you run this script successfully, you can see the info shown below:
```bash
Processed prompts: 100%|███████████████| 1/1 [00:11<00:00, 11.29s/it, est. speed input: 9.48 toks/s, output: 20.55 toks/s]
The image displays a logo consisting of two main elements: a stylized geometric design and a pair of text elements.
1. **Geometric Design**: On the left side of the image, there is a blue geometric design that appears to be made up of interconnected shapes. These shapes resemble a network or a complex polygonal structure, possibly hinting at a technological or interconnected theme. The design is monochromatic and uses only blue as its color, which could be indicative of a specific brand or company.
@ -144,7 +143,11 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --dtype bfloat16 --max_model_len 16384 --max-num-batched-tokens 16384
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--enforce-eager
```
:::{note}

View File

@ -0,0 +1,99 @@
# Single NPU (Qwen3-Embedding-8B)
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Run docker container
Take Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference
```bash
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Embedding-8B",
  "input": "Hello"
}'
```
### Offline Inference
```python
import torch
import vllm
from vllm import LLM
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery:{query}'
if __name__=="__main__":
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'What is the capital of China?'),
get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
model = LLM(model="Qwen/Qwen3-Embedding-8B",
task="embed",
distributed_executor_backend="mp")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
```
If you run this script successfully, you can see the info shown below:
```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
```

View File

@ -1,6 +1,6 @@
# Additional Configuration
addintional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by their own. vLLM Ascend uses this mechanism to make the project more flexible.
Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior on their own. vLLM Ascend uses this mechanism to make the project more flexible.
## How to use
@ -28,9 +28,11 @@ The following table lists the additional configuration options available in vLLM
|-------------------------------| ---- |------|-----------------------------------------------------------------------------------------------|
| `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf case. |
| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used in RLHF or UT/e2e test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
The details of each config option are as follows:
@ -38,12 +40,14 @@ The details of each config option are as follows:
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable torchair graph mode |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported to use torchair graph mode |
| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on DeepSeek moe models. |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
| `use_cached_graph` | bool | `False` | Whether to use cached graph |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
| `enable_multistream_shared_expert`| bool | `False` | Whether to enable multistream shared expert |
| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
**ascend_scheduler_config**
@ -51,26 +55,27 @@ The details of each config option are as follows:
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine|
ascend_scheduler_config also support the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `chunked_prefill_enabled: true` to ascend_scheduler_config as well.
ascend_scheduler_config also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
### Example
A full example of additional configuration is as follows:
An example of additional configuration is as follows:
```
{
"torchair_graph_config": {
"enabled": true,
"use_cached_graph": true,
"enabled": True,
"use_cached_graph": True,
"graph_batch_sizes": [1, 2, 4, 8],
"graph_batch_sizes_init": false,
"enable_multistream_shared_expert": false
"graph_batch_sizes_init": False,
"enable_multistream_moe": False,
"enable_kv_nz": False
},
"ascend_scheduler_config": {
"enabled": true,
"chunked_prefill_enabled": true,
"enabled": True,
"enable_chunked_prefill": True,
},
"expert_tensor_parallel_size": 1,
"refresh": false,
"refresh": False,
}
```

View File

@ -2,7 +2,7 @@
vllm-ascend uses the following environment variables to configure the system:
:::{literalinclude} ../../../vllm_ascend/envs.py
:::{literalinclude} ../../../../vllm_ascend/envs.py
:language: python
:start-after: begin-env-vars-definition
:end-before: end-env-vars-definition

View File

@ -0,0 +1,10 @@
# Configuration Guide
This section provides a detailed configuration guide of vLLM Ascend.
:::{toctree}
:caption: Configuration Guide
:maxdepth: 1
env_vars
additional_config
:::

View File

@ -1,17 +1,18 @@
# Graph Mode Guide
```{note}
This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.
```
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on V1 Engine. And only Qwen, DeepSeek series models are well tested in 0.9.0rc1. We'll make it stable and generalize in the next release.
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on the V1 Engine, and only Qwen and DeepSeek series models are well tested since 0.9.0rc1. We'll make it stable and more general in the next release.
## Getting Started
From v0.9.0rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.0rc1, only Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.0rc1, only DeepSeek series models are supported.
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.
## Using ACLGraph
ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.
@ -23,8 +24,6 @@ import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```
@ -45,19 +44,18 @@ offline example:
import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enable": True}})
# TorchAirGraph only works without chunked prefill now
model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
outputs = model.generate("Hello, how are you?")
```
online example:
```shell
vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enable": true}}'
vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
```
You can find more detail about additional config [here](./additional_config.md)
You can find more detail about additional config [here](../configuration/additional_config.md).
## Fallback to Eager Mode
@ -69,8 +67,6 @@ offline example:
import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```

Binary file not shown (image, 57 KiB)

View File

@ -0,0 +1,13 @@
# Feature Guide
This section provides a detailed usage guide of vLLM Ascend features.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
sleep_mode
structured_output
lora
:::

View File

@ -0,0 +1,8 @@
# LoRA Adapters Guide
Like vLLM, vllm-ascend supports LoRA. The usage and more details can be found in the [vLLM official document](https://docs.vllm.ai/en/latest/features/lora.html).
You can also refer to [this](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
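As a quick start, the snippet below sketches offline inference with a LoRA adapter using the standard vLLM LoRA API; the base model and the adapter path are placeholders:
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch of the standard vLLM LoRA API; model name and adapter path are placeholders.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

outputs = llm.generate(
    "Give me a short introduction to large language models.",
    sampling_params,
    # LoRARequest(name, id, local path of the adapter weights)
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```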
## Tips
If you fail to run vllm-ascend with LoRA, you may follow [this instruction](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html#fallback-to-eager-mode) to disable graph mode and try again.

View File

@ -0,0 +1,106 @@
# Quantization Guide
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.
Since version 0.9.0rc2, quantization is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.
## Install modelslim
To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master branch is available for vLLM Ascend.
Install modelslim:
```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
## Quantize model
Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: just download the model and then execute the convert command shown below. More info can be found in the modelslim doc: [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).
```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```
:::{note}
You can also download the quantized model that we uploaded. Please note that these weights should be used for testing only. For example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
:::
Once the conversion is done, two important files are generated.
1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weight info is recorded in this file.
Here are the full converted model files:
```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```
## Run the model
Now, you can run the quantized models with vLLM Ascend. Here are examples for offline and online inference.
### Offline inference
```python
import torch
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="{quantized_model_save_path}",
max_model_len=2048,
trust_remote_code=True,
# Enable quantization by specifying `quantization="ascend"`
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online inference
```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
## FAQs
### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
First, make sure you specify the `ascend` quantization method. Second, check whether your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` modelslim version. Finally, if it still doesn't work, please submit an issue; maybe some new models need to be adapted.
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim; this version has fixed the missing `configuration_deepseek.py` error.

View File

@ -0,0 +1,115 @@
# Sleep Mode Guide
## Overview
Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
## Getting started
With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM goes through a dedicated memory pool. While loading the model and initializing the KV caches, the memory is tagged as a map: `{"weight": data, "kv_cache": data}`.
The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
- Level 1 Sleep
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.
- Level 2 Sleep
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.
Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you need to follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build vLLM Ascend from source to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
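A minimal sketch of such a source build with custom kernels explicitly enabled; the clone URL is the vllm-ascend repository, and the `pip install -e .` step is assumed to match the installation guide:
```bash
# Only needed on v0.7.3; on v0.9.x+ this is already the default for source builds
export COMPILE_CUSTOM_KERNELS=1

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```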
## Usage
The following is a simple example of how to use sleep mode.
- offline inference:
```python
import os
import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
if __name__ == "__main__":
prompt = "How are you?"
free, total = torch.npu.mem_get_info()
print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
# record npu memory use baseline in case other process is running
used_bytes_baseline = total - free
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(temperature=0, max_tokens=10)
output = llm.generate(prompt, sampling_params)
llm.sleep(level=1)
free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
# now the memory usage should be less than the model weights
# (0.5B model, 1GiB weights)
assert used_bytes < 1 * GiB_bytes
llm.wake_up()
output2 = llm.generate(prompt, sampling_params)
# cmp output
assert output[0].outputs[0].text == output2[0].outputs[0].text
```
- online serving:
:::{note}
Considering there may be a risk of malicious access, please make sure you are in a development environment and explicitly set the development env variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::
```bash
export VLLM_SERVER_DEV_MODE="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE="True"
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
# after serving is up, post to these endpoints
# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
-H "Content-Type: application/json" \
-d '{"level": "1"}'
curl -X GET http://127.0.0.1:8000/is_sleeping
# sleep level 2
curl -X POST http://127.0.0.1:8000/sleep \
-H "Content-Type: application/json" \
-d '{"level": "2"}'
# wake up
curl -X POST http://127.0.0.1:8000/wake_up
# wake up with tag, tags must be in ["weights", "kv_cache"]
curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"
curl -X GET http://127.0.0.1:8000/is_sleeping
# after sleep and wake up, the serving is still available
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```

View File

@ -0,0 +1,163 @@
# Structured Output Guide
## Overview
### What is Structured Output?
LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON - without guidance, it might produce valid text that breaks JSON specification. **Structured Output (also called Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model's output, ensuring compliance with the desired structure.
![structured decoding](./images/structured_output_1.png)
### Structured Output in vllm-ascend
Currently, vllm-ascend supports the **xgrammar** and **guidance** backends for structured output with the vllm v1 engine.
XGrammar introduces a new technique for batched constrained decoding via pushdown automata (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing us to execute multiple state transitions. XGrammar also includes additional optimisations (for those who are interested) to reduce grammar compilation overhead. You can also look into the guidance backend for more details on it.
## How to Use Structured Output?
### Online Inference
You can generate structured outputs using OpenAI's Completions and Chat API. The following parameters are supported, which must be added as extra parameters:
- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context free grammar.
Structured outputs are supported by default in the OpenAI-Compatible Server. You can choose to specify the backend to use by setting the `--guided-decoding-backend` flag of `vllm serve`. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
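For instance, to pin the backend instead of relying on `auto`, the server can be started like this (a sketch; the model name simply matches the examples below):
```bash
vllm serve Qwen/Qwen2.5-3B-Instruct --guided-decoding-backend xgrammar
```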
Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
The next example shows how to use the guided_regex. The idea is to generate an email address, given a simple regex template:
```python
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
}
],
extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:
- Using a JSON Schema.
- Defining a Pydantic model and then extracting the JSON Schema from it.
The next example shows how to use the guided_json parameter with a Pydantic model:
```python
from pydantic import BaseModel
from enum import Enum
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
}
],
extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```
Finally we have the `guided_grammar` option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can use it to define a specific format of simplified SQL queries:
```python
simplified_sql_grammar = """
root ::= select_statement
select_statement ::= "SELECT " column " from " table " where " condition
column ::= "col_1 " | "col_2 "
table ::= "table_1 " | "table_2 "
condition ::= column "= " number
number ::= "1 " | "2 "
"""
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```
Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).
### Offline Inference
To use Structured Output, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
- json
- regex
- choice
- grammar
One example for the usage of the choice parameter is shown below:
```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
guided_decoding_backend="xgrammar")
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
prompts="Classify this sentiment: vLLM is wonderful!",
sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
Find more examples of other usages [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).

View File

@ -1,13 +0,0 @@
## {version}
### Highlights
- {feature}
### Bug fixes
- {bug}
### Other changes
- {change}
### Known issues
- {issue}
### Upgrade Notes
- {upgrade}
### Deprecation Notes
- {deprecation}

View File

@ -1,5 +1,68 @@
# Release note
## v0.9.2rc1 - 2025.07.11
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` anymore. This is also the last version to support the V0 engine; V0 code will be cleaned up in the future.
### Highlights
- Pooling models work with the V1 engine now. You can give it a try with the Qwen3 embedding model [#1359](https://github.com/vllm-project/vllm-ascend/pull/1359).
- The performance on Atlas 300I series has been improved. [#1591](https://github.com/vllm-project/vllm-ascend/pull/1591)
- aclgraph mode works with Moe models now. Currently, only Qwen3 Moe is well tested. [#1381](https://github.com/vllm-project/vllm-ascend/pull/1381)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250619`. Don't forget to update it in your environment. [#1347](https://github.com/vllm-project/vllm-ascend/pull/1347)
- The **GatherV3** error has been fixed with **aclgraph** mode. [#1416](https://github.com/vllm-project/vllm-ascend/pull/1416)
- W8A8 quantization works on Atlas 300I series now. [#1560](https://github.com/vllm-project/vllm-ascend/pull/1560)
- Fix the accuracy problem when deploying models with parallel parameters. [#1678](https://github.com/vllm-project/vllm-ascend/pull/1678)
- The pre-built wheel package now requires a lower version of glibc. Users can install it with `pip install vllm-ascend` directly. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
### Other
- The official doc has been updated for a better reading experience. For example, more deployment tutorials were added and the user/developer docs were updated. More guides are coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
- A new env variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of topk-topp sampling. The default value is `0`; we'll consider enabling it by default in the future. [#1732](https://github.com/vllm-project/vllm-ascend/pull/1732)
- A batch of bugs have been fixed for Data Parallelism case [#1273](https://github.com/vllm-project/vllm-ascend/pull/1273) [#1322](https://github.com/vllm-project/vllm-ascend/pull/1322) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1478](https://github.com/vllm-project/vllm-ascend/pull/1478)
- The DeepSeek performance has been improved. [#1194](https://github.com/vllm-project/vllm-ascend/pull/1194) [#1395](https://github.com/vllm-project/vllm-ascend/pull/1395) [#1380](https://github.com/vllm-project/vllm-ascend/pull/1380)
- Ascend scheduler works with prefix cache now. [#1446](https://github.com/vllm-project/vllm-ascend/pull/1446)
- DeepSeek works with prefix cache now. [#1498](https://github.com/vllm-project/vllm-ascend/pull/1498)
- Support prompt logprobs to recover ceval accuracy in V1 [#1483](https://github.com/vllm-project/vllm-ascend/pull/1483)
## v0.9.1rc1 - 2025.06.22
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.
### Highlights
- Atlas 300I series is experimentally supported in this release. [#1333](https://github.com/vllm-project/vllm-ascend/pull/1333) After careful consideration, this feature **will NOT be included in the v0.9.1-dev branch**, taking into account the v0.9.1 release quality and the rapid iteration of this feature to improve performance on the Atlas 300I series. We will improve this from 0.9.2rc1 onward.
- Support EAGLE-3 for speculative decoding. [#1032](https://github.com/vllm-project/vllm-ascend/pull/1032)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don't forget to update it in your environment. [#1235](https://github.com/vllm-project/vllm-ascend/pull/1235)
- Support Atlas 300I series container image. You can get it from [quay.io](https://quay.io/repository/vllm/vllm-ascend)
- Fix token-wise padding mechanism to make multi-card graph mode work. [#1300](https://github.com/vllm-project/vllm-ascend/pull/1300)
- Upgrade vllm to 0.9.1 [#1165](https://github.com/vllm-project/vllm-ascend/pull/1165)
### Other Improvements
- Initial support of Chunked Prefill for MLA. [#1172](https://github.com/vllm-project/vllm-ascend/pull/1172)
- An example of best practices to run DeepSeek with ETP has been added. [#1101](https://github.com/vllm-project/vllm-ascend/pull/1101)
- Performance improvements for DeepSeek using the TorchAir graph. [#1098](https://github.com/vllm-project/vllm-ascend/pull/1098), [#1131](https://github.com/vllm-project/vllm-ascend/pull/1131)
- Supports the speculative decoding feature with AscendScheduler. [#943](https://github.com/vllm-project/vllm-ascend/pull/943)
- Improve `VocabParallelEmbedding` custom op performance. It will be enabled in the next release. [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- Fixed a device discovery and setup bug when running vLLM Ascend on Ray [#884](https://github.com/vllm-project/vllm-ascend/pull/884)
- DeepSeek with [MC2](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/ascendcbestP/atlas_ascendc_best_practices_10_0043.html) (Merged Compute and Communication) now works properly. [#1268](https://github.com/vllm-project/vllm-ascend/pull/1268)
- Fixed log2phy NoneType bug with static EPLB feature. [#1186](https://github.com/vllm-project/vllm-ascend/pull/1186)
- Improved performance for DeepSeek with DBO enabled. [#997](https://github.com/vllm-project/vllm-ascend/pull/997), [#1135](https://github.com/vllm-project/vllm-ascend/pull/1135)
- Refactoring AscendFusedMoE [#1229](https://github.com/vllm-project/vllm-ascend/pull/1229)
- Add initial user stories page (include LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack) [#1224](https://github.com/vllm-project/vllm-ascend/pull/1224)
- Add unit test framework [#1201](https://github.com/vllm-project/vllm-ascend/pull/1201)
### Known Issues
- In some cases, the vLLM process may crash with a **GatherV3** error when **aclgraph** is enabled. We are working on this issue and will fix it in the next release. [#1038](https://github.com/vllm-project/vllm-ascend/issues/1038)
- The prefix cache feature does not work with the Ascend Scheduler when chunked prefill is disabled. This will be fixed in the next release. [#1350](https://github.com/vllm-project/vllm-ascend/issues/1350)
### Full Changelog
https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1
## v0.9.0rc2 - 2025.06.10
This release contains some quick fixes for v0.9.0rc1. Please use this release instead of v0.9.0rc1.
@ -14,14 +77,14 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
### Highlights
- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
- Qwen series models work with graph mode now. It works by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
### Core
- The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. [#814](https://github.com/vllm-project/vllm-ascend/pull/814)
- LoRA, Multi-LoRA and dynamic serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. [#893](https://github.com/vllm-project/vllm-ascend/pull/893)
- prefix cache and chunked prefill feature works now [#782](https://github.com/vllm-project/vllm-ascend/pull/782) [#844](https://github.com/vllm-project/vllm-ascend/pull/844)
- Prefix cache and chunked prefill feature works now [#782](https://github.com/vllm-project/vllm-ascend/pull/782) [#844](https://github.com/vllm-project/vllm-ascend/pull/844)
- Spec decode and MTP features work with V1 Engine now. [#874](https://github.com/vllm-project/vllm-ascend/pull/874) [#890](https://github.com/vllm-project/vllm-ascend/pull/890)
- DP feature works with DeepSeek now. [#1012](https://github.com/vllm-project/vllm-ascend/pull/1012)
- Input embedding feature works with V0 Engine now. [#916](https://github.com/vllm-project/vllm-ascend/pull/916)
@ -95,7 +158,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
## v0.8.5rc1 - 2025.05.06
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend [here](https://vllm-ascend.readthedocs.io/en/latest/user_guide/suppoted_features.html).
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend [here](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
### Highlights
- Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (`--enable_prefix_caching`) when V1 is enabled [#747](https://github.com/vllm-project/vllm-ascend/pull/747)
@ -138,7 +201,7 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
## v0.8.4rc1 - 2025.04.18
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the [official documentation](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/versioning_policy.html#release-window).
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the [official documentation](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html#release-window).
### Highlights

View File

@ -0,0 +1,10 @@
# Features and models
This section provides a detailed support matrix for vLLM Ascend.
:::{toctree}
:caption: Support Matrix
:maxdepth: 1
supported_models
supported_features
:::

View File

@ -1,4 +1,4 @@
# Supported Models
# Model Support
## Text-only Language Models

View File

@ -1,15 +0,0 @@
# xxx project uses Ascend vLLM, gaining a 200% inference performance enhancement.
## About / Introduction
Draft content
## The Business Challenge
Our goal is to ...
## Solving challenges with vLLM Ascend
vLLM Ascend helped us ...
## Benefits using vLLM Ascend
## Learn more
more info about this case

Some files were not shown because too many files have changed in this diff.