Compare commits

...

508 Commits
v0.7.3 ... main

Author SHA1 Message Date
JohnJan 54f2b31184
[Doc] Add a doc for qwen omni (#1867)
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>

### What this PR does / why we need it?
Add an FAQ note for Qwen Omni.
Fixes issue 1 in https://github.com/vllm-project/vllm-ascend/issues/1760

- vLLM version: v0.9.2
- vLLM main:
b9a21e9173
2025-07-20 09:05:41 +08:00
wangxiyuan 2b726d8f90
[CI] Fix broken CI (#1889)
1. vLLM commit 45badd05d0 changed the pooling check logic, which broke vLLM Ascend.
2. vLLM commit 3e04107d97 requires a higher version of transformers. The transformers version bug has been fixed by e936e401de, so it is now safe to remove the version limit.
3. vLLM commit 217937221b added a new input `enable_eplb` for the FusedMoE ops.

This PR fixes the broken CI.


- vLLM version: v0.9.2
- vLLM main:
6a971ed692

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-20 02:11:57 +08:00
leo-pony 2ee90461d0
Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix e2e data parallel test: add resource release code and give the engines more time
to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
xleoken b824525be3
Move deepseek_v3 from deepseek_v2.py (#1793)
### What this PR does / why we need it?
Before this patch, we can see
`vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM`, which is
not a friendly format.

```
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
```


### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.


- vLLM version: v0.9.2
- vLLM main:
bcdfb2a330

Signed-off-by: xleoken <xleoken@163.com>
2025-07-19 11:37:03 +08:00
Shanshan Shen ab68d31a24
[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
5895afd780

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-19 09:42:32 +08:00
lianyibo 53d2ea3789
[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811)
### What this PR does / why we need it?

maybe fixes
[#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test Qwen3-32B tp=4 with: 

```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests
```

Requests: batch_size=128, input/output tokens=1024

**In 0.9.2rc1**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 785.1395
Total TPS without prefill                   : 846.6809
Mean TPS with    prefill                    : 6.1339
Mean TPS without prefill                    : 6.6147
=====================================================
Mean TTFT(ms)                               : 10307.8123
Max  TTFT(ms)                               : 21423.0733
Min  TTFT(ms)                               : 362.3602
=====================================================
Mean TPOT(ms)                               : 151.3051
Max  TPOT(ms)                               : 159.4649
Min  TPOT(ms)                               : 140.899
=====================================================
Total Time(s)                               : 175.6032
Request Throughput(requests/s)              : 0.7289
=====================================================
```

**Apply this PR**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 811.0014
Total TPS without prefill                   : 876.4423
Mean TPS with    prefill                    : 6.3359
Mean TPS without prefill                    : 6.8472
=====================================================
Mean TTFT(ms)                               : 10263.8382
Max  TTFT(ms)                               : 21151.2547
Min  TTFT(ms)                               : 375.9136
=====================================================
Mean TPOT(ms)                               : 146.1686
Max  TPOT(ms)                               : 154.0957
Min  TPOT(ms)                               : 136.8879
=====================================================
Total Time(s)                               : 169.8579
Request Throughput(requests/s)              : 0.7536
=====================================================
```

The TPOT performance gap between these two sets of data is about 3%.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
2025-07-18 23:09:54 +08:00
Mengqing Cao 574fe407eb
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841)
### What this PR does / why we need it?
We'll refactor `CustomOp` in vllm-ascend starting from this PR.

Use the function `CustomOp.register_oot` to achieve the custom op registration,
taking `AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```

This is a quick adaptation to the `CustomOp.register_oot` mechanism from vLLM
0.9.2. As a further step, we can remove the inheritance from `QuickGELU` and
write our own `QuickGELU` entirely (see the sketch below).
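
A rough sketch of what such a standalone implementation could look like (a sketch only; the `forward_oot` hook follows vLLM's `CustomOp` convention and is an assumption here, not code from this PR):

```python
import torch
from vllm.model_executor.custom_op import CustomOp


class AscendQuickGELU(CustomOp):
    """Standalone QuickGELU that no longer inherits from vLLM's QuickGELU (sketch)."""

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # QuickGELU approximation: x * sigmoid(1.702 * x)
        return x * torch.sigmoid(1.702 * x)


# Register it so vLLM resolves "QuickGELU" to the Ascend implementation.
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```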

Part of https://github.com/vllm-project/vllm-ascend/pull/1647



- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-18 23:07:14 +08:00
Shanshan Shen 8a91e6e59c
[Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
li chaoran 3e39d7234c
[CI] Switching to infra cache server to reduce network pressure (#1792)
### What this PR does / why we need it?
This PR introduces the infra cache server to speed up apt/pip package
installation.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Tested locally; with this config, network bandwidth usage dropped from 100%
to 5% when a new PR was submitted.
<img width="807" height="334" alt="image"
src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c"
/>


- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-07-18 18:39:25 +08:00
Shanshan Shen d08ff304cd
[Misc][V0 Deprecation] Remove V0 Attention (#1835)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 14:10:13 +08:00
xudongLi-cmss 33ef5dc813
add unit test for func wrapper (#1863)
### What this PR does / why we need it?
Add unit tests for the func wrapper file.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
2025-07-18 11:05:17 +08:00
Li Wang f9dfde02fd
[Bugfix] Fix broken CI (#1848)
### What this PR does / why we need it?
- Fix broken commit by
[#20927](https://github.com/vllm-project/vllm/pull/20927)
- Fix broken commit by
[#20466](https://github.com/vllm-project/vllm/pull/20466)
- TODO: adapt more fully to the upstream refactoring; for now, let's
make CI happy

- vLLM version: v0.9.2
- vLLM main:
11dfdf21bf

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-17 20:10:12 +08:00
Zhu Yi Lin 538dd357e6
Add graph mode and improve on multi_npu_moge.md (#1849)
### What this PR does / why we need it?
Add graph mode and improve on multi_npu_moge.md

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI passed with existing tests.


- vLLM version: v0.9.2
- vLLM main:
5a7fb3ab9e

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-07-17 17:53:37 +08:00
Shanshan Shen aeb5aa8b88
[Misc][V0 Deprecation] Add `__main__` guard to all offline examples (#1837)
### What this PR does / why we need it?
Add `__main__` guard to all offline examples.
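
For reference, the guard applied to each offline example follows the standard pattern (a minimal sketch; the `main()` wrapper name is illustrative):

```python
def main() -> None:
    # Build the engine and run the offline example here.
    ...


if __name__ == "__main__":
    # Prevents the example body from re-running when worker processes
    # import this module (e.g. under multiprocessing spawn).
    main()
```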

- vLLM version: v0.9.2
- vLLM main:
76b494444f

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 14:13:30 +08:00
Shanshan Shen 19e37cd379
[Misc] Add `fusion_result.json` to `.gitignore` (#1836)
### What this PR does / why we need it?
Add `fusion_result.json` to `.gitignore`.



- vLLM version: v0.9.2
- vLLM main:
72ad273582

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 11:54:49 +08:00
Icey 875a920d4a
[Platform] Add support for Altlas A3 series (#1794)
### What this PR does / why we need it?
Add support for Ascend A3 and remove latest tag

### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas A3 series

### How was this patch tested?
CI passed with:

- remove latest tag test:
https://github.com/wxsIcey/wxs-vllm-ascend/actions/runs/16267635040/job/45926924765
- E2E image build for A3
- CI test on A3 with e2e test and longterm test
- Unit tests are missing because real A3 hardware is required for testing

Closes: https://github.com/vllm-project/vllm-ascend/issues/1696


- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-17 11:13:02 +08:00
wangxiyuan ef99fe1c54
[Test] Clean up duplicate test for ascend scheduler (#1819)
There are some duplicate tests for the Ascend scheduler. This PR removes them
to make the tests clearer.

After this PR, the single-card e2e time is reduced from 47 min to 46 min.

- vLLM version: v0.9.2
- vLLM main:
1eb2b9c102

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-16 17:57:48 +08:00
Shanshan Shen c66b0827a7
[Misc][V0 Deprecation] Remove Pooling Model Runner (#1824)
### What this PR does / why we need it?
Remove pooling model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
d31a647124

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 17:48:21 +08:00
Yikun Jiang ba7e934b21
Remove redundant empty lines in commit msg (#1814)
### What this PR does / why we need it?
Remove redundant empty lines in commit msg

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
test locally: https://github.com/Yikun/vllm-ascend/pull/48

- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-16 16:50:44 +08:00
Shanshan Shen 06655002c5
[Misc][V0 Deprecation] Remove V0 Worker (#1821)
### What this PR does / why we need it?
Remove V0 worker.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
6cbc4d4bea

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:07:17 +08:00
Shanshan Shen b005def0a5
[Misc][V0 Deprecation] Remove Multi-Step Model Runner (#1820)
### What this PR does / why we need it?
Remove multi-step model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.



- vLLM version: v0.9.2
- vLLM main:
34cda778a0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:06:49 +08:00
Shanshan Shen f9e2e9bb31
[Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen f96100fad5
[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
Shanshan Shen a929699e98
[Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
wangxiyuan bf2549856f
[CI] Fix changes CI to recover codecov (#1799)
Add the `checkout` action before `dorny/paths-filter` to make it work in the
`push` case.
It is a known issue that `dorny/paths-filter` works without `checkout`
for `pull_request` events but fails for `push` events. More detail is here:
https://github.com/dorny/paths-filter/issues/60#issuecomment-1464281021

The push CI works after this PR. The test result is here:

https://github.com/wangxiyuan/vllm-ascend/actions/runs/16285606468/job/45983607539
- vLLM version: v0.9.2
- vLLM main:
d4d309409f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 15:01:13 +08:00
wangxiyuan 787010a637
[Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, so there is no need to set it by hand now. This PR
removes the useless setting from examples and tests.

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
wangxiyuan eb921d2b6f
[Doc] Fix 404 error (#1797)
Fix URL 404 errors in the docs
- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:38 +08:00
wangxiyuan 7bdada58eb
[Misc] Remove VLLM_USE_V1 usage in code (#1764)
We plan to remove V0 code from this version. The first step is to delete
v0 usage.

Related: https://github.com/vllm-project/vllm-ascend/issues/1620

- vLLM version: v0.9.2
- vLLM main:
61e20828da

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:16 +08:00
wangxiyuan 494b0f474f
[CI]Fix broken CI (#1773)
This PR fixes the broken CI. It requires
https://github.com/vllm-project/vllm/pull/20900 to be merged first.

- vLLM version: v0.9.2
- vLLM main:
e8cc53af5e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 00:54:20 +08:00
Li Wang afcfe91dfa
[Doc] Fix multi node doc (#1783)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Pin docker image to latest release
### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
1e9438e0b0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-14 17:56:57 +08:00
zhangxinyuehfad cabfb2bc31
[Test] Resolve vllm-ascend version accuracy test (#1769)
### What this PR does / why we need it?
Resolve vllm-ascend version for accuracy test

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-14 15:43:37 +08:00
Shanshan Shen d3c6dd985a
[Misc] Add `include` dir to `.gitignore` (#1771)
### What this PR does / why we need it?
Add `include` dir to `.gitignore`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-14 12:05:29 +08:00
Li Wang 9cd4ac76a1
[CI] Remove benchmark patch and increase the scheduler frequency (#1762)
### What this PR does / why we need it?
This PR does the following things:
1. Remove the `benchmark_datasets.py` patch
2. Increase the scheduler frequency to 2 times per day; due to the recent
large number of daily submissions, we need to increase the default test time (6h)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
247102f07f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-13 20:00:35 +08:00
Yikun Jiang d118bf8a26
Update README.zh.md to fix typo (#1758)
### What this PR does / why we need it?


Update README.zh.md to fix typo

### Does this PR introduce _any_ user-facing change?


No

### How was this patch tested?


CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 14:01:34 +08:00
Yikun Jiang eff4b5791c
Recover offline_inference_npu.py to make doctest passed (#1756)
### What this PR does / why we need it?
Rename offline_inference_npu_v1.py to offline_inference_npu.py to
recover doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
a8593237c0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:36:35 +08:00
Yikun Jiang 8b3a483269
Add recommend version and refresh readme / contribution.md (#1757)
### What this PR does / why we need it?
Add recommended version and contribution.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
890323dc1b

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:35:40 +08:00
wangxiyuan 3c404de1b1
[Release]Update release note (#1753)
There are still issues with PP in some cases, such as aclgraph and Ray. Remove
the related doc from the release note.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:58:26 +08:00
wangxiyuan b5b7e0ecc7
[Doc] Add qwen3 embedding 8b guide (#1734)
1. Add the tutorial for qwen3-embedding-8b
2. Remove VLLM_USE_V1=1 from the docs; it is no longer needed as of 0.9.2


- vLLM version: v0.9.2
- vLLM main:
5923ab9524

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:40:17 +08:00
wangxiyuan 9c560b009a
[Release] Add 0.9.2rc1 release note (#1725)
Add the release note for 0.9.2rc1; we'll release soon.

- vLLM version: v0.9.2
- vLLM main:
7bd4c37ae7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:36:05 +08:00
zhangxinyuehfad 1b4a2f3817
[CI] Add accuracy ci for DP and EP and TP and ETP (#1140)
### What this PR does / why we need it?

Add accuracy ci for DP and EP and TP

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
35514b682a

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 17:25:17 +08:00
Pr0Wh1teGivee d13fb0766e
[Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
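
For context, the operation being optimized is the standard top-k/top-p filtering of logits; a generic reference sketch (not the patched Ascend kernel) looks roughly like this:

```python
import torch


def apply_top_k_top_p(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Generic top-k/top-p logit filtering (reference sketch, not the patched op)."""
    # Top-k: mask everything below the k-th largest logit per row.
    if k > 0:
        kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Top-p: keep the smallest prefix of sorted tokens whose cumulative prob covers p.
    if p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = probs.cumsum(dim=-1)
        # Drop a token if the cumulative prob *before* it already exceeds p,
        # which always keeps at least the most likely token.
        drop = (cumulative - probs) > p
        sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(
            dim=-1, index=sorted_idx, src=sorted_logits)
    return logits
```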
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut

- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
weiguihua2 aa4240c67f
Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1

### How was this patch tested?
Manual test

- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
zhangxinyuehfad 1cd27da5fb
[Test] Remove VLLM_USE_V1 in accuracy test (#1739)
### What this PR does / why we need it?
Remove VLLM_USE_V1 in accuracy test

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 15:29:11 +08:00
ttanzhiqiang ee40d3d850
use npu_moe_gating_top_k_softmax (#1355)
### What this PR does / why we need it?
The optimization for non-DeepSeek select_experts is to replace the
softmax+topk+to sequence with the fused `npu_moe_gating_top_k_softmax` op,
which reduces the kernel time from 37 us to 14 us on bf16/fp16 for qwen3-235b.
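
For reference, a generic sketch of the unfused softmax + topk + cast path that the fused NPU op replaces (illustrative only, not the vllm-ascend code):

```python
import torch


def select_experts_unfused(router_logits: torch.Tensor, top_k: int):
    """Baseline softmax + topk + dtype cast that the fused op replaces (sketch)."""
    # Softmax over the expert dimension to get routing probabilities.
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    # Pick the top-k experts per token.
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Cast back to the activation dtype (the "to" in the description).
    return topk_weights.to(router_logits.dtype), topk_ids
```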

- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:55:06 +08:00
ttanzhiqiang 9d16c9982e
rm router logits Improve TTOP 3ms (#1407)
### What this PR does / why we need it?

Previously, the code computed `router_logits, _ = self.gate(hidden_states)` and
then ran `all_gather` on both `hidden_states` and `router_logits`. This PR
changes the two all_gathers into one, removing one all_gather communication:
gather `hidden_states` first, then compute `router_logits` from the gathered
tensor (see the sketch below).
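
The snippet from the description, laid out as before/after (illustrative fragment from the MoE forward, not a self-contained module):

```python
# Before: gate first, then two all_gather calls.
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)

# After: one all_gather on hidden_states, then gate on the gathered tensor.
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)
```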

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

gsm8k accuracy verification
<img width="1809" alt="截屏2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>



- vLLM version: v0.9.2
- vLLM main:
77f77a951e

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:53:17 +08:00
ApsarasX 0fc9b56d40
[Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="截屏2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="截屏2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
Mengqing Cao cc210f46e6
[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718)
### What this PR does / why we need it?

Now there is no need to calculate `num_draft_tokens` when allocating
slots.

This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test

- vLLM version: v0.9.2
- vLLM main:
cc876d0f29

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 18:47:45 +08:00
wangxiyuan 011fd73a48
[CI] Make CI tracker more clear (#1720)
1. Enable lint check for all changes.
2. Only run UT and e2e if it is a code change.
3. Only run UT and disable e2e if the change is UT-only.
4. Disable wheel build for the push case.
5. Run unit tests when a PR is merged.
6. Remove the useless pytest.ini.

- vLLM version: v0.9.2
- vLLM main:
fdfd409f8f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 16:03:23 +08:00
wangxiyuan 3d1e6a5929
[Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer.
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 14:26:59 +08:00
Li Wang c7446438a9
[1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid low-level errors as much as possible.

This PR is one step of #1241. The purpose is to make the linting system clearer
and more convenient. This step mainly adds the following hooks:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO:
- clang-format (check csrc with Google style): needs code cleanup, disabled for now
- pymarkdown: needs code cleanup, disabled for now
- shellcheck: needs code cleanup, disabled for now

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
ApsarasX 643e6f5486
[Bugfix] Fix accuracy problem caused by mask pollution (#1678)
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.

The root cause of this problem is the use of in-place multiplication.
Modifying it to use out-of-place multiplication will resolve the
accuracy problem.
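
As an illustration of the failure mode (a generic sketch, not the actual attention-mask code):

```python
import torch

# Shared base mask, cached and reused across requests.
attn_mask_cache = torch.ones(128, 128)


def chunk_mask_in_place(scale: float) -> torch.Tensor:
    # Buggy pattern: in-place multiply mutates the shared cache, so later
    # requests that reuse attn_mask_cache see a corrupted mask.
    return attn_mask_cache.mul_(scale)


def chunk_mask_out_of_place(scale: float) -> torch.Tensor:
    # Fixed pattern: out-of-place multiply leaves the cached mask untouched.
    return attn_mask_cache * scale
```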


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 14:06:49 +08:00
ttanzhiqiang 60519c71bd
shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts does not do its own all_reduce
in the MLP; instead, it waits until shared_experts + router_experts are both
computed before doing a single all_reduce.
In both prefill and decode, as long as shared_experts + router_experts share
one all_reduce, there is a benefit (see the sketch below).
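
Conceptually the change looks like the following (a generic torch.distributed sketch under the assumption that shared and routed outputs are summed, not the vllm-ascend implementation):

```python
import torch
import torch.distributed as dist


def moe_forward_merged_all_reduce(hidden_states: torch.Tensor,
                                  shared_experts, router_experts) -> torch.Tensor:
    # Neither branch performs its own all_reduce here...
    shared_out = shared_experts(hidden_states)
    routed_out = router_experts(hidden_states)
    out = shared_out + routed_out
    # ...instead, the combined output is reduced once.
    dist.all_reduce(out)
    return out
```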
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
Yikun Jiang 997f156a51
Use ci_vllm_version when recording vLLM commit (#1689)
### What this PR does / why we need it?
Use ci_vllm_version when recording vllm commit

Followup on https://github.com/vllm-project/vllm-ascend/pull/1623

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Tested manually.
$ python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"'
v0.9.2
- Test on my local repo: https://github.com/Yikun/vllm-ascend/pull/35

- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-10 11:07:27 +08:00
ApsarasX 89c1a0f006
[Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380)
### What this PR does / why we need it?
In some cases, `dist._functional_collectives.reduce_scatter_tensor` can
cause its input tensor not to be released immediately after the current
layer ends. Instead, it will only be released when the GPU memory usage
of the current process reaches a certain threshold (approximately every
15 layers each time).

**Before Fix**

<img width="1441" alt="截屏2025-06-24 01 26 13"
src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0"
/>

**After Fix**
<img width="1475" alt="截屏2025-06-24 01 23 43"
src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.9.1
- vLLM main:
9ff2af6d2b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 10:57:24 +08:00
Mengqing Cao b1c66b211f
[CI] Fix lint in CI (#1712)
### What this PR does / why we need it?
Fix lint in CI
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 10:47:18 +08:00
Li Wang 0c4aa2b4f1
[Doc] Add multi node data parallel doc (#1685)
### What this PR does / why we need it?
Add multi-node data parallel doc.
### Does this PR introduce _any_ user-facing change?
Add multi-node data parallel doc.
### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
805d62ca88

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 09:36:37 +08:00
leo-pony b4b19ea588
[Doc] Add multi-npu qwen3-MoE-32B Tutorials (#1419)
### What this PR does / why we need it?
Add multi-npu qwen3-MoE-32B Tutorials
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-10 09:06:51 +08:00
xleoken 3ef45d0cc2
feat: Improve the offline_inference npu v0/v1 scripts (#1669)
### What this PR does / why we need it?

Improvements:
- Keep the same file name format as v1: `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 explicitly in the Python scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` does not exist on ModelScope or HF.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: xleoken <xleoken@163.com>
2025-07-09 17:03:53 +08:00
Shanshan Shen 6af35f60cc
[Bugfix][CI] Remove V0 Spec Decode CI (#1656)
### What this PR does / why we need it?

To solve the error in the CI of long term test:

```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```

Replace the hf model with modelscope model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-09 15:53:58 +08:00
wangxiyuan b979ee353d
[Misc] Code clean up (#1679)
Make model_runner_v1 more readable

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 14:33:40 +08:00
wangxiyuan 392fd7239b
[Misc] Add attention mask (#1673)
Move the attention mask from V0 to a common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
wangxiyuan cc1588be50
[Misc] Code clean up (#1674)
Remove useless function
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:54:12 +08:00
wangxiyuan 830332ebfc
Clean up v0.9.1 code (#1672)
vLLM has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Icey 0d4bc03946
Fix wheel glibc version incompatibility (#1582)
### What this PR does / why we need it?
- Fixes https://github.com/vllm-project/vllm-ascend/issues/1533

### How was this patch tested?
1. Run the image
```
docker run \
    --name cann_container \
    --device /dev/davinci6 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -it  quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.11 bash
```

2. Install package
torch=2.5.1
torch-npu=2.5.1.post1.dev20250619 
vllm=0.9.1

vllm-ascend=vllm_ascend-0.1.dev1+g02ac443-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Artifact download URL:
https://github.com/vllm-project/vllm-ascend/actions/runs/16039661265/artifacts/3454481370

3. Test offline script

```
from vllm import LLM, SamplingParams

import os
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
]

llm = LLM(model="Qwen3/Qwen3-1.7B")

outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

4. Results

![result](https://github.com/user-attachments/assets/20f9d923-00ce-4a2d-8598-9b216045705d)

- vLLM version: v0.9.2
- vLLM main:
b942c094e3

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-08 18:46:02 +08:00
Yikun Jiang e4e9ea02ab
Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades the vLLM version to v0.9.2. It does not remove the
v0.9.1 compatibility code, to keep the review easy.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
NeverRaR 71de52d3a9
feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1. Sometimes loading the torchair cache fails because of fluctuations in NPU
memory, so this PR adds a new cache that saves the old KV cache size in bytes
to avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled but the cache does not exist yet, the first
compilation introduces the overhead of Dynamo guards. In this case, we
compile twice directly to skip them (this brings 3-4 ms of TPOT
optimization).

### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control kv cache floating tolerance

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
NeverRaR df84cceca8
perf: use multicast to avoid padding decode request to prefill size (#1555)
### What this PR does / why we need it?
perf: use multicast to avoid padding decode request to prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
wm901115nwpu f08c4f15a2
fix spell error (#1654)
Fix the spell error in code

- vLLM version: v0.9.1
- vLLM main:
923147b5e8

Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
2025-07-07 20:24:42 +08:00
Mengqing Cao f2a20393a2
[CI] Fix mypy check in CI (#1655)
### What this PR does / why we need it?
Fix mypy check in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654

Mypy failed due to the newer numpy version. We need to pin
`numpy=1.26.4` in vllm-ascend.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 20:19:16 +08:00
Angazenn 18495f44b2
[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug caused by the max_num_tokens_across_dp calculation. In
earlier versions, we computed it as graph_pad_size plus the actual
max_num_tokens, which results in different max_num_tokens_across_dp values
across DP ranks. If padding is required, this might cause wrong padding.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
Zheng Wengang 9c886d0a1f
[EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides DeepSeek-provided mechanism to balance expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
wangyanhui-cmss 4e29c5a808
Add ut for test_pooling_model_runner.py (#1640)
### What this PR does / why we need it?
Add UT for test_pooling_model_runner.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
`python -m unittest test_pooling_model_runner.py`


- vLLM version: v0.9.1
- vLLM main:
2e610deb72

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-07-07 17:12:11 +08:00
Yikun Jiang 493768eb30
Record vLLM commit in PR description (#1623)
### What this PR does / why we need it?
This patch enables the vllm commits recording and also cleanup unused
commit msg note in PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- Test on https://github.com/Yikun/vllm-ascend/pull/33 and vllm commit
refreshed as expected.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-07 10:20:38 +08:00
Mengqing Cao 7efa4e92fe
[CI] Fix oom in chunk prefill (#1622)
### What this PR does / why we need it?
Add the resource clear logic to fix oom issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 10:14:40 +08:00
ApsarasX c58accc15e
[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
zhangxinyuehfad 14373f65d7
[Test] Remove V0 accuracy test and enable MoE and VL test on V1 (#1574)
### What this PR does / why we need it?
Update accuracy test
1. remove accuracy report on V0
2. add parallel and execution mode
3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-06 11:10:19 +08:00
Yikun Jiang 0c1d239df4
Add unit test local cpu guide and enable base testcase (#1566)
### What this PR does / why we need it?
Use BaseTest and clean up all manual patch code
- Cleanup EPLB config to avoid tmp test file
- Use BaseTest with global cache
- Add license
- Add a doc to setup unit test in local env 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 10:42:27 +08:00
Vincent Yuan eb390545ec
[Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333,
this PR disables the JIT compiler for the 310P and changes the data
format of the weights in the vocabulary embedding and QKV projection
layers to NZ, which helps improve performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao dd22ac38b2
[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly disabled because it does not work right now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
wangxiyuan 343955c7ac
[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vLLM reverted the change to FusedMoEParallelConfig.

This PR does the same to fix the CI error.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00
zhangxinyuehfad 4e910186de
[CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
Angazenn a5f33590d3
[CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM Ascend only allowed DeepSeek to use torchair. Now we
can also use it with PanGu. Besides, we add a supported-model list to
control which types of models can use torchair.

### How was this patch tested?
We have tested it with PanguProMoE on both 800I A2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn 9fbd8017c0
[Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This PR supports W8A8 on the 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
Yikun Jiang 6d7cb14a24
Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken e511ddd67d
[Bug] Fix wrong modescope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` should be placed before
module imports

If not, the following error occurs:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
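
The fix amounts to setting the environment variable before any vLLM import; a minimal sketch of the correct ordering (model name taken from the traceback above):

```python
import os

# Must be set before importing vllm, otherwise the config lookup goes to
# huggingface.co instead of ModelScope.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM  # noqa: E402  (intentionally imported after env setup)

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```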

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
wangxiyuan a45dfde283
[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy.

1. c1909e7e8c changed the FusedMoEConfig init.
2. 48fb076cbc changed the input batch logic.

This PR adapts vllm-ascend to these changes.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
yupeng d96da1f00c
[DOC] Fix word spelling (#1595)
### What this PR does / why we need it?
Fix word spelling in DOC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 21:42:39 +08:00
zhanghw0354 9fb3d558e5
[Test]Add unit test for platform.py (#1476)
### What this PR does / why we need it?
According to issue #1298 , this pull request adds unit test code for
platform.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhuyilin <809721801@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>
2025-07-02 17:46:06 +08:00
Li Wang 30bf7014d0
[Bugfix] Add func `swap_states` to fix MLA attention (#1580)
### What this PR does / why we need it?
MLA attention is still using the gpu_input_batch attribute `swap_states`, which leads to
an error: `AttributeError: 'InputBatch' object has no attribute 'swap_states'`

This PR fixes the MLA input batch error
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Mengqing Cao 59237ea788
[CI/UT] Add test for chunk prefill and prefix cache on v1/AscendScheduler (#1505)
### What this PR does / why we need it?
Add test for chunked prefill and prefix cache on v1/AscendScheduler

Covered scenarios:
- `Qwen/Qwen3-0.6B-Base` and `deepseek-ai/DeepSeek-V2-Lite-Chat` ---
multicard CI time increased by 19 min
- `V1 + default scheduler` vs `V1 + default scheduler + enable prefix
cache`
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable prefix
cache` vs `V1 + Ascend scheduler + enable prefix cache + enable chunked
prefill`
- `Qwen/Qwen3-0.6B-Base` --- singlecard CI time increased by 8 min
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable chunked
prefill`

should rebase after #1498 and #1446
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:57:03 +08:00
Zhu Yi Lin 6b80c5acba
Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. Drop some useless code for W8A8 fused MoE.
2. Add int8 KV cache check.
3. Add more UT.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
Agonixiaoxiao 7fc1a98489
add ut for kv tansfer module (#1531)
### What this PR does / why we need it?
Test KV data transfer, covering the connector, pipe, and buffer.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: lixudong <lixudong@cmss.chinamobile.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:14:52 +08:00
Yikun Jiang aa5fa07478
Only enable single version for wheel pr build (#1571)
### What this PR does / why we need it?
Only enable single version for wheel pr build to speedup PR triggered CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 14:50:34 +08:00
yupeng c3c8c9317c
[DOC] add LoRA user guide (#1265)
### What this PR does / why we need it?
Add LoRA user guide to DOC. The content refers to [LoRA
Adapters](https://docs.vllm.ai/en/latest/features/lora.html).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 14:41:31 +08:00
Li Wang f39365d2ea
[Benchmark] Fix error msg upload in performance benchmark (#1559)
### What this PR does / why we need it?

Make sure that None parameters are not passed in for `--error`
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed locally

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 14:06:08 +08:00
wangxiyuan 641a4e6092
[CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vLLM change
7f280d69c9
breaks vllm-ascend.

This PR fixes the broken CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone 0e43813120
[ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

Found CI broken by vLLM's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`.
Modified the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing CI will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Li Wang 6db7dc2c85
[Benchmark] Refactor perf script to use benchmark cli (#1524)
### What this PR does / why we need it?

Since the `vllm bench` CLI is now optimized enough for production use
(supporting more datasets), we no longer need to copy vLLM code; with vLLM
installed, we can easily use the benchmark CLI.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-30 23:42:04 +08:00
leo-pony 53ec583bbb
[Docs] Update Atlas 300I series doc and fix CI lint (#1537)
### What this PR does / why we need it?
- Update Atlas 300I series doc: clean up unused parameters and enable
optimized ops
- Fix code spell CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 23:34:00 +08:00
wangxiyuan a054f0f4ca
[CI] change to new ds model (#1513)
Previously, the DeepSeek V3 pruning weights were not correct and the MoE
layer was not tested. We update to a new pruning model to enable the MoE
layer compute.

This PR fixes the CI to work with the new weights.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-30 19:02:29 +08:00
Shanshan Shen 8013634e9c
[Structured Output] Remove redundant check for `grammar_bitmask` (#1459)
### What this PR does / why we need it?
Remove the redundant check since we already check this at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
Shanshan Shen ba577dfc52
[Doc] Add Structured Output guide (#1499)
### What this PR does / why we need it?
Add Structured Output guide.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:21:44 +08:00
whx f286265791
[BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498)
When using AscendScheduler with prefix cache enabled and chunked prefill
disabled, there is an accuracy problem because there is no branch in mla_v1
to handle this scenario. This PR fixes it.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-30 16:51:20 +08:00
Li Wang 5f8241c25c
[V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add support for the v1 pooling
task. Note that I moved `vllm.v1.worker.gpu_input_batch` down into
vllm-ascend: considering the frequent changes in upstream interfaces, it is
kept here to stay decoupled.
### How was this patch tested?
CI passed with newly added/existing tests, and a simple test was first
conducted locally, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
dependabot[bot] 790c810bf7
Bump actions/github-script from 6 to 7 (#1519)
Bumps [actions/github-script](https://github.com/actions/github-script)
from 6 to 7.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-30 16:04:41 +08:00
Yikun Jiang e4df0a4395
Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 13:37:22 +08:00
Yikun Jiang cad4c693c6
Add Pangu MoE Pro docs (#1512)
### What this PR does / why we need it?
This PR adds Pangu MoE Pro 72B docs.

[1] https://gitcode.com/ascend-tribe/pangu-pro-moe-model

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 12:15:33 +08:00
yiz-liu 75d05ee200
[Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446)
### What this PR does / why we need it?

This fixes the shape of `block_table`, which was broken by the hybrid kv
groups change introduced several weeks ago.

An error is raised when prefix cache (eager mode or not) and the Ascend
Scheduler are enabled at the same time; sending two identical requests is
enough to reproduce it.
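
A minimal repro sketch based on the description above (the model name and the `additional_config` key are illustrative assumptions):

```python
from vllm import LLM, SamplingParams

# With prefix cache and the Ascend scheduler enabled, two identical prompts
# used to trigger the block_table shape error described above.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any small model works; name is illustrative
    enable_prefix_caching=True,
    additional_config={"ascend_scheduler_config": {"enabled": True}},  # assumed key
)
prompts = ["Explain how prefix caching works."] * 2  # two identical requests
outputs = llm.generate(prompts, SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```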

v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test manually

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-30 11:25:19 +08:00
Zhu Yi Lin b308a7a258
support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
Support Pangu MoE W8A8C8 quantization.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Angazenn c59d69d9e6
[PERF]support MERRouter (#1421)
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model.
Different from the original grouped top-k, it keeps only the top experts
that are allocated more tokens, so we can load fewer experts when
calculating the grouped matmul (gmm).
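
A rough sketch of the idea (illustrative only; tensor names and shapes are assumptions, not the actual MERRouter implementation):

```python
import torch

def pick_voted_experts(topk_ids: torch.Tensor, num_experts: int,
                       num_voted_experts: int) -> torch.Tensor:
    """Keep only the experts that received the most tokens in this batch,
    so fewer expert weights need to be loaded for the grouped matmul."""
    # topk_ids: [num_tokens, topk] expert indices produced by grouped top-k.
    votes = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    kept = torch.topk(votes, k=min(num_voted_experts, num_experts)).indices
    return kept  # indices of the experts worth loading

# Example: 6 tokens, top-2 routing over 8 experts, keep the 5 busiest experts.
ids = torch.randint(0, 8, (6, 2))
print(pick_voted_experts(ids, num_experts=8, num_voted_experts=5))
```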

We have tested this algorithm for PanguProMoE-72B on the 300I Duo and
800I A2 platforms. On 300I Duo, we find that setting `num_voted_experts` to
5 achieves both good performance and accuracy, while on 800I A2 we still set
it to 8 to use the original Pangu grouped top-k.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:14:49 +08:00
Angazenn 8fa188111d
[PERF]support H2P communication optimization for PanguProMoe (#1463)
### What this PR does / why we need it?
In this PR, we support the H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
to replace `all_reduce` to improve performance:

Original layer:
`input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce`

Now:
`input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter`

Besides, because `reduce_scatter` requires a token count that is divisible
by the group size, we need to pad the sequences based on
`max_tokens_across_dp`.
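
A minimal sketch of the padding step (function and argument names are assumptions):

```python
def pad_for_reduce_scatter(num_tokens: int, group_size: int) -> int:
    """Round the token count up to a multiple of the communication group size,
    since reduce_scatter requires an evenly divisible length."""
    return (num_tokens + group_size - 1) // group_size * group_size

# e.g. 1023 tokens with a group of 8 ranks get padded to 1024.
assert pad_for_reduce_scatter(1023, 8) == 1024
```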

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:10:27 +08:00
Angazenn 5c53cbaf2a
[BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug when using broadcast with the cpu_group while running
DP. The `broadcast310p` patch takes effect for both the cpu_group and the
device group, but we only need it for the device group. Hence a wrapper is
added to let the cpu_group use the native torch broadcast, which solves the
bug.
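
A hedged sketch of the wrapper idea (the patched broadcast entry point is passed in explicitly here; this is not the actual implementation):

```python
import torch.distributed as dist

def broadcast_maybe_310p(tensor, src, group=None, patched_broadcast=None):
    """Route device-group broadcasts through the 310P-specific patch, while
    letting CPU (gloo) groups keep the native torch broadcast."""
    if patched_broadcast is None or dist.get_backend(group) == "gloo":
        return dist.broadcast(tensor, src=src, group=group)  # cpu_group path
    return patched_broadcast(tensor, src=src, group=group)   # patched device path
```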

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao 2cf9c4c3a2
[CI/Build] Fix version conflict on transformers (#1490)
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with new existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 15:11:04 +08:00
Mengqing Cao 5f4391652f
[PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy
on V1.

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Using lm_eval, which uses prompt logprobs as output, to test accuracy:
```bash
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy test result of `Qwen/Qwen2.5-7B-Instruct`
on V1 is:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
Shanshan Shen 99e685532d
[Doc] Add Qwen2.5-VL eager mode doc (#1394)
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-28 09:08:51 +08:00
Mengqing Cao d59e7fa095
[CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
Shanshan Shen 3687676fa7
[Doc] Add guidance on how to implement and register new models (#1426)
### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-27 16:46:49 +08:00
wangxiyuan 5571fb7118
[Misc] Add release checklist issue template (#1447)
Add the release checklist issue template.

Every release manager should create and follow the checklist to do the
release step by step.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:15:36 +08:00
wangxiyuan 5968dff4e0
[Build] Add build info (#1386)
Add a static build_info.py file to expose SoC and sleep mode info. It helps
keep the code clean and makes the error info friendlier for users.

This PR also added the unit test for vllm_ascend/utils.py

This PR also added the base test class for all ut in tests/ut/base.py

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
Li Wang c563a08f0a
[CI] Fix nightly benchmark (#1453)
### What this PR does / why we need it?
Sometimes the performance benchmark workflow may fail. We add a notice when
the run fails and avoid uploading the dirty data from the failed run.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-26 19:39:18 +08:00
Zesheng Zong 192dbbcc6e
Optimize Patch developer guide (#1452)
### What this PR does / why we need it?
Fix some terms in the user guide.


Signed-off-by: zeshengzong <zesheng.zong@outlook.com>
2025-06-26 19:10:16 +08:00
wangyanhui-cmss e5eea64b66
[CI/UT] Add ut for parallel_state.py (#1460)
### What this PR does / why we need it?
 Add ut for parallel_state.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
 python -m unittest  test_parallel_state.py

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-26 19:03:27 +08:00
Shanshan Shen 4e2daf5ab7
[Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony 1025344912
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode.
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
sdmyzlp 53c2d58ae1
Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed in non-compiled mode despite
`torchair_graph_config.enabled`, causing multistream MLA to fail, since it
assumes the torchair-compiled mode for decode whenever
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this.
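
A minimal sketch of the augmented condition (the names follow the description above; the exact signature is an assumption):

```python
def use_multistream_mla(torchair_graph_enabled: bool,
                        with_prefill_across_dp: bool) -> bool:
    # Decode only takes the torchair-compiled multistream MLA path when graph
    # mode is enabled AND no DP rank in this step is still prefilling.
    return torchair_graph_enabled and not with_prefill_across_dp
```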

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
yiz-liu 2690697caa
[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
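
A simplified sketch of the fix (tensor names follow the description; sizes are illustrative):

```python
import torch

def reset_unused_positions(positions: torch.Tensor,
                           total_num_scheduled_tokens: int) -> None:
    """Zero out the padded tail of the positions buffer so every index stays
    within the valid range of the attention mask used by GatherV3."""
    positions[total_num_scheduled_tokens:] = 0

# e.g. a persistent buffer padded for ACL graph capture:
buf = torch.arange(16)           # stale entries from a previous, larger batch
reset_unused_positions(buf, 10)  # only the first 10 tokens are valid this step
```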

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
zhangxinyuehfad 06ccce1ddf
[FOLLOWUP] fix name and format in accuracy test (#1288) (#1435)
### What this PR does / why we need it?
Fix the accuracy test:
1. Fix the accuracy report, e.g.
https://vllm-ascend--1429.org.readthedocs.build/en/1429/developer_guide/evaluation/accuracy_report/Qwen2.5-7B-Instruct-V0.html
2. Fix creating a PR for the report

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-26 00:26:54 +08:00
Pr0Wh1teGivee 2fda60464c
[Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use the fused op `torch_npu.npu_top_k_top_p(logits, p, k)` when both p and k
are not None, otherwise fall back to the original implementation. The
replacement takes place automatically when
`VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires
torch_npu>=2.5.1.post1.dev20250619.
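
A hedged sketch of the dispatch logic (simplified; returning `None` stands in for falling back to the original sampler path):

```python
import os
import torch

def fused_top_k_top_p(logits: torch.Tensor, p: torch.Tensor | None,
                      k: torch.Tensor | None) -> torch.Tensor | None:
    """Return the fused-op result when applicable, or None so the caller can
    fall back to the original top-k/top-p implementation."""
    if (p is None or k is None
            or os.getenv("VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE") != "1"):
        return None  # caller falls back to the original sampler path
    import torch_npu  # requires torch_npu >= 2.5.1.post1.dev20250619
    return torch_npu.npu_top_k_top_p(logits, p, k)
```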

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested by DeepSeek R1 and UT passed

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW e7efc7e7e7
[BugFix] Remove not using patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
sharonyunyun 941269a6c5
adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in MLA layer's TP group,to remove
stridedsliced and all_gather in MOE layer.
when tp > 1, It is enabled during the decode phase of the graph mode
when enable_multistream_moe、MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR can bring
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
wangxiyuan 205cb85a1e
[Doc] Fix doc typo (#1424)
1. Fix the typo
2. Fix a 404 URL
3. Update the graph mode and additional config user guides

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
wangxiyuan ca884ef86d
[Misc] Clean up useless code for LLM initialize (#1373)
This PR aims to clean up the useless code for LLM setup. It helps to
make the code more clear.
1. remove useless `self.xxx` property
2. change `set_random_seed` to `seed_everything`
3. remove `set_custom_all_reduce`, it's only used for cuda

This is just a code cleanup; no change to any code logic.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
zhangxinyuehfad 0060886a37
[CI]Update accuracy report test (#1288)
### What this PR does / why we need it?
Update the accuracy report test:
1. Record commit hashes and GitHub links for both vllm and vllm-ascend in
accuracy reports
2. Add accuracy result verification checks to ensure output correctness
3. Create the PR via a forked repository workflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dense-accuracy-test:
https://github.com/vllm-project/vllm-ascend/actions/runs/15745619485
create pr via forked repository workflow:
https://github.com/zhangxinyuehfad/vllm-ascend/actions/runs/15747013719/job/44385134080
accuracy report pr:
https://github.com/vllm-project/vllm-ascend/pull/1292

Currently, the accuracy report in use is old; it needs to be merged into the
PR, retested, and the new report updated, then #1292 can be closed.


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-25 14:10:34 +08:00
Li Wang 15df8be937
[Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan e4e0b7af05
[Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer
2. Add the patch doc for the developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao 52317f92cb
[DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix DP.

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Mengqing Cao c1c5d56255
[Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Li Wang 5f5800ba42
[Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
liziyu 6ed3f00427
[Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
remove unused environment variable VLLM_ENABLE_MC2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
Mengqing Cao 20767a043c
[CI/UT] Fix disaggregated prefill ci (#1313)
### What this PR does / why we need it?
Use eager mode to run disaggregated prefill ci

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-24 17:11:00 +08:00
wangxiyuan 9cbce423ce
[MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vllm works with non-cuda platform now.
Remove this useless patch.

Which was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2)
vLLM upstream merged:
3e472d882a
(v0.8.0)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
lyj-jjj 5177bef87a
support fused_moe_allgather_ep (#1335)
### What this PR does / why we need it?
support fused_moe_allgather_ep

### How was this patch tested?
It was tested by UT.

Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
2025-06-23 22:03:38 +08:00
Yikun Jiang 917c6b71af
[TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- add system package installation
- Add doc for run doctests
- Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change schedule job from 4 ---> 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey 08cfc7cb4b
Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md for adding pip extra index of torch-npu

### How was this patch tested?
No need

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2 e1123172d1
[Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users re-install vllm-ascend with pip, the `build`
folder should be removed first.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
linfeng-yuan 15592c0d48
[bugfix] fix accuracy prolem for deepseek V3/R1 models with torchair graph in long sequence predictions (#1331)
### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark serving,
and got a score of 83.33 on the AIME2024 dataset with a DP4TP4EP16 setting.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-23 09:52:27 +08:00
zxdukki f04c6763d8
[Bugfix] fix env variable in dbo (#1284)
### What this PR does / why we need it?
Fix the env variable in DBO to enable DBO in the DeepSeek-V3 model. Besides,
we have fixed a known issue in deepseek-dbo.


### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
It can be verified with pytest.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-23 09:07:57 +08:00
Shanshan Shen 21fb68a03a
[CI] Update guided decoding ut (#1312)
### What this PR does / why we need it?
Update guided decoding ut.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-23 09:06:20 +08:00
wemaster 339d6894f6
[CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why was the ngram problem
not caught when PR1109 was merged? A: The newly introduced problem only
appears when tp>1, and the use cases on CI are all tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including the eagle speculative UTs, which left
CI unable to cover the eagle function. I added it
(`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it. It was because vLLM's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed them in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Pleaplusone 7e6efbf2a9
update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI testing will guarantee the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken 4447e53d7a
[Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang a95afc011e
[CI] Enable merge trigger unit test and accuracy test schedule job (#1345)
### What this PR does / why we need it?
- Enable merge-triggered unit test and accuracy test schedule jobs
- Pin lm-eval==0.4.8 to resolve Qwen3 8B accuracy
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 17:21:57 +08:00
Yikun Jiang 2e5f312530
Cleanup unused doc (#1352)
### What this PR does / why we need it?
Clean up the unused doc for the MoGE model; we will add it back when the
MoGE model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang c30ddb8331
Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
Yikun Jiang 097e7149f7
[Platform] Add initial experimental support for Atlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes the
PRs below into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas 300I Duo series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit tests are missing because a real 310P image is needed for testing;
they will be added in a separate PR later.
- Manual e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
Yikun Jiang 2009fdb8da
[Test] Enable code cov for V1 and enable push trigger (#1164)
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-21 00:01:05 +08:00
Angazenn 2f1266d451
Support Pangu Pro MoE model (#1204)
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)

### Does this PR introduce _any_ user-facing change?
Yes, new model supported

### How was this patch tested?
Test locally

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-20 23:59:59 +08:00
yuancaoyaoHW 00ae250f3c
[V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use the Eagle Proposer by setting appropriate
flags in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without the Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU. The acceptance rate
results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
wangxiyuan 45be1aac0c
[CI] Add codespell check for doc (#1314)
Add a codespell check test for doc-only PRs.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions 761bd3d9d7
Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
yiz-liu 2c7dd85fd8
[Fix] Fix the token-wise padding mechanism (#1300)
### What this PR does / why we need it?
Fix the token-wise padding mechanism.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-20 14:46:17 +08:00
wangxiyuan b350edae9a
[UT] refactor test_expert_load_balancer and fix broken CI (#1293)
Refactor test_expert_load_balancer to keep the UT code style.

This PR also fixes the breaking change from
https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280

This is just a quick fix; we'll support embedding on V1 later.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1299

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 01:02:52 +08:00
songshanhu07 ebb2a70dbb
static EPLB fix bug, add unit test (#1186)
### What this PR does / why we need it?
1. Add a static EPLB unit test.
2. Fix a bug: a Tensor cannot be directly evaluated in an if statement.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Run the unit test.

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
2025-06-18 19:46:56 +08:00
Shanshan Shen 2cd8ecdc4f
[Bugfix][Spec Decode] Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode (#1258)
### What this PR does / why we need it?

Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.

Find more details at **mengwei805**'s comment in
https://github.com/vllm-project/vllm-ascend/pull/1123.

### Does this PR introduce _any_ user-facing change?

The user will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE`
(`ACL_OP_INIT_MODE`).

### How was this patch tested?

Test scripts:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Results:

```
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 76.70it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s]
Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James'
```

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-18 17:50:20 +08:00
zzzzwwjj db2f630aeb
[bugfix] fix deepseek with mc2 (#1268)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-18 00:58:38 +08:00
whx d7e19ed57a
[BugFix] fix length of sin/cos cache in rope (#1266)
This PR fixes a bug where the sin/cos cache was constructed shorter than
the model's max position embedding.
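
A simplified sketch of the intended behaviour (standard RoPE cache construction; not the actual vllm-ascend code):

```python
import torch

def build_rope_cache(head_dim: int, max_position_embeddings: int,
                     base: float = 10000.0):
    """Build the sin/cos cache over the model's full max position range,
    rather than only over the longest sequence seen so far."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_position_embeddings).float()
    freqs = torch.outer(positions, inv_freq)  # [max_pos, head_dim // 2]
    return freqs.cos(), freqs.sin()
```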

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-17 23:14:25 +08:00
Jade Zheng afc8edb046
[Bugfix]: Pass scaling args to mc2 (#1202)
Pass `expert_scale` and `expand_scale` args to the dispatch and combine
functions.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-06-17 22:16:44 +08:00
Li Wang f8029945c3
[Bugfix] Remove cuda related lines and add additional pip mirror (#1252)
### What this PR does / why we need it?
- For the NPU environment, we should use `PYTORCH_NPU_ALLOC_CONF` rather
than `PYTORCH_CUDA_ALLOC_CONF`
- Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy


---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-17 21:25:40 +08:00
zzzzwwjj 23ca68d0c8
[refactor] Refactoring AscendFusedMoE (#1229)
### What this PR does / why we need it?
This PR resolves [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147):
1. Move the fused_moe code into one file, `fused_moe.py`.
2. Integrate the branch conditions into the function `get_fused_moe_state`
(a simplified sketch follows this list).
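
A simplified sketch of what such a helper can look like (the states and the selection rule are assumptions for illustration, not the exact vllm-ascend logic):

```python
from enum import Enum

class FusedMoEState(Enum):
    AllGather = "allgather"
    All2All = "all2all"
    MC2 = "mc2"

def get_fused_moe_state(ep_size: int, with_prefill: bool) -> FusedMoEState:
    # One place that decides which communication path the fused MoE uses,
    # instead of scattering these branch conditions across the code base.
    if ep_size == 1:
        return FusedMoEState.AllGather
    return FusedMoEState.All2All if with_prefill else FusedMoEState.MC2
```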

### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`. This env is unnecessary: we
can make the judgment based on the current scenario without it, and keeping
it only increases complexity.
2. This PR removes the env `USING_LCCL_COM`, because this env has
already expired.
3. `additional_config.expert_tensor_parallel_size` has also expired; we now
use the vLLM parameter `enable_expert_parallel` instead, consistent with
vLLM (see the example below).
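A minimal launch sketch under these changes; it assumes the standard vLLM `--enable-expert-parallel` flag, and the model name and parallel sizes are illustrative rather than taken from this PR:

```bash
# Before: expert parallelism was tuned via additional_config.expert_tensor_parallel_size
# plus the VLLM_ENABLE_MC2 / USING_LCCL_COM env vars.
# After (sketch): rely on the plain vLLM expert-parallel flag instead.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --tensor-parallel-size 4 \
    --enable-expert-parallel
```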

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-17 17:49:03 +08:00
Yikun Jiang 05dec7eda9
[Doc] Refactor and init user story page (#1224)
### What this PR does / why we need it?
This PR refactor the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview locally

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 09:36:35 +08:00
Yikun Jiang 9d3cbc0953
[Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Install doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Co-authored-by: wangli <wangli858794774@gmail.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao 96fa7ff63b
[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix the rank set in the DP scenario. The new PoC version of torch-npu supports
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
zhuo97 f5404dc650
Fix the device error when using ray as vllm-acend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
wangxiyuan 69b817ed65
[CI] Add unit test framework (#1201)
This PR adds the unit test framework to enable UT for vLLM Ascend. Unit
tests run on CPU machines. They'll run once the lint check passes, the
same as the e2e tests.

For unit tests, this PR creates a new folder called `ut` under the `tests`
module. The files in `ut` should mirror the code layout in
`vllm-ascend`, and each file name should start with the `test_` prefix. For
example, in this PR `test_ascend_config.py` is added to test
`ascend_config.py`.

A new file `worker/test_worker_v1.py` is also added as a placeholder.
It should hold the unit tests for `vllm-ascend/worker/worker_v1.py`.

Additionally, a new `fake_weight` folder is added. It contains the
config.json from `facebook/opt-125m`, so that the tests do not always
need to reach Hugging Face.

TODO:
We should add all the unit test files one by one in the future.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-16 18:32:28 +08:00
Yikun Jiang 966557a2a3
[Build] Speedup image build (#1216)
### What this PR does / why we need it?
1. Rename workflow name to show OS info
2. Speedup image build:
- PR: only arm64 build on openEuler arm64, only amd64 build on Ubuntu
amd64
- Push/Tag: still keep the original logic, using QEMU on amd64

This PR actually drops the per-PR e2e image build, but I think that's fine
considering it's stable enough; if we still run into problems we can revert
this PR.

43-44mins ---> about 8-10 mins

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-16 09:02:53 +08:00
Yikun Jiang 4ce860a2be
[CI] Make e2e test to be preemptible and simple (#1217)
### What this PR does / why we need it?
This PR makes the e2e test simpler. It brings some repeated code between
single-card and multi-card jobs, but we no longer struggle with
max-parallel, matrix and concurrency settings:
1. This PR makes the e2e test preemptible and simple:
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- Any new push to a PR cancels the previous job, whatever the job
is: lint / e2e / multi-card
2. Use Modelscope rather than hf-mirror
3. Resolve errors like `Canceling since a higher priority waiting
request for pr-XXXX-limit-npu-4 exists`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- the e2e test is canceled when the patch is updated

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-15 22:07:43 +08:00
ttanzhiqiang 4270682383
Waiting for BMM NZ support(Improve TPOP 2ms performance) (#1131)
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because this position will be
fused into TransposeBatchMatmul, which does not support NZ. The weights
are actually converted back to ND in each run.

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, a 2ms TPOT improvement.

### How was this patch tested?
use #1101

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-15 19:57:02 +08:00
22dimensions 0d2074a1ec
[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226)
`os.environ["VLLM_USE_V1"]` must be assigned a str value, not any other type.


![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)
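A minimal illustration of the point (hypothetical snippet, not copied from the docs):

```bash
# os.environ values must be strings; assigning an int raises
# "TypeError: str expected, not int".
python -c 'import os; os.environ["VLLM_USE_V1"] = "1"; print(os.environ["VLLM_USE_V1"])'
# Wrong: os.environ["VLLM_USE_V1"] = 1
```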

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-15 15:41:11 +08:00
fems14 ab5d110fcc
vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend supports chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00
Mengqing Cao a3b5af8307
[CI/UT][Graph] Add ut for torchair graph mode (#1103)
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-06-14 16:59:00 +08:00
Yikun Jiang 94a52cf577
Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203)
### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with solid reviews in the P/D
disaggregation and DeepSeek improvement areas (about 30+ high-quality reviews), such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason I nominated
him: helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the key
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience with AI infra. He has very good habits, explains all changes in great
detail (#issue-3023082580) and shares results openly
(#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community involvement: actively involved in community discussion; he is
collaborative and helps users solve problems, participating in 30+ PRs and issues,
such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-13 18:25:50 +08:00
whx 47b507b180
[CI] Recover ut for ascend scheduler only in ci of v1. (#1180)
The last PR [#943 ](https://github.com/vllm-project/vllm-ascend/pull/943)
wrongly enabled the AscendScheduler UT in the V0 CI; this PR fixes that
and only runs it in the V1 CI.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-13 07:51:23 +08:00
sdmyzlp e72f94e38f
Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. We therefore keep them in
the main stream and only require them to be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by a later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
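A hedged example of turning the switch on; the model path and other serving flags are illustrative:

```bash
vllm serve /path/to/DeepSeek-V2 \
    --additional-config '{"torchair_graph_config": {"enabled": true, "enable_multistream_mla": true}}'
```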

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
Wan_Danfeng 55c0e68883
[Doc] Add Referer header for CANN package download url. (#1192)
### What this PR does / why we need it?
fix the CANN download url

### Does this PR introduce _any_ user-facing change?
no, do not have any user-facing change

### How was this patch tested?
Run the **wget** command; the CANN package is downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-06-12 21:22:23 +08:00
wangyanhui-cmss c6e2a5fb40
[fix] fix bug in 1p1d disaggregated_prefill example (#1184)
### What this PR does / why we need it?
fix  bug in 1p1d  disaggregated_prefill  example

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested with `python find_device_ips.py` and by running the
disaggregated_prefill example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-12 19:40:58 +08:00
Li Wang 37f4469a03
[CI][Benchmark] Add qwen2.5-7b test (#1104)
### What this PR does / why we need it?
- Add a qwen2.5-7b performance benchmark; this is a sub-PR of #1099 for the
v1 test and needs more verification
- Fix getting the commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:47:30 +08:00
Li Wang dd207cb261
[CI][Benchmark] Add new model and v1 test to perf benchmarks (#1099)
### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:46:41 +08:00
ttanzhiqiang 2498d297ae
add custom ascendc kernel vocabparallelembedding (#796)
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also included in this PR.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-12 10:44:33 +08:00
whx 3393d53b36
[Scheduler][MTP] Add support for speculative decoding in AsecendScheduler. (#943)
This PR adds support for speculative decoding in AscendScheduler.
It also includes part of the support for disaggregated prefill; full support
will be merged in a follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-11 20:55:44 +08:00
wangxiyuan 4f5964420e
[CI] Upgrade vllm to 0.9.1 (#1165)
1. Upgrade vLLM to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs at 0.9.0 until we publish the first 0.9.1 release.
2. Disable the V0 test for PRs
3. Move the actionlint check to the lint job

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
chenwaner e46dc142bf
Enable kvcache_nz for the decode process in torchair graph mode (#1098)
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True` (see the example below).
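For instance; apart from the `enable_kv_nz` key, the launch flags and model path are illustrative:

```bash
vllm serve /path/to/DeepSeek-R1 \
    --additional-config '{"torchair_graph_config": {"enabled": true, "enable_kv_nz": true}}'
```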

How was this patch tested?
1. Tested on a deepseek model:
with batch size 64 and seq_len 1k+3k, the total FA time over 61 layers improves from
20.80ms to 19.76ms
2. operator precision test: 

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; a curl of one result is normal

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588

---------

Signed-off-by: chenwaner <861645847@qq.com>
2025-06-11 14:09:28 +08:00
yz 4153a5091b
[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159)
Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
2025-06-11 11:03:37 +08:00
ttanzhiqiang 980cd81466
etp best a2 (#1101)
### What this PR does / why we need it?
Best performance for a single machine with 16 cards running deepseek-r1:
attention (tp8/dp2) / MoE (ETP).

rely on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
+ https://github.com/vllm-project/vllm-ascend/pull/910
+ [Reduce _npu_flash_attention mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100
+ [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
depeng1994 860a5ef7fd
provide an e2e guide for execute duration profiling (#1113)
### What this PR does / why we need it?
provide an e2e guide for execute duration profiling


Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-11 10:02:11 +08:00
sdmyzlp 7bdc606677
Support multistream of shared experts in FusedMoE (#997)
Contains #1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of
shared experts are forced to be replicated across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
Mengqing Cao 04abfd8721
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163)
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: https://github.com/vllm-project/vllm-ascend/issues/1162

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-11 07:31:13 +08:00
22dimensions 8b48daaa44
[CI] rename Qwen2.5-0.5B-Instruct-W8A8 model (#1145)
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-11 06:18:32 +08:00
Mengqing Cao 8dd686dfa2
[MLA][Graph] Improve assertion on Graph mode with MLA (#933)
### What this PR does / why we need it?
Improve assertion on Graph mode with MLA.

When running deepseek in graph mode, the fused MLA op only supports
`numHeads / numKvHeads ∈ {32, 64, 128}`, so we improve the assertion
message here to avoid confusing users.

### Does this PR introduce _any_ user-facing change?
Adjusting tp size is required when running deepseek-v3/r1 with graph
mode. deepseek-v2-lite is not supported in graph mode.

### How was this patch tested?
Test locally as the CI machine could not run V3 due to the HBM limits.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-10 22:26:53 +08:00
Pleaplusone 291c216898
fix torchair execute issue on padding data, and mtp padding logic (#1160)
### What this PR does / why we need it?
The former PR https://github.com/vllm-project/vllm-ascend/pull/736
selects the valid tokens inside `input_ids` and `position_ids`, which breaks
the padding required by torchair. In this PR, we move the padding
logic to after the multimodal part.


Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-10 22:20:40 +08:00
wangxiyuan 95414bae70
[CI] Run e2e after pre check pass (#1132)
Make sure the lint check passes before starting the e2e test, to save compute
resources.

Also update the patch doc to make sure the CI works as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-10 17:18:09 +08:00
wangxiyuan b75cb788dd
[Bugfix] add compilation/__init__.py to fix import error (#1152)
1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
python module
2. Fix a model runner bug to stay consistent with vLLM
3. Add the release note for 0.9.0rc2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-10 17:14:25 +08:00
zhangxinyuehfad e68e81f2ce
[CI] Make accuarcy CI and report work (#1078)
### What this PR does / why we need it?
Make the accuracy CI and report work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually reviewed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-10 14:35:44 +08:00
Yikun Jiang 71aee6f97d
Update 0.9.0rc1 contributors info (#1148)
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-10 13:29:09 +08:00
22dimensions 5cd5d64242
[CI] remove old quantization model (#1003)
Remove the old quantization model; new models will be added to the test cases
later.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-10 10:07:36 +08:00
linfeng-yuan 706de02317
[fix] fix compatibility for non-EPLB scenarios (#1142)
### What this PR does / why we need it?
Fix incompatibility problem for non-EPLB scenarios in #1116 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested with online serving and e2e CI.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-10 08:39:24 +08:00
wangxiyuan 571f88f85e
[Doc] Update 0.9.0rc1 release date (#1139)
1. Update the 0.9.0rc1 release date
2. Update the feature and model support list
3. Add the DP known issue to the release note

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 22:51:02 +08:00
whx cd2f14a1b3
[MTP][V1] Adapt mtp with graph mode in v1. (#1023)
Adapts deepseek MTP to torchair graph mode in v1.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-09 22:21:42 +08:00
wangxiyuan 5ac4872f5e
[Doc] Add 0.9.0rc1 release note (#1106)
Add the release note for v0.9.0rc1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 19:39:21 +08:00
Yuxiao-Xu 6b853f15fe
Add static EPLB (#1116)
### What this PR does / why we need it?
Add EPLB expert map import capabilities.
### Does this PR introduce _any_ user-facing change?
When importing the EPLB expert map, you need to pass the expert map file via
the vLLM `additional_config` argument.
### How was this patch tested?
1. You need to collect expert hotness and generate an expert placement
file based on the hotness and the EPLB algorithm, or you can directly
use an existing expert placement table.
2. When launching vLLM, enable EC2 and pass the configuration via the
command-line argument:
      --additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'
Co-authored-by: songshanhu07 <1763685535@qq.com>

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: songshanhu07 <1763685535@qq.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 19:28:11 +08:00
wangxiyuan cb341c7bcd
[CI] Fix PD job (#1129)
Fix e2e test for Pd job

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-09 16:34:41 +08:00
Yikun Jiang e63fc6f280
Init vLLM Ascend maintainers info (#1124)
### What this PR does / why we need it?
As plus of https://github.com/vllm-project/vllm-ascend/pull/1070, this
patch adds `Nominating and Removing Maintainers` section (reference some
design from [PyTorch
Governance](https://docs.pytorch.org/docs/stable/community/governance.html))

Below are key info about existing maintainers:

## @wangxiyuan: 
- Super active and high-quality code reviewer: [450+ PRs
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Awangxiyuan).
- One of the top contributors: he has actively contributed [50+ commits
](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Awangxiyuan+)
with good quality, and he dares to [refactor the
code](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Awangxiyuan+is%3Aclosed+refactor),
which also shows his deep understanding of vLLM and vLLM Ascend.
- He leads the [[RFC]: Hardware
pluggable](https://github.com/vllm-project/vllm/issues/11162) feature,
which made the vllm-ascend project possible.
- Active community involvement across the WeChat group, Slack and GitHub issues.
Involved in [150+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Awangxiyuan)
and helps users. He is also a speaker at the vLLM Beijing meetup, helping more
users understand vLLM Ascend.
- Release manager of
[v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1),
[v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1),
[v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2),
[v0.8.4rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc1),
[v0.7.3.post1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3.post1).

## @Yikun: 
- Highly active code reviewer: [190+ PRs
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AYikun),
especially helping new developers onboard.
- One of the top contributors with sustained contributions: [50+
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3AYikun+)
since the first day of vLLM Ascend.
- High-quality contributions around the vLLM compatibility guarantee; he
also maintains the [CI
](https://github.com/vllm-project/vllm-ascend/pull/1040) and the [test
framework](https://github.com/vllm-project/vllm-ascend/pull/730).
- Active community involvement across the local group and GitHub issues: involved in
[170+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3AYikun).
He is also the main organizer of the vLLM Beijing Meetup and a speaker at [PyTorch
Day China
2025](https://pytorchdaychina2025.sched.com/event/2401V/poster-session),
helping vLLM Ascend grow.
- Release manager of
[v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc2),
[v0.8.5rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.5rc1),
[v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3).

## @ganyi1996ppo 
- Highly active and high-quality code reviewer: [90+ PRs
reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Aganyi1996ppo);
he has a deep understanding of Ascend operators, can always find key
issues, understands the codebase deeply, and shows good code quality and
sound judgement.
- Major, high-quality contributions: [10+
commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Aganyi1996ppo).
- He is the main contributor of [Custom AscendC op
support](https://github.com/vllm-project/vllm-ascend/pull/371),
[Deepseekv3 performance
optimization](https://github.com/vllm-project/vllm-ascend/pull/598).
- Community involvement: involved in [11+ issues, helping
users](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Aganyi1996ppo),
and shared a [custom ops
topic](https://www.bilibili.com/video/BV1Z25az3EqS/?share_source=copy_web&vd_source=72ef9c665af5f2f1370abe26ce1f719f&t=1342)
at the vLLM Ascend weekly meeting.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-09 16:32:58 +08:00
Shanshan Shen d2f87ed9cc
[Patch] Remove `spec_decode.metrics` patch (#1016)
### What this PR does / why we need it?
Remove the `spec_decode.metrics` patch, as this has been resolved in
https://github.com/vllm-project/vllm/pull/16983 (included in vllm
`v0.9.0`).

Before: returns a CUDA event recording when the copy is complete.
After: returns a device event (NPU event for vllm-ascend) recording when the
copy is complete.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-09 15:05:11 +08:00
yiz-liu 6003afa6d2
[BugFix] Fix data parallel (#940)
### What this PR does / why we need it?
With this PR, we can migrate to the native `data_parallel.py` in vllm
examples and remove the version in vllm-ascend.

At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable
difficulties; therefore, we must employ a temporary workaround and
manually specify the device.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-09 14:08:18 +08:00
Shanshan Shen eec6068187
[Bugfix] Set `ACL_OP_INIT_MODE` env var default to `0` (#1123)
### What this PR does / why we need it?

Set `ACL_OP_INIT_MODE` env var default to `0`, since vllm-ascend may
have problems in some scenarios when setting it to `1`.

Plus, the guide https://github.com/vllm-project/vllm-ascend/issues/734
has also been updated.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-09 14:07:37 +08:00
Yikun Jiang 4976b48b98
[Build] Move numba/quart to requirments and update DS baseline and sync graph typo fix (#1121)
### What this PR does / why we need it?
1. The dependency was introduced by
https://github.com/vllm-project/vllm-ascend/pull/874
- Move numba/quart from requirements-dev to requirements
- Align pyproject.toml with requirements

2. This patch also fixes the deepseek accuracy baseline, which
https://github.com/vllm-project/vllm-ascend/pull/1118 did not address.
According to https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite the
gsm8k score is about `41.1`.

3. This also sync the vLLM upstream changes:
eaa2e51088

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
vllm ascend test (basic workflow)
vllm longterm test (spec decode)

Closes: https://github.com/vllm-project/vllm-ascend/issues/1120

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-08 22:33:37 +08:00
zzzzwwjj f1543d5e0d
[bugfix] fix deeepseek accuracy (#1118)
### What this PR does / why we need it?
Fix deepseek accuracy in the mixed-parallel case.


Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-07 21:11:36 +08:00
wangxiyuan c8742146d3
[CherryPick] Add unpadded Qwen2.5-VL for verl scenario (#1095)
Add unpadded Qwen2.5-VL for verl scenario.

When using vllm-ascend in the verl scenario, set `USE_OPTIMIZED_QWEN2_5_VL`
(default `1`) to `0` to use the unpadded Qwen2.5-VL and avoid errors (see the example below).
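For example; the serve command and model path here are illustrative, only the env var comes from this PR:

```bash
# Fall back to the unpadded Qwen2.5-VL implementation for the verl scenario
export USE_OPTIMIZED_QWEN2_5_VL=0
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --dtype bfloat16
```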

This is cherry-picked from 0.7.3-dev

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
2025-06-07 19:45:46 +08:00
linfeng-yuan b80a484864
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE (#1112)
### What this PR does / why we need it?
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI passed

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-07 19:45:33 +08:00
TaoYu Chen 20dedba5d1
Add qwen2.5 vl multimodal feature for vllm-ascend v1 (#736)
### What this PR does / why we need it?

The current vllm-ascend does not support multimodal models in
vllm-ascend v1 yet, so I changed the `model_runner_v1.py` file to use the
MRoPE feature, among other things, to support this. It is still not
perfect, since the Ascend operator does not support `window/full attn`
to reduce Memcpy operations, so it would run out of memory if the input
embedding is too large. For that reason we can't use `self._profile_multimodal()` for
profiling, since it uses a big dummy input (i.e. images) as the multimodal
input.

Fixes: https://github.com/vllm-project/vllm-ascend/issues/514

### Does this PR introduce _any_ user-facing change?

No, this feature does not require any user-facing change.

### How was this patch tested?

I tested this offline on my 910B3 machine with my own fork, and it works
well.

---------

Signed-off-by: cty <ctynb@qq.com>
2025-06-07 16:53:19 +08:00
zxdukki 87ebaef4e4
[perf]: support dual-batch overlap(dbo) for deepseek (#941)
### What this PR does / why we need it?
Based on the dual-batch overlap design proposed by the DeepSeek team and
the implementation of fused MoE in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for deepseek+MLA on
Ascend NPU. We split the model's input batch into two micro-batches and
then overlap the comp/comm ops in the attention and MoE layers using two
streams to improve performance. Our approach can be easily extended
when adding dispatch/combine communications for the MoE layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In our opinion, this is beneficial for arranging the order of
executing different ops and thus avoiding contention for
computation/communication resources.

ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)

### Does this PR introduce _any_ user-facing change?
Adds an env variable "VLLM_ENABLE_DBO". Users can enable DBO by
setting "VLLM_ASCEND_ENABLE_DBO=1" (a minimal sketch follows).
See /examples/offline_dualbatch_overlap_npu.py for more info.
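A minimal sketch of enabling it before launching the server; the launch flags below are trimmed from the benchmark command later in this description and are illustrative:

```bash
export VLLM_ASCEND_ENABLE_DBO=1
python -m vllm.entrypoints.openai.api_server \
    --model=DeepSeek-R1-W8A8 \
    --trust-remote-code \
    -tp=16
```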

### How was this patch tested?

This patch can be tested with vllm-0.9.0 using its online service with
benchmark tests. We have decoupled the DBO functionality from vLLM, and it
should run without any modification to the vLLM code (although some
modifications would be better implemented in vLLM).



Any advice/discussion is welcome.

### Performance Benchmark

We ran the benchmark_serving script of vLLM to test the performance
after enabling dual-batch overlap.

`python -m vllm.entrypoints.openai.api_server \
 --model=DeepSeek-R1-W8A8 \
 --trust-remote-code \
 --distributed-executor-backend=mp \
 -tp=16 \
 --port 8006 \
 --max-num-seqs 390 \
 --max-model-len 32768 \
 --max-num-batched-tokens 65536 \
 --block-size 128 \
 --compilation_config 0 \
 --gpu-memory-utilization 0.90 \
 --disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`

and run benchmark with the parameters of :
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`

1. test with the version using allgather+allreduce in Ascend 910B (tp16
ep16 + deepseek r1 w8a8)

2. test with the version using alltoall: 

prefill qps: 0.90 -> 1.01
Mean TTFT:8226->7432ms

The overlap approach when using alltoall communication can be further
optimized by overlapping micro-batch1's moe comp with micro-batch2's
dispatch a2a comm

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-07 16:46:58 +08:00
sdmyzlp 3640c60b0e
Avoid unfused Transpose in DeepSeekV3 EP256 MoE layer (#1091)
### What this PR does / why we need it?

View optimization in torchair (on by default for a Transpose with any of
its axes being 1) prevents the weight Transpose from being fused with the later
GroupedMatmul, which decreases the performance of the MoE layer when expert
parallelism equals the total number of experts (e.g. EP256 for DSKv3).
Add an option to solve this problem by disabling the optimization.

### Does this PR introduce _any_ user-facing change?

Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`,
defaulted to `True`.
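To disable the view optimization when hitting this unfused-Transpose issue, a launch could look like the following; the model path is illustrative:

```bash
vllm serve /path/to/DeepSeek-V3 \
    --additional-config '{"torchair_graph_config": {"enabled": true, "enable_view_optimize": false}}'
```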

### How was this patch tested?

Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-07 14:28:20 +08:00
Yikun Jiang 8d00775fce
[SpecDecode][CI] Set default values to fix spec decode and fix multicard CI (#1109)
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid oom, we need to run the test in a single process

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed, espcecially multicards CI
- For spec decode test, long term CI passed

Closes: https://github.com/vllm-project/vllm-ascend/pull/1105

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-06-07 11:23:30 +08:00
weijinqian0 e9ada685ec
[CI]Moe alltoall communication optimization (#1067)
[CI]Moe alltoall communication optimization
The DeepSeek V3/R1 model has 256 routing experts. During parallel
inference, if one EP rank is heavily loaded, the overall communication
and computation time is slowed down; uneven load distribution is
therefore a weakness of parallel inference. The data volume in the
prefill phase is large, and both inter-card communication time and
computation time are closely tied to that data volume. A small,
non-linear precision loss can therefore be traded for a near-linear
performance improvement.

During parallel inference, communication involves global
synchronization. As a result, a lightly loaded card finishes its
calculation first and waits for the most heavily loaded card to finish,
so an unbalanced load lets the busiest card dominate the overall time.
Significant performance gains can be achieved by discarding a small
number of tokens, which is unacceptable in precision-sensitive
scenarios; however, similar to quantization, it trades an acceptable
precision loss for performance in some scenarios. In addition, the
trade-off between performance and precision can be tuned by configuring
the proportion of discarded tokens.

The test was performed on A3. The batch size is 8 (B), the prompt length is
3.5K tokens (S), and the parallel configuration is: AttnDP=2,
AttnTP=8, MoeTP=1, and MoeEP=16. In this scenario, we got a 10%-15%
performance gain.

Plus, in the next version we'll have an alltoallv MoE.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-06-07 10:15:56 +08:00
Li Wang a2552e10e4
[Worker][V1] Support sleep mode for v1 (#1084)
### What this PR does / why we need it?
 Support sleep mode for v1

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 21:54:02 +08:00
wangxiyuan 0395ab30be
[Doc] Add graph mode user doc (#1083)
Add graph mode user guide doc.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 21:14:34 +08:00
ApsarasX 9a4eb94ca9
[Misc] Adjust the default profiler configuration (#1097)
### What this PR does / why we need it?
When profiling, it is often necessary to disable the call stack to
reduce profiling overhead, and adjust the profiler_level to level1 to
obtain more detailed operator and communication information.

Therefore, it is recommended to modify the default profiling
configuration.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-06-06 20:25:59 +08:00
Shanshan Shen 5d0e9fd19a
[Misc] Add `ACL_OP_INIT_MODE` env var and set default to `1` (#597)
### What this PR does / why we need it?
Fix the bug in torch 2.5.1 that raising segment fault when enable
`pin_memory` while creating a tensor using `torch.tensor`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-06 20:22:51 +08:00
Li Wang 11a7df4270
[ModelRunner] Support embedding inputs (#916)
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```bash
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```bash
llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples

### How was this patch tested?
CI passed with new added/existing test.
and I have tested with the example script in this PR, and the output
looks good:
```bash

[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]

[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the

Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs

Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon. 

The sun has a diameter of

------------------------------
```

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 20:21:13 +08:00
NeverRaR c7f1c59911
feat: support compile multiple batch graph (#1085)
### What this PR does / why we need it?

Support compiling multiple batch graphs with different code objects to avoid
cache invalidation.

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --no-enforce-eager \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config": {"enabled": true,"use_cached_graph": true,"graph_batch_sizes": [8,16,24]},"ascend_scheduler_config": {"enabled":true,"chunked_prefill_enabled":false},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.95 &> run.log &
disown
```

Signed-off-by: boying <897013703@qq.com>
2025-06-06 20:17:51 +08:00
Mengqing Cao c46632439a
[Bugfix][DP] Add with_prefill_across_dp to AscendMetadata to fix dp (#1094)
### What this PR does / why we need it?
Add `with_prefill_across_dp` to AscendMetadata to fix dp

This PR fixes the bug introduced by #1012, which added an arg
`with_prefill_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-06 19:20:33 +08:00
hahazhky 0b12c2acf7
[Kernel] Remove cumsum in groupedmatmul (#987)
### What this PR does / why we need it?
Remove the cumsum operator in MoE to improve performance.

### How was this patch tested?
it should be tested on a case with mc2 operator and graph mode enabled

Signed-off-by: zhky <hahazhky@163.com>
Co-authored-by: 洪炜杰 <hongweijie1@huawei.com>
2025-06-06 19:17:27 +08:00
wangxiyuan dab19d5dca
[BugFix] Fix ascend config check (#1092)
Fix the ascend config check logic:
1. Refactor check_ascend_config to make it clear:
    - torchair graph should not work with enforce_eager=True
    - aclgraph should not work with torchair graph
2. Add refresh config for the RLHF case
3. Fix a typo in the model runner
4. Change the expert_tensor_parallel_size default to 0 to keep the same
behavior as before

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 18:54:37 +08:00
wangxiyuan 973f993a13
[Misc] fix initialize_kv_cache (#1102)
The KV cache manager has been changed by
f8a1a2d108.

This PR adapts the change in vllm-ascend to make CI happy.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 16:46:23 +08:00
wangxiyuan c94afd79ce
[Doc] Update the description for env (#1079)
Add descriptions for the env variables to make them clearer for users

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 09:48:43 +08:00
depeng1994 6b094a2bd4
[ModelRunner]Add profile execute duration observation (#1013)
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.

**The model_runner_v1.py file only changed 5 lines, all of which are
`ProfileExecuteDuration()` calls; nothing else was changed, although more
changes appear in the diff due to an alignment issue.**

### Does this PR introduce _any_ user-facing change?
Use the env `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature (see the sketch below).
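A minimal sketch, assuming a truthy value is all that is needed to switch the observation on; the launch command is illustrative:

```bash
export VLLM_MODEL_EXECUTE_TIME_OBSERVE=1
vllm serve /path/to/DeepSeek-R1-W8A8 --trust-remote-code
# Per-stage durations are then printed when the blocking pop_captured_sync API is called.
```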

### How was this patch tested?

Tested in deepseek model,Print like this:
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms

```

---------

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-06 09:29:34 +08:00
David9857 78431b3469
[perf]Support MOE Multi-stream in Deepseek (#947)
### What this PR does / why we need it?
Support MoE inner multi-stream for Deepseek.
This feature requires graph mode with MC2 enabled.

---------

Signed-off-by: David9857 <985700846@qq.com>
2025-06-05 23:39:38 +08:00
sherie 908a851a77
optimize the funtion of computing topk and topp in sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
2025-06-05 16:42:18 +08:00
wangxiyuan e1ab6d318e
[Misc] Refactor additional_config (#1029)
More and more config options are being added to additional_config. This PR
provides a new AscendConfig to manage these config options in an easier
way, making the code cleaner and more readable.

This PR also adds the `additional_config` doc for users.

Added test_ascend_config.py to make sure the new AscendConfig works
as expected.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
zhangxinyuehfad 7737aaa40f
[CI] Add accuracy test for Qwen2.5-VL-3B-Instruct (#766)
### What this PR does / why we need it?
Add accuracy test for Qwen2.5-VL-3B-Instruct


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-05 15:09:20 +08:00
Li Wang b4cb0eecb6
[CI] Hotfix on benchmark results path (#1076)
### What this PR does / why we need it?
Fix benchmark results path

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 12:53:46 +08:00
Yikun Jiang fd136e6762
Add vLLM Ascend project governance docs (#1070)
### What this PR does / why we need it?
Add vLLM Ascend project governance and first contributors docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Closes: https://github.com/vllm-project/vllm-ascend/issues/828
Closes: https://github.com/vllm-project/vllm-ascend/issues/929

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 11:56:51 +08:00
Li Wang 31dd471574
[CI] Add workflow_dispatch and use main benchmarks directly (#1071)
### What this PR does / why we need it?

This is for the benchmark iteration, which will change the benchmark
scripts while checkouting each commit. So we need ensure the benchmark
scripts always available.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 10:29:30 +08:00
Yikun Jiang 9e855b70be
Adjust concurrency group for each npu workflow (#1068)
### What this PR does / why we need it?
Adjust the concurrency group for each NPU workflow:
- pd and benchmarks share static-08-01, so only one of those jobs can run
at a time
- for the other jobs, one PR/schedule should have only 1 running job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 09:17:04 +08:00
Mengqing Cao afc4c0cd03
[Bugfix] Fix deepseek percision issue and add acc ci for it (#905)
### What this PR does / why we need it?
Fix the deepseek precision issue on V0 and add an accuracy CI for it
Fixes https://github.com/vllm-project/vllm-ascend/issues/1062
### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-04 20:26:44 +08:00
NeverRaR da9acfca60
feat: support data parallel for deepseek (#1012)
### What this PR does / why we need it?
feat: support data parallel for deepseek

### Does this PR introduce _any_ user-facing change?
Yes, support dp for deepseek

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server
--model=/path/to/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --block-size 128 \
    -O 0 \
    --no-enable-prefix-caching \
--additional-config
'{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}'
\
    --gpu-memory-utilization 0.95 &> run.log &
disown
```

Signed-off-by: boying <897013703@qq.com>
2025-06-04 18:31:41 +08:00
Li Wang 517811449e
[CI] Re-enable sleep mode test and skip failure breaking CI (#990)
### What this PR does / why we need it?

- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream
[change](https://github.com/vllm-project/vllm/pull/18654)
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-04 16:24:16 +08:00
Li Wang eb2701e0b2
[CI] Remove workflow_dispatch and change schedule time (#1056)
### What this PR does / why we need it?

- Remove workflow_dispatch 
-  Change schedule time to 2:00 UTC+8
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-04 01:19:20 +08:00
Li Wang 06fb5a8d81
[CI][Bugfix] Upgrade escli to v0.2.1 to fix benchmark deps (#1055)
### What this PR does / why we need it?

Update escli-tool to v0.2.1 to fix a deps bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <858794774@qq.com>
2025-06-04 01:03:56 +08:00
Li Wang 76dacf3fa0
[CI][Benchmark] Optimize performance benchmark workflow (#1039)
### What this PR does / why we need it?

This is a follow-up patch to #1014 with some convenience optimizations:
- Set a cached dataset path for speed
- Use PyPI to install escli-tool
- Add a benchmark results conversion script to produce a developer-friendly
result
- Patch `benchmark_dataset.py` to disable streaming load over the
internet
- Add more trigger modes for different purposes: `pr` for debugging,
`schedule` for the daily test, `dispatch` and `pr-labled` for manual testing
of a single (current) commit
- Disable the latency test for `qwen-2.5-vl` (this script does not support
multi-modal yet)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-03 23:38:34 +08:00
wangxiyuan 543380ceae
[CI] Add merge conflict label job (#1050)
Add a bot to label merge conflicts; it helps developers and maintainers
keep code review and updates clear.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 17:32:31 +08:00
Yikun Jiang f24375f318
Enable accuracy test for PR labeled with "*accuracy-test" (#1040)
### What this PR does / why we need it?
This PR enables the accuracy test for PRs labeled with "*accuracy-test" and
for workflow_dispatch.

Only one model test runs for each test type to reduce execution time.

- The dense test costs about `25mins` to complete (gsm8k 7mins, ~mmlu
3h24mins,~ cEval 18mins)
- The vl test costs about `40mins` to complete


In the future, we might consider enabling all job tests as a nightly scheduled
job.

Below are the main changes:
- the dense/vl accuracy test will be triggered by labeling
`accuracy-test` and `ready-for-test`
- the dense accuracy test will be triggered by labeling
`dense-accuracy-test` and `ready-for-test`
- the vl accuracy test will be triggered by labeling `vl-accuracy-test`
and `ready-for-test`
- accuracy test will also be triggered by workflow_dispatch
- Support V1 and V0 for qwen and V0 for VL

For the PR test we also generate a summary in the test summary.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed with accuracy-test label
- Preview:
https://github.com/vllm-project/vllm-ascend/actions/runs/15407628722?pr=1040

Closes: https://github.com/vllm-project/vllm-ascend/pull/953

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2025-06-03 15:38:13 +08:00
Shanshan Shen 068c3a0167
[Bugfix] Add verification for `quant_action.choices` to avoid `TypeError` (#1046)
### What this PR does / why we need it?

When I run vllm-ascend, I get this error msg:

```bash
Traceback (most recent call last):
  File "/home/sss/software/miniconda3/envs/vllm-v1/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/main.py", line 50, in main
    cmd.subparser_init(subparsers).set_defaults(
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/serve.py", line 101, in subparser_init
    serve_parser = make_arg_parser(serve_parser)
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 254, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
  File "/home/sss/github/vllm-project/vllm/vllm/engine/arg_utils.py", line 1582, in add_cli_args
    current_platform.pre_register_and_update(parser)
  File "/home/sss/github/vllm-project/vllm-ascend/vllm_ascend/platform.py", line 80, in pre_register_and_update
    if ASCEND_QUATIZATION_METHOD not in quant_action.choices:
TypeError: argument of type 'NoneType' is not iterable
[ERROR] 2025-06-03-02:53:42 (PID:6005, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```

This is because the `choices` attribute in `quant_action` can be `None`
and we don't check it.

```bash
# quant_action
_StoreAction(option_strings=['--quantization', '-q'], dest='quantization', nargs=None, const=None, default=None, type=<class 'str'>, choices=None, required=False, help='Method used to quantize the weights. If `None`, we first check the\n`quantization_config` attribute in the model config file. If that is\n`None`, we assume the model weights are not quantized and use `dtype` to\ndetermine the data type of the weights.', metavar=None)
```

Thus, I have added a check for `choices` to handle the scenario of
`choices=None`.
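
For illustration, a minimal sketch of such a guard (names are illustrative, not the exact vllm-ascend code):

```python
import argparse

ASCEND_QUANTIZATION_METHOD = "ascend"  # hypothetical constant name

def add_ascend_choice(quant_action: argparse.Action) -> None:
    # argparse stores `choices=None` when no choices were configured,
    # so check for None before using the `in` operator.
    if quant_action.choices is not None and \
            ASCEND_QUANTIZATION_METHOD not in quant_action.choices:
        quant_action.choices = [*quant_action.choices, ASCEND_QUANTIZATION_METHOD]
```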

### Does this PR introduce _any_ user-facing change?
yes, vllm server with ascend quantization works now.

### How was this patch tested?
by running the `vllm serve --quantization ascend` command.

Related: https://github.com/vllm-project/vllm/issues/19004

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 11:44:45 +08:00
Shanshan Shen 93860574bb
[ModelRunner][MultiModal] Remove legacy input mapper/processor from V0 (#951)
### What this PR does / why we need it?
Remove legacy input mapper/processor from V0.

Find more details at
https://github.com/vllm-project/vllm-ascend/issues/673 and
https://github.com/vllm-project/vllm/pull/15686.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
Launch online service:

```bash
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 32768 \
--max-num-batched-tokens 32768
```

Query the server:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'
```

Result:

```bash
{"id":"chatcmpl-619e70733ed148b3be3a0b6524ee0ef3","object":"chat.completion","created":1748226332,"model":"/home/sss/.cache/modelscope/hub/models/Qwen/Qwen2___5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"pro
```

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 11:32:03 +08:00
NINGBENZHE 6ec64a3f96
[bugfix] some bugs maybe fail to run (#896)
### What this PR does / why we need it?
Fix the bug where the graph mode is the same for P and D, and some other
bugs.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Covered by the end-to-end tests.

Signed-off-by: ningbenzhe1 <ningbenzhe@huawei.com>
2025-06-03 11:07:33 +08:00
Yikun Jiang 92bc5576d8
Skip benchmarks/** in vllm ascend test (#1041)
### What this PR does / why we need it?
Skip benchmarks/** in vllm ascend test to reduce CI cost

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-01 19:01:26 +08:00
NeverRaR 507ae627ca
feat: support compile torchair graph while warming up (#839)
### What this PR does / why we need it?
feat: support compile torchair graph while warming up

Signed-off-by: boying <897013703@qq.com>
2025-05-31 06:03:03 +08:00
Li Wang d9fb027068
[CI] Add benchmark workflows (#1014)
### What this PR does / why we need it?

Add benchmark workflows

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run locally

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-30 22:42:44 +08:00
yiz-liu 5a1689fc64
[Fix] Fix update_aclgraph_sizes when running MoE models (#913)
### What this PR does / why we need it?
Fix update_aclgraph_sizes when running MoE models.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-30 15:17:11 +08:00
XWFAlone 3442fbdb23
[1/N][UT][v1 MTP] add basic v1 mtp features (#890)
### What this PR does / why we need it?
add basic v1 mtp features
please merge it after
https://github.com/vllm-project/vllm-ascend/pull/874 and
https://github.com/vllm-project/vllm-ascend/pull/844.

### Does this PR introduce _any_ user-facing change?
Now we support basic v1 MTP, limited to TP only, eager mode, and k=1.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally.

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
2025-05-30 08:59:58 +08:00
wangxiyuan 5903547d09
[doc] add 0.7.3.post1 release note (#1008)
Add release note for 0.7.3.post1
Add the missing release note back for 0.7.3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-29 17:38:34 +08:00
22dimensions c464c32b81
add doc for offline quantization inference (#1009)
Add an example for offline inference with a quantized model.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-29 17:32:42 +08:00
zouyida2052 05a471001b
bugfix for qwen2_5_vl (#805)
### What this PR does / why we need it?
The interface of qwen2.5vl changed from column linear to qkv linear,
which broke our weight padding function, so we optimized the
split_qkv function to fix this bug.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
with CI

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-05-29 17:20:39 +08:00
Mengqing Cao a93bed4535
[aclgraph] implentment NPUPiecewiseBackend to enable aclgraph (#836)
### What this PR does / why we need it?
1. Implement `NPUPiecewiseBackend` to enable aclgraph
2. Enable aclgraph by default in V1, but raise an error when running
deepseek and a warning when running models other than qwen

### How was this patch tested?
CI passed with the new UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 11:58:26 +08:00
Mengqing Cao cc74b97f74
[Bugfix][V1] Fix deepseek with v1 (#958)
### What this PR does / why we need it?
Fix deepseek with v1. This error was introduced by
https://github.com/vllm-project/vllm-ascend/pull/945, and this PR fixes
the block table of MLA.

### How was this patch tested?
CI passed with the newly added test.

Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-05-29 11:57:43 +08:00
ApsarasX e3c7f71462
[Perf] Refactor tensor disposal logic to reduce memory usage (#966)
### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580
https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang, the `dispose_tensor` function, and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid to generate `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147
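
For illustration, a minimal sketch of what a `dispose_tensor`-style helper can look like (an assumption, not the exact sglang or vllm-ascend implementation):

```python
import torch

def dispose_tensor(x: torch.Tensor) -> None:
    # Re-point the tensor at an empty buffer so the original storage can be
    # released even while other code still holds a reference to `x`.
    x.set_(torch.empty(0, dtype=x.dtype, device=x.device))
```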

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-29 11:48:26 +08:00
Mengqing Cao 6eddbd2521
[CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggregate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
wangxiyuan f6e5decc10
[CI] upgrade to vllm 0.9.0 (#959)
Upgrade to vllm 0.9.0.
0.8.5 will not be supported any more.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 21:18:41 +08:00
wangxiyuan e2a0c19cea
[CI] Refactor CI (#952)
1. remove some useless test funcs and files
2. fix the format.sh problem
3. enable full tests for singlecard and multicard
4. move long term tests to the long_term folder. This kind of test only
runs when labeled and in the daily test. Includes: spec decode and accuracy test

## After refactor:
There are 4 test modules
- `singlecard`: contains the test running on one NPU. It'll be run for
each PR and daily test.
- `multicard`: contains the test running on multi NPUs. It'll be run for
each PR and daily test.
- `long_term`: contains the tests that cost much time (currently `spec
decode` and `accuracy` tests). It'll be run for PRs labeled with
`long-term-test` and in the daily test.
- `e2e`: contains the tests for the doc and pd features. It'll be run for
PRs labeled with `pd-test` and in the daily test.

## Todo:
1. Some tests are skipped; they should be fixed and re-enabled in the
future.
2. The pyhccl test for multicard doesn't work at all. It should be enabled
as well.
3. ensure long-term-test pass by daily test.

### Known issue
Now, the `ready` label is required to start the pd test or long term test. And
when `long-term-test` or `pd-test` is labeled after another one, the previously
labeled test will be re-run. So the labeled tests should be run in
the following steps:

1. decide which tests need to run, then label them: `long-term-test` or
`pd-test` or both.
2. add the `ready-for-test` label, then the tests will be run.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 06:31:35 +08:00
Angazenn 9f5ab59e30
[WIP][BugFix]Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 (#961)
### What this PR does / why we need it?
This PR fixes accuracy issues incurred by the code that adapts to
`FusedMoEParallelConfig` in vLLM 0.9.0. The `tp_size` used to
split weights is wrongly passed. The root cause is that the vLLM community
and vLLM-Ascend use different methods to decide whether to use
Expert Parallel.

vLLM:
vLLM uses a flag `enable_expert_parallel` to indicate whether to use EP
and uses the following code to decide `ep_size`:
```
        use_ep = (dp_size_ * tp_size_ > 1
                  and vllm_parallel_config.enable_expert_parallel)

        dp_size = dp_size_
        dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
        tp_size, tp_rank = flatten_tp_across_dp(dp_rank)

        if not use_ep:
            return FusedMoEParallelConfig(tp_size=tp_size,
                                          tp_rank=tp_rank,
                                          dp_size=dp_size,
                                          dp_rank=dp_rank,
                                          ep_size=1,
                                          ep_rank=0,
                                          use_ep=False)
        # DP + EP / TP + EP / DP + TP + EP
        assert use_ep
        # In EP, each device owns a set of experts fully. There is no tensor
        # parallel update tp_size, tp_rank, ep_size and ep_rank to reflect that.
        ep_size = tp_size
        ep_rank = tp_rank
        return FusedMoEParallelConfig(tp_size=1,
                                      tp_rank=0,
                                      dp_size=dp_size,
                                      dp_rank=dp_rank,
                                      ep_size=ep_size,
                                      ep_rank=ep_rank,
                                      use_ep=True)
```

vLLM-Ascend:
vLLM-Ascend uses `etp` to specify Tensor Parallel in MoE.
```
            self.ep_size = get_ep_group().world_size
            self.tp_size = get_etp_group().world_size
            self.dp_size = (dp_size if dp_size is not None else
                            get_dp_group().world_size)
```

So there will be conflicts if we simply combine these codes together.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-27 15:16:17 +08:00
Shuqiao Li 01e3d59eae
add workflow to build and release wheel (#775)
### What this PR does / why we need it?

This is a continuing work of #716.
This PR adds a workflow to build and release wheels, and also releases the
source to PYPI.
We have 3 conditions to trigger the workflow:

1. PR to `main` and `*-dev`
2. push to `main` and `*-dev`
3. push tag with name of `v*`

Release to PYPI will only be done under condition 3. Under condition 1
and 2, it will generate .tar.gz and build .whl, upload to github
artifacts but will not release.

update:
Will build .whl and upload to github artifacts with scheduled task.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
All triggered conditions are well tested with my fork repo.

---------

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-26 14:18:26 +08:00
Mengqing Cao a0c3e9ba50
[Bugfix] Adjust inputbatch to be compatible with latest vllm (#945)
Adjust inputbatch to be compatible with the latest vllm, as the kvcache group
feature has been redone in https://github.com/vllm-project/vllm/pull/18593

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-26 10:33:28 +08:00
Angazenn 1f9fb869ad
[BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897)
### What this PR does / why we need it?
This PR fixes two accuracy bugs incurred by PR #819 when running
deepseekv3 series models:
1. #819 adds `all_to_all` communication in quantized cases, but
`all_gather` && `reduce_scatter` are removed in both quantized and
unquantized cases. When running unquantized deepseekv3 models with
`ep_size == world_size`, the moe modules fail to communicate. Therefore,
this PR adds `all_to_all` communication in the unquantized situation to
solve this accuracy issue.
2. Use `ep_size` rather than `dp_size` to decide whether to use
`all_to_all` in moe.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-24 14:29:36 +08:00
yiz-liu 17f05b1089
[Feature] Add CustomQwen3MoeForCausalLM model (#925)
Tweak packed_modules_mapping to support W8A8 weights.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-23 15:50:48 +08:00
jiangpeng df58fb80ee
Spec decode support for V1 Engine (#874)
### What this PR does / why we need it?
Make spec decode work with the V1 Engine.
- Currently, Ascend does not support the triton kernel. PyTorch is used
to rewrite the `rejection_sampler.py` triton kernel. However, PyTorch is
not as good as Triton, so Ascend C will be used to implement this
function in the future.
- Currently, spec decode supports only the ngram algorithm. The eagle
algorithm needs to be further adapted.
### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and
`tests/sample/test_rejection_sampler.py`, covering the base function of the
rejection sampler and the e2e function of spec decode.

Signed-off-by: ponix-j <657511300@qq.com>
2025-05-23 14:25:46 +08:00
Angazenn a970b27e2d
[WIP][Perf]remove unnecessary padding before MLA V1 prefill (#917)
### What this PR does / why we need it?
Currently, the implementation for MLA V1 pads q, k, v to `head_dim` 256
to conform to the early MLA kernel. But the new MLA kernel supports
`head_dim` values that can't be divided by 128, so we can remove those
unnecessary paddings to boost performance.
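
A hypothetical illustration of the padding that gets removed (the 192 head_dim and shapes are examples, not the exact kernel constraint):

```python
import torch
import torch.nn.functional as F

q = torch.randn(4, 8, 192)                  # (tokens, heads, head_dim)
# Old path: zero-pad the last dim up to 256 for the early MLA kernel.
q_padded = F.pad(q, (0, 256 - q.size(-1)))
# New path: the updated kernel accepts head_dim values not divisible by 128,
# so q can be passed through without padding.
```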

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-23 14:14:06 +08:00
ttanzhiqiang dc6172efd3
update attention nz and mla nz(Improve TPOP 6ms performance) (#909)
### What this PR does / why we need it?
Update attention nz and mla nz modules to improve TPOP performance by 6ms:
- Convert W_UV and W_UK_T to NPU format in mla_v1.py
- Convert layer.weight to NPU format in w8a8.py

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-05-23 10:18:10 +08:00
Jade Zheng 7153d8890b
[Feature] Impl v1 disaggregated prefill in ascend scheduler (#852)
Implement save kv cache logic for v1 disaggregated prefill in ascend
scheduler

This PR adds support for saving kv cache in the ascend scheduler, which
is part of the v1 disaggregated prefill design. The load functionality
is not yet implemented.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-05-23 10:15:29 +08:00
rjg-lyh b434f37b46
[V1] Revert the default value of enable_chunked_prefill in additional… (#935)
### What this PR does / why we need it?
Revert the default value of enable_chunked_prefill to 'False' in
additional_scheduler_config. In engine v1, enable_chunked_prefill is
forcibly set to True in VllmConfig, which causes it to be perceived as
True in check_and_update_config(). As a result, when the v0 scheduler is
enabled, the chunked prefill feature remains active, leading to the
failure of the v0 scheduler and causing it to fall back to the native v1
scheduling logic.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-05-23 10:06:50 +08:00
yangpuPKU 46df67a5e9
[bugfix] Improve log level and info for custom ops build (#937)
### What this PR does / why we need it?
Fix the bug of #703, where vllm wrongly raised the ERROR: Failed to
import vllm_ascend_C: No module named 'vllm_ascend.vllm_ascend_C'. The
format for reporting vllm_ascend_C import failure is unified as a warning:
("Failed to import vllm_ascend_C:%s", e).

### Does this PR introduce _any_ user-facing change?
No

---------

Signed-off-by: yangpuPKU <604425840@qq.com>
2025-05-23 10:05:57 +08:00
yupeng 8ddc0a1002
[DOC] mark v1 multi-lora functional (#932)
### What this PR does / why we need it?
Update feature support for lora

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
preview

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-05-22 19:53:14 +08:00
yupeng 0f53b138f6
[V1][LoRA][Test] V1 Engine LoRA support & e2e test (#893)
### What this PR does / why we need it?

Add V1Engine LoRA support.
Add LoRA e2e test on single card and multiple cards.

### Does this PR introduce _any_ user-facing change?
support lora for V1

### How was this patch tested?

CI passed with new added test

---------

Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: paulyu <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: jesse <szxfml@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-05-22 19:20:51 +08:00
Mengqing Cao 7aa4f85f10
[Bugfix][kvcache] revert multiple kv cache groups (#923)
Revert multiple kv cache groups related changes as this feature is
reverted in vllm https://github.com/vllm-project/vllm/pull/18459

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-22 15:15:33 +08:00
rjg-lyh b4d6672d01
[BugFix] Fix chunked prefill bugs in engine v1 (#844)
### What this PR does / why we need it?
Fix the bugs when running the deepseek model in engine v1.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-05-22 10:33:50 +08:00
yiz-liu a73bd6caf4
[Fix] Set div_mode to False and fix view_as position (#912)
### What this PR does / why we need it?

Set div_mode to False to use the ACLNN kernel, which is crucial when
using ACL Graph.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-22 09:57:25 +08:00
hfadzxy 58b413752b
[Doc] Support XLM-RoBERTa-based and MiniCPM3 model (#820)
### What this PR does / why we need it?
Support XLM-RoBERTa-based and MiniCPM3 models.

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-21 15:44:54 +08:00
22dimensions d5401a08be
[DOC] update modelslim version (#908)
1. update modelslim version to fix deepseek related issues
2. add note for "--quantization ascend"

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-21 09:12:02 +08:00
Wan_Danfeng 5cf9ff18e9
[Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814)
### What this PR does / why we need it?

- According to https://github.com/vllm-project/vllm-ascend/issues/807,
we submit a pull request for a custom AscendC kernel for multi-step prepare
input.
- Also, a bug we found in multi_step_runner.py when using multi-step on the
V0 Engine is fixed.


### Does this PR introduce _any_ user-facing change?

no user-facing change


### How was this patch tested?
We add a unit test file and an offline inference file to test the custom
AscendC kernel. See test/ops/test_multi_step.py and
examples/offline_multi_step.py.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-05-20 09:31:30 +08:00
22dimensions 00e0243561
enable online serving quantization (#877)
For online serving, the "ascend" quantization method is not a choice
natively, so we need to add the "ascend" quantization method to the
quantization methods list so that users can enable quantization with the
"vllm serve --quantization ascend" command.

---------

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-17 17:36:04 +08:00
22dimensions a8730e7a3c
[Doc] update quantization docs with QwQ-32B-W8A8 example (#835)
1. replace the deepseek-v2-lite model with the more practical QwQ 32B model
2. fix some incorrect commands
3. replace the modelslim version with a more formal tag

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-17 15:25:17 +08:00
wangxiyuan 7326644513
[CI] Fix qwen2.5 vl CI failure (#888)
The [vllm
commit](67da5720d4)
changed the input and rotary position embedding for qwen 2.5 vl, which
broke CI. This PR quickly fixes the CI failure for qwen2.5 vl.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-17 05:13:32 +08:00
Mengqing Cao df16c4f2bc
[CI/UT] Ignore vllm/tests/test_vllm_port.py (#887)
Ignore `vllm/tests/test_vllm_port.py` in UT as it is not related to
vllm-ascend and is breaking CI.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-16 18:52:59 +08:00
Mengqing Cao 7a325b2e2d
[Bugfix][Model] Fix fusedmoe and make modelrunner_v1 compatible with latest vllm (#867)
### What this PR does / why we need it?
This PR fixes the CI failure caused by vllm changes.
1. add moe_config for fused_moe
2. adjust the kv cache group change from vllm; currently vllm-ascend
doesn't support this feature, so this is just a quick fix for backward
compatibility

fix: #872

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-16 12:14:55 +08:00
hfadzxy fd515cd60b
[Doc][BugFix]Fix Release Compatibility Matrix (#865)
### What this PR does / why we need it?
Fix Release Compatibility Matrix

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-15 15:38:38 +08:00
Angazenn 1e67089bc9
[BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819)
### What this PR does / why we need it?
1. This PR introduces native `all_to_all` communication operator to fix
`allgather` bugs when dp_size > 1. Besides, it adds a naive
implementation of force-load-balance when doing profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input
hidden_states with dtype `torch.int32`. This tensor occupies space of
`global_bs * seq_len * topk * hidden_size`, which might be very large as
`ep_size` grows. Therefore we need to disable this operator and use
original `swiglu` && `quantize`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-15 09:19:55 +08:00
wangxiyuan 68fb63428b
[CI] Patch torch.library.infer_schema for fused moe ops to fix CI (#854)
Make sure the PyTorch infer_schema check is patched before any case that
uses fused moe ops:
1. model registration
2. quantization loading
3. fused moe UT

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-14 19:49:09 +08:00
Yikun Jiang 508242425c
[CI][1/N] Add basic ci for PD disaggregation (#830)
### What this PR does / why we need it?
Add basic CI for PD disaggregation, and enable it when scheduled or
labeled with `module:pd`

- Updated `.github/actionlint.yaml` to add a new self-hosted runner
configuration: `linux-arm64-npu-static-8`.
- Introduced a new GitHub Actions workflow
`.github/workflows/vllm_ascend_test_pd.yaml` for PD disaggregation
testing:
- Scheduled to run daily at 23:00 UTC and triggered by pull request
label `module:pd`.
- Added steps for basic installation; other steps will be added in a
followup PR

Related: https://github.com/vllm-project/vllm-ascend/issues/841

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- No trigger by default
<img width="847" alt="image"
src="https://github.com/user-attachments/assets/23aa128f-526d-447f-91c8-8ebf6be8400f"
/>
- Trigger only if we tag with pd
<img width="930" alt="image"
src="https://github.com/user-attachments/assets/aef1caca-2029-48e8-a6e6-860136adcd37"
/>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-14 18:04:16 +08:00
Yikun Jiang 59e02502b1
[CI] Add e2e test frame work and doctest (#730)
### What this PR does / why we need it?
Add quickstart doctest CI

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
- CI passed
- Run `/vllm-ascend/tests/e2e/run_doctests.sh`
Related: https://github.com/vllm-project/vllm-ascend/issues/725

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-14 09:27:54 +08:00
wangxiyuan 857f489cbf
[CI] Patch torch.library.infer_schema for torch 2.5 backward compatibility (#837)
Patch torch.library.infer_schema for torch 2.5 backward compatibility

- Introduced a new module `patch_utils` under
`vllm_ascend/patch/worker/patch_common/`.
- Added a function `ascend_direct_register_custom_op` to handle custom
operator registration with backward compatibility for PyTorch < 2.7
(such as torch 2.5.1).
- Implemented type conversion logic for annotations to ensure
compatibility across different PyTorch versions.
- Registered the function `ascend_direct_register_custom_op` to
`utils.direct_register_custom_op`.

- Updated `__init__.py` to include `patch_utils` as the first import.
- Ensured `patch_utils` is available for use in other patch files and
skipped isort checks for `patch_utils` import.
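
A hedged sketch of the annotation-conversion idea (assumed behaviour for illustration; the real `patch_utils` module and `ascend_direct_register_custom_op` differ):

```python
from typing import Optional, Union, get_args, get_origin

def convert_annotation(annotation):
    # Older schema inference (e.g. torch 2.5.1) can choke on Optional[T] /
    # Union[T, None]; collapse them to the plain type before registration.
    if get_origin(annotation) is Union:
        non_none = [a for a in get_args(annotation) if a is not type(None)]
        if len(non_none) == 1:
            return non_none[0]
    return annotation

if __name__ == "__main__":
    assert convert_annotation(Optional[int]) is int
```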

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-14 09:20:55 +08:00
cxcxflying e564470338
[Attention][Kernel]moe support for llama4 and mllama4 (#740)
### What this PR does / why we need it?
moe support for llama4 and mllama4 in vllm-ascend

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
start sever:
python -m vllm.entrypoints.openai.api_server --model
/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct \
--max-num-seqs=256 \
--max-model-len=8192 \
--tensor-parallel-size=8 \
--block-size=128 \
--dtype bfloat16 \
--host=0.0.0.0 \
--port=8000 \
--gpu-memory-utilization=0.9 \
--trust-remote-code

client:
python online_server.py --model-path
/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct
--image-path /data/nfs/w60040464/cherry_blossom.jpg --docker-ip
7.242.108.253 --served-port 8000 --text "what is the content of this
image?"

result:
{'id': 'chatcmpl-2b709a5d2e1a4017991ec4ba8248686a', 'object':
'chat.completion', 'created': 1747056823, 'model':
'/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct',
'choices': [{'index': 0, 'message': {'role': 'assistant',
'reasoning_content': None, 'content': 'The image depicts a tower, likely
Tokyo Skytree, framed by branches of a cherry blossom tree. The tower is
white and has a distinctive shape, with a large sphere at the top and a
long, thin spire extending from it. The branches of the cherry blossom
tree are in the foreground, with pink flowers blooming on them. The
background is a clear blue sky.\n\n**Key Features:**\n\n* **Tower:**
White, spherical shape at the top, long thin spire\n', 'tool_calls':
[]}, 'logprobs': None, 'finish_reason': 'length', 'stop_reason': None}],
'usage': {'prompt_tokens': 2340, 'total_tokens': 2440,
'completion_tokens': 100, 'prompt_tokens_details': None},
'prompt_logprobs': None}

Signed-off-by: chenxu <chenxu68@huawei.com>
Co-authored-by: chenxu <chenxu68@huawei.com>
Co-authored-by: evian <eviantai@u.nus.edu>
2025-05-13 19:12:40 +08:00
hfadzxy 217211d8a3
[Misc][Doc] Add the latest stable release url (#826)
### What this PR does / why we need it?
 Add the latest stable release url

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-13 12:53:23 +08:00
rjg-lyh c6ac399091
[Bugfix] Fix the method of importing environment variables in DeepSee… (#817)
### What this PR does / why we need it?
Fix the method of importing environment variables in DeepSeek model to
support successful compilation via aclgraph.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-05-13 12:52:30 +08:00
wangxiyuan 6193ba679b
[CI] add codespell CI and fix format.sh (#827)
1. Fix format check error to make format.sh work
2. Add codespell check CI 
3. Add the missing required package for vllm-ascend.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-12 22:04:48 +08:00
whx 5998704c08
[BugFix] Fix ascend scheduler bugs. (#822)
This PR fixes two bugs in AscendScheduler:
1. When running with high concurrency, the length of the running queue may
exceed the limit of max_num_seqs.
2. When some requests are preempted and recomputation is activated, the
logic of computing new tokens is wrong.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-05-12 21:15:17 +08:00
yiz-liu 701b0fd95e
[Enhancement] Add padding for ACL Graph (#803)
### What this PR does / why we need it?
Add padding for ACL Graph and refactor graph batch size adjustments to
utils.py

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-12 20:26:22 +08:00
NeverRaR efabd722eb
feat: support torchair graph mode in v1 engine (#789)
### What this PR does / why we need it?
support torchair graph mode with v1 engine

---------

Signed-off-by: boying <897013703@qq.com>
2025-05-12 19:14:07 +08:00
hfadzxy 4a2505f81f
[accuracy test]Update cann version and huggingface-hub version for Qwen3 (#823)
### What this PR does / why we need it?
1. update cann version to 8.1.0 for multimodal
2. fix the huggingface-hub version to adapt to qwen3
3. change Qwen3-8B to Qwen3-8B-Base

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-05-12 19:12:48 +08:00
yiz-liu 5305a2ccf9
[Bugfix] Tweak distributed process group initialization and add dummy… (#816)
fix batch execution method to enable DP in V1

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-12 17:31:29 +08:00
Li Wang 4df1e99614
[CI] Re-enable `vllm-empty/tests/benchmarks` (#812)
### What this PR does / why we need it?
Since
[#17962](https://github.com/vllm-project/vllm/pull/17962?notification_referrer_id=NT_kwDOCexQHLUxNjM0MTM3OTEwNDoxNjY0ODE5NDg#event-17608938997)
has been merged, the vllm openapi server can now launch normally on python==3.10,
so we re-enable the related tests.

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-12 15:50:48 +08:00
Li Wang 8e4e791fcd
[CI] Add deepseek-v2-lite test (#631)
### What this PR does / why we need it?
Add deepseek-v2-lite test, part of #499 
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-12 14:59:17 +08:00
Li Wang cdece86f2c
[Bugfix] Add max_num_batched_tokens to InputBatch to make main CI pass (#806)
### What this PR does / why we need it?

1. Fix V1 error found by
[nightly_ci](https://github.com/vllm-project/vllm-ascend/actions/runs/14950004754/job/41998136610),
broken by [[v1] Pass BlockTable and KVCacheSpec to
AttentionMetadataBuilders
#17483](https://github.com/vllm-project/vllm/pull/17483), make
`InputBatch` parameter consistent with vllm.
2. Disable benchmark and fix it upstream.

### Does this PR introduce _any_ user-facing change?

No


### How was this patch tested?

CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-12 00:36:56 +08:00
Li Wang 218f21de21
[Benchmarks] Add qwen2.5-7b test (#763)
### What this PR does / why we need it?
- Add qwen2.5-7b test
- Optimize the documentation to be more developer-friendly 

Signed-off-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: xuedinge233 <damow890@gmail.com>
2025-05-10 09:47:42 +08:00
wemaster 19c8e134e4
[CI/UT] fix spec ut in vllm-ascend main and vllm main (#759)
### What this PR does / why we need it?
#### 1. Fix spec UT in vllm-ascend main and vllm main
As https://github.com/vllm-project/vllm-ascend/pull/694 and
https://github.com/vllm-project/vllm-ascend/pull/749 verify, the spec UT passes
with vllm-ascend main and vllm 0.8.5, but CI fails with vllm-ascend main
and vllm main.

I found the reason is a triton bug,
https://github.com/triton-lang/triton/issues/2266, but I didn't figure
out why the bug does not affect vllm-ascend main and vllm 0.8.5;
maybe the usage of triton has changed between vllm 0.8.5 and the latest main.

As the bug describes, I changed the minimum block_size in the UT from 8 to
16, and the modification is verified locally to be effective.

#### 2. Modify the skip form of some cases.
I modified some commented-out cases to the skipif form, which is more
standardized.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-05-10 09:45:56 +08:00
Li Wang 58d2f85c4a
[CI] Fix schedule trigger bug (#757)
### What this PR does / why we need it?
This PR aims to fix nightly ci
[broken](https://github.com/vllm-project/vllm-ascend/actions/runs/14848150987)
We have a workflow containing multiple triggers:

- push events (to the default branch)
- pull requests (against the default branch)
- scheduled events
Our paths-filter action works great for the first two use-cases,
detecting the context and base to compare against. However, it fails for
scheduled events giving the error `This action requires 'base' input to
be configured or 'repository.default_branch' to be set in the event
payload.`
For the scheduling trigger event, we choose to skip this filter
because we don't need its results:
```
      - name: Check for changes in Speculative Decode
        if: github.event_name != 'schedule'
```

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-10 09:45:07 +08:00
Yikun Jiang 804ebb17bd
[Doc] Move Release Compatibility Matrix to top and remove v0.7.x rc info (#799)
### What this PR does / why we need it?
- Move Release Compatibility Matrix to top 
- Remove v0.7.x rc info because the v0.7.3 final release is already published
- Rename vllm-ascend to vLLM Ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-09 16:41:50 +08:00
rjg-lyh fa99f89e93
[Core] Support the features of prefix cache and chunked prefill in v0/v1 (#782)
### What this PR does / why we need it?
Support the features of prefix cache and chunked prefill in v0/v1.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-05-09 16:39:28 +08:00
ApsarasX 324f819b92
[Perf] Optimize fused_experts quantization code to save npu memory (#784)
### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost
every operator is assigned a new variable name. If we want to save NPU
memory, we manually `del` these variables to end their lifecycle, which
fills the code with `del` statements and looks inelegant.
Therefore, I plan to name the output of most operators
`hidden_states`, thereby ending the lifecycle of the previous
`hidden_states`.
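
A simplified sketch of the renaming pattern (illustrative ops, not the actual w8a8 kernels):

```python
import torch

def act(hidden_states: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Before: gate_up = hidden_states @ w; out = silu(gate_up); del gate_up
    # After: reuse the same name so each intermediate frees its predecessor.
    hidden_states = hidden_states @ w
    hidden_states = torch.nn.functional.silu(hidden_states)
    return hidden_states
```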

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-09 15:09:37 +08:00
Jade Zheng 2c685e3b61
[Bugfix] Correct method call for _set_cos_sin_cache (#774)
This change ensures proper functionality for longer sequences by
correctly invoking the _set_cos_sin_cache method with self as the first
argument.

For example, with DeepSeek R1, if this change isn't made, the program
will crash when the input sequence exceeds 4096.
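
A minimal illustration with a hypothetical class: a method looked up on the class must receive the instance explicitly as its first argument.

```python
class Rope:
    def _set_cos_sin_cache(self, seq_len: int) -> None:
        self.max_seq_len_cached = seq_len

rope = Rope()
fn = Rope._set_cos_sin_cache   # unbound function taken from the class
fn(rope, 8192)                 # correct: pass the instance as the first argument
# fn(8192)                     # wrong: TypeError, the instance is missing
```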

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-05-09 12:55:57 +08:00
zzzzwwjj 5301649108
[Doc] Add notes for OOM in FAQs (#786)
### What this PR does / why we need it?
add notes for OOM in faqs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-08 16:28:29 +08:00
chris668899 6c020883a8
[WIP]Add Func: aclgraph_batch_size auto-adjust to different model (#771)
### What this PR does / why we need it?
This PR adds a new capability: aclgraph_batch_size can dynamically adjust to
different models. Before this PR, the aclgraph_batch_sizes given from
vllm to vllm-ascend were always too large, which may result in an ERROR while
running on different models, with the message: "The resources are
insufficient".
Now, with this PR, the code can dynamically adjust aclgraph_batch_sizes
depending on the model hidden_layer_nums and parallel config, for example:
a. for Qwen2.5-7B, the aclgraph_batch_size length is 33 in total;
b. for Qwen2.5-72B, the aclgraph_batch_size length is 11 in total;
Signed-off-by: chris668899 <15105191595@126.com>
2025-05-08 16:23:33 +08:00
yiz-liu 2e3520e285
[Bugfix] Fix output tensor shape in vanilla_chunked_prefill and update import paths for model_loader (#773)
### What this PR does / why we need it?
Fix output tensor shape in vanilla_chunked_prefill function.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Run offline inference on DeepSeek models.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-08 14:19:26 +08:00
Yikun Jiang ec27af346a
[Doc] Add 0.8.5rc1 release note (#756)
### What this PR does / why we need it?
Add 0.8.5rc1 release note and bump vllm version to v0.8.5.post1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-06 23:46:35 +08:00
linfeng-yuan 2cd036ee8e
[Bugfix] fix accuracy problem for quantized deepseek models (#768)
### What this PR does / why we need it?

The root cause of the bug is that numerical computations involving NaNs
cannot eliminate them. We addressed it by using `masked_fill_` to
eliminate NaNs while avoiding the memory-wasting `torch.where` approach.
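
For illustration, the in-place pattern versus the allocating one (a generic sketch, not the exact model code):

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
# torch.where builds a full replacement tensor alongside the original:
y = torch.where(torch.isnan(x), torch.zeros_like(x), x)
# masked_fill_ overwrites the NaN entries in place, avoiding that extra memory:
x.masked_fill_(torch.isnan(x), 0.0)
```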

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This patch was tested with vllm v0.8.5 and vllm-ascend master. I ran the
deepseek_v3 model with the offline inference scripts
(examples/dp_offline/run_dp.sh & data_parallel.py).

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-05-06 22:09:56 +08:00
ApsarasX d6e9417652
[Bugfix] Fix masked_fill_ function typo (#769)
### What this PR does / why we need it?
Fix a function name typo, changing `mask_fill_` to `masked_fill_`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-06 21:54:52 +08:00
Yikun Jiang afe1767c17
[Core] Cleanup triton patch which has been fixed in vllm (#764)
### What this PR does / why we need it?
- Revert "Re-patch TritonPlaceholder on main to make CI happy (#753)"
because upstream main CI already merged:
https://github.com/vllm-project/vllm/pull/17446
- Keep 0.8.5.post1 compatible

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-06 18:52:15 +08:00
linfeng-yuan b0dbe5f8e1
[Bug fix] fix a typo in setup.py (#762)
### What this PR does / why we need it?
Fix a typo in setup.py. Currently, it does not affect any
functionality or interfaces.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-05-06 17:01:26 +08:00
Yikun Jiang 5897dc5bbe
[Build] Bump vLLM version to v0.8.5.post1 (#755)
### What this PR does / why we need it?
Bump vllm version to v0.8.5.post1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-06 11:44:12 +08:00
sunbaosong d6bfae8eee
support 32K model len on deepseek r1 W8A8 (#728)
### What this PR does / why we need it?

Optimize NPU memory usage.
https://github.com/vllm-project/vllm-ascend/issues/723

vllm v0.8.4.rc2 and DeepSeek R1 can only support a model length of 16K.
When attempting to run with a model length of 32K, an "Out of Memory"
(OOM) error will occur.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: sunbaosong <13793883820@163.com>
2025-05-06 10:12:07 +08:00
Yikun Jiang 79538b5d73
Upgrade CANN version to 8.1.rc1 (#747)
### What this PR does / why we need it?

Make CANN version bump separately from
https://github.com/vllm-project/vllm-ascend/pull/708

- Upgrade CANN version to 8.1.rc1
- Add prefix to speed up download
`m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10`
- Address trailing space for Dockerfile.openEuler
- Add note for `/workspace` and `/vllm-workspace` as followup of
https://github.com/vllm-project/vllm-ascend/pull/741

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

CI passed

Co-authored-by: MengqingCao <cmq0113@163.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-05-06 05:44:18 +08:00
Yikun Jiang d7e1110c8e
Re-patch TritonPlaceholder on main to make CI happy (#753)
### What this PR does / why we need it?
Re-patch TritonPlaceholder on main to make CI happy
- Add triton patch back until
https://github.com/vllm-project/vllm/pull/17446 resolved
- Move patch_main before patch_common to resolve minicpm triton import
issue
- Add `0.8.5` and `0.8.5.post1` to make patch work on 0.8.5 all versions

Related:
- https://github.com/vllm-project/vllm-ascend/pull/704
- https://github.com/vllm-project/vllm-ascend/pull/690

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All CI passed include main

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-05 23:22:24 +08:00
Yikun Jiang d2ead057ae
Re-enable Speculative Decode test for vLLM v0.8.5 (#749)
### What this PR does / why we need it?
Re-enable Speculative Decode test for vLLM v0.8.5

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-02 14:44:48 +08:00
whx 8b194ad12e
[Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694)
### What this PR does / why we need it?
- This PR proposes a P2P version of Disaggregated Prefill based on
llm_datadist which manages data transfer.

- This solution reconstructs the previous offline single-node Disaggregated
Prefill solution, and now supports multi-node and online serving.

- Currently this solution supports 1P1D situation of Deepseek hybrid
parallelism (P: TP+EP, D: DP+EP). Note that xPyD situation is considered
in the solution design, and will be supported soon within v1 engine.

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
2025-05-01 22:31:36 +08:00
linfeng-yuan 84e2ed898b
performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731)
### What this PR does / why we need it?
1. Improve inference speed and usability for deepseek models with NPU
graph mode.
2. Modify some code to adapt to CANN 8.1.RC1.beta1.
3. Add a switch for NPU graph mode and its cache.

### Does this PR introduce _any_ user-facing change?
This PR provides an experimental configuration to enable NPU graph mode
for Deepseek models. Users can set
additional_config={'enable_graph_mode': True} to try this feature. Note
that this feature currently only supports the V0 engine.


### How was this patch tested?
This patch was tested with the newest torch_npu 2.5.1
(https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1
toolkit&nnal&kernels
(https://www.hiascend.com/developer/download/community/result?module=cann)
released in 25/30 April.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-05-01 13:51:42 +08:00
Mengqing Cao 399b03830d
[Build][Bugfix] Fix source code path to avoid reference error (#726)
### What this PR does / why we need it?
Fix source code path to avoid reference error in docker image
fix https://github.com/vllm-project/vllm-ascend/issues/725

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-30 17:38:13 +08:00
Pleaplusone 3a628891ab
[Feature] Add quant description file for new quant model generated by modelslim (#719)
### What this PR does / why we need it?
After discussing the quantization model format with MindStudio, we
decided to support another quant format which may be used in the new modelslim
tool, in which case `quantization_config` may be removed from the
`config.json` file and `quant_model_description.json` will be used for
quantization configuration.
### Does this PR introduce _any_ user-facing change?
Yes, using the latest quantization format

### How was this patch tested?
Test locally

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-30 16:51:56 +08:00
hfadzxy affca6f348
[Test] Add accuracy test report workflow (#542)
### What this PR does / why we need it?
1. Provide accuracy test report for development branch release.
2. Models and datasets for accuracy test:
    
| Model | datasets |
|---------------------------- | --------------------------- | 
| Qwen2.5-7B-Instruct        |  ceval-val, gsm8k, mmlu  |
| Qwen3-8B                        |  ceval-val, gsm8k, mmlu  |
| Llama-3.1-8B-Instruct      |  ceval-val, gsm8k, mmlu  |
| Qwen2.5-VL-7B-Instruct  |           mmmu_val             |

### Does this PR introduce _any_ user-facing change?
This PR will display the accuracy test report of the release version in
docs/source/developer_guide/accuracy_report:
Qwen2.5-7B-Instruct.md
Qwen3-8B.md
Llama-3.1-8B-Instruct.md
Qwen2.5-VL-7B-Instruct.md

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-30 14:53:58 +08:00
zouyida2052 ba9714ccee
Optimize qwen2_vl and qwen2_5_vl (#701)
### What this PR does / why we need it?
Optimize qwen2_vl and qwen2_5_vl.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Testing this PR on a 1080p picture with tp=1, bs=1 on Qwen2-VL and
Qwen2.5-VL, every FA op's duration drops from 11ms to 9ms, giving
roughly a 22% perf boost.

---------

Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
2025-04-30 14:22:38 +08:00
Li Wang 90aabaeb2e
[Doc] Add benchmark guide (#635)
### What this PR does / why we need it?
 Add benchmark developer guide

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-30 09:17:59 +08:00
wangxiyuan f8350569e6
[CI] upgrade vllm to 0.8.5 (#715)
1. Upgrade vllm to 0.8.5
2. Drop 0.8.4 support
3. Keep doc to 0.8.4rc2 until we release 0.8.5

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-30 09:15:50 +08:00
wangxiyuan 95e7aa4736
[Platform] format platform to make it more clear (#610)
Platform should only contain the functions that are based on vllm. This PR
moves the unrelated functions to the right place to make the platform
clearer.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-30 09:03:10 +08:00
wangxiyuan b917361ca5
[MISC] Clean up torch_npu (#688)
torch_npu 2.5.1 supports autoload now. This patch does:
1. remove useless torch_npu imports
2. replace `torch_npu.npu` with `torch.npu`.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 18:03:38 +08:00
Pleaplusone 0329fad927
[Perf] Deepseekv3 performance optimization for eager mode (#598)
### What this PR does / why we need it?
Deepseek v3 currently adopts vanilla chunked prefill on the MLA part, which is
inefficient for computing but necessary for chunked prefill. Since PR
https://github.com/vllm-project/vllm-ascend/pull/543 brought the v0 scheduler
into vllm-ascend, we can now adopt torch_npu._npu_flash_attention inside
the mla backend for more performance boost. There were also some
redundant computations inside the rope, which are removed as well. This PR
should bring some performance gain for deepseek eager mode inference.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-29 17:12:03 +08:00
ApsarasX 87975fa058
[Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682)
### What this PR does / why we need it?

Fix #674 to avoid KVCache overallocation and OOM risks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-04-29 17:06:19 +08:00
Li Wang 7aee9228f0
[CI] Add nightly CI (#668)
### What this PR does / why we need it?
Add nightly CI  for basic function and model usability

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-29 16:35:52 +08:00
Li Wang d6be63e11d
[CI] Add Qwen3-0.6B-Base test (#717)
### What this PR does / why we need it?
Add Qwen3-0.6B-Base for integration test

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-29 14:35:19 +08:00
wangxiyuan 0dae55a9a3
[MISC] fix format check error (#654)
This PR makes format.sh work as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 11:14:19 +08:00
wangxiyuan 1fce70a2fb
[Model] Support common fused moe ops for moe model, such as Qwen3Moe (#709)
vllm-ascend currently only supports MoE for DeepSeek. We should add common MoE
support back.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-28 21:57:01 +08:00
Jade Zheng 40bd602485
[Feature] Use reshape_and_cache fused op (#706)
Replace the torch function with the reshape_and_cache fused op for better
performance. The `reshape_and_cache` function wasn't working because it
expected a torch.int32 tensor, but a torch.int64 tensor was provided.
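
For illustration, a minimal sketch of the dtype mismatch described above (the
slot values are made up; only the int64 -> int32 cast matters):

```python
import torch

# Hypothetical slot mapping produced as int64; the fused reshape_and_cache op
# expects int32 indices, so a cast is needed before calling it.
slot_mapping = torch.tensor([0, 17, 34, 51], dtype=torch.int64)
slot_mapping_i32 = slot_mapping.to(torch.int32)
assert slot_mapping_i32.dtype == torch.int32
```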

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-04-28 21:54:42 +08:00
Yikun Jiang d39855b075
Update installation and tutorial doc (#711)
### What this PR does / why we need it?
Update installation and tutorial doc

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-28 21:52:17 +08:00
wangxiyuan 5995d23532
[Doc] Add 0.8.4rc2 release note (#705)
Add 0.8.4rc2 release note

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-28 21:51:35 +08:00
wemaster 54c0e63df7
[MTP] follow custom deepseek modeling changes to support graph mode (#636)
### What this PR does / why we need it?

As custom deepseek modeling do some changes to support graph mode in
https://github.com/vllm-project/vllm-ascend/pull/585, so i follow it to
change custom deepseek_mtp modeling.

And some modifications for k>1 were not carried over by the
https://github.com/vllm-project/vllm-ascend/pull/429, now i add it.

In order to better take care of the MTP feature in the vllm-ascend
repository, I added cases related to graph mode(torchair), but i skip it
since torchair can not correctly clean up memory in vllmrunner.

Also i add some case for MTP quantization weights, but test weight is
not ready, so i skip it and i will open it when test quant weights is
ready.

https://github.com/vllm-project/vllm-ascend/pull/648 did not completely
fix the sample
change(https://github.com/vllm-project/vllm-ascend/issues/660) issue, I
added the relevant changes.

### Does this PR introduce _any_ user-facing change?
Now you can use the following method to run MTP on DeepSeek V3/R1 float or
quant weights in eager mode.
```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Or use MTP on DeepSeek V3/R1 float or quant weights with graph
mode (torchair):
```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Additional notes:
1. We now support k>1, so you can set num_speculative_tokens > 1 if there
is sufficient spare computing power.
2. MTP is not supported in V1; we will support it when vLLM does in
https://github.com/vllm-project/vllm/issues/13500.
3. If running MTP fails with a `segmentation fault`, you can follow the v0.7.3
patch https://github.com/vllm-project/vllm-ascend/pull/236, file
`vllm_ascend/patch/patch_metrics.py`, method
`__npu_async_metrics_collector_init__`.

### How was this patch tested?
Local tests passed and tested by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-28 21:18:53 +08:00
Mengqing Cao be9e3e8545
[Bugfix] Fix triton placeholder patch period (#704)
Fix triton placeholder patch period

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-28 18:52:03 +08:00
Li Wang 58f9d932d3
[Doc] Update faqs (#699)
### What this PR does / why we need it?
Update the FAQs to make them clearer.


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-28 18:48:23 +08:00
Li Wang d0a0c81ced
[Doc] Add deepseek-v2-lite w8a8 quantization tutorial (#630)
### What this PR does / why we need it?
Add a deepseek-v2-lite w8a8 quantization tutorial.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-28 17:14:26 +08:00
wangxiyuan 5de3646522
[MISC] Make vllm version configurable (#651)
Sometimes a user installs a dev/editable version of vLLM. In this case, we
should make sure vllm-ascend works as well.

This PR adds a new env `VLLM_VERSION`. It's intended for developers who edit
vLLM; they should set this env so that vllm-ascend knows which vLLM
version is installed and used.
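
A hedged illustration of setting this env from Python before vllm-ascend is
imported; the version value is just an example, and exporting `VLLM_VERSION`
in the shell achieves the same thing:

```python
import os

# Assumption: vllm-ascend reads VLLM_VERSION when it is imported, so set it first.
# The value below is illustrative; use the vLLM version you actually installed/edited.
os.environ["VLLM_VERSION"] = "0.8.4"

import vllm_ascend  # noqa: E402  # imported after the env var is set
```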

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-28 14:19:06 +08:00
dependabot[bot] 8849cf1eda
Bump actions/setup-python from 5.5.0 to 5.6.0 (#697)
Bumps [actions/setup-python](https://github.com/actions/setup-python)
from 5.5.0 to 5.6.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-28 14:06:38 +08:00
Icey ee7a0e2cd4
Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 (#689)
### What this PR does / why we need it?
Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1

### Does this PR introduce _any_ user-facing change?
No

Signed-off-by: Icey <1790571317@qq.com>
2025-04-28 11:45:46 +08:00
Pleaplusone 38f34e359f
[Fix] fix deepseek v0 attention eager mode (#671)
### What this PR does / why we need it?
`reshape_and_cache_siso` seems to have some functionality issues; use a torch
op combination to replace this custom op by default.


---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-28 08:53:06 +08:00
Yikun Jiang 413657ae43
[FOLLOWUP][DOC] Fix pip install cmd in installation.md (#680)
### What this PR does / why we need it?
Fix pip install cmd in installation.md

Followup on: https://github.com/vllm-project/vllm-ascend/pull/661

### Does this PR introduce _any_ user-facing change?
No, doc only

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-27 18:37:25 +08:00
Yikun Jiang 2e20797934
[BUILD] Upgrade torch-npu to 2.5.1 (#661)
### What this PR does / why we need it?
torch-npu 2.5.1 is published:
https://pypi.org/project/torch-npu/2.5.1/
It's time to remove all torch-npu dev versions from the vllm-ascend code base.

### Does this PR introduce _any_ user-facing change?
Yes, using torch-npu 2.5.1

### How was this patch tested?
- [ ] CI passed
- [ ] Manually test
- [ ] Grep all `dev2025`

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-27 17:28:29 +08:00
Jade Zheng fa4a5d980e
[Bugfix] Remove redundant tensor creation and unused code (#656)
### What this PR does / why we need it?
Eliminated duplicate `block_table` tensor initialization and cleaned up
unused code segments. This resolves an issue where the second creation
was overwriting the first, potentially leading to unexpected behavior.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-04-27 14:09:16 +08:00
Mengqing Cao ba3d8aae94
[Model][MiniCPM] support MiniCPM (#645)
### What this PR does / why we need it?
This PR supports MiniCPM on the main branch. See
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
test locally with minicpm

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-27 11:27:24 +08:00
Yikun Jiang 742f679c7d
Remove prompt string from engine core data structures (#663)
### What this PR does / why we need it?
vLLM Ascend side followup on:
[Core] Remove prompt string from engine core data structures

df6f3ce883

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-26 23:15:58 +08:00
wangxiyuan c99c4c8c70
[Doc] Update feature support list (#650)
1. Remove the Chinese doc. The content is out of date and we don't have
enough time to maintain it.
2. Update the feature support matrix. Refresh the content and add V1 status.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-26 10:27:29 +08:00
wangxiyuan 3879d9cad9
[CI] Fix sample backward compatibility problem (#648)
b411418ff0
This vLLM commit changes the sample usage. This PR adapts to the change for
main and makes sure it works for 0.8.4 as well.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-25 11:53:26 +08:00
yiz-liu d785e78563
[V1] Make V1 engine backward compatible (#637)
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and
torch_npu releases.

### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set
enforce_eager=True.

### How was this patch tested?
Test it with regular offline inference examples.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-04-24 17:20:11 +08:00
Li Wang bd70ce828c
[CI] Add qwen2.5-vl test (#643)
### What this PR does / why we need it?
Part of #499.
Add a qwen2.5-vl test on a single NPU. The v1 engine is excluded because
qwen2.5-vl currently has some problems with v1. At the same time, this test
also makes #639 more credible.

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-24 17:12:12 +08:00
Li Wang a9c6b52205
[Bugfix] Fix qwen2.5-vl position input bug (#639)
### What this PR does / why we need it?
Fix the qwen2.5-vl position input bug; fixes #625 `TypeError: 'NoneType' object
is not iterable`.

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-24 15:21:57 +08:00
Li Wang 866ce7168c
[Benchmark] Download model from modelscope (#634)
### What this PR does / why we need it?
- Running the benchmark scripts will download the model from ModelScope (see the sketch below)
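
A hedged sketch of pulling a model from ModelScope before benchmarking; the
model id is illustrative, and the benchmark scripts may wire this up
differently:

```python
from modelscope.hub.snapshot_download import snapshot_download
from vllm import LLM

# Download the model weights from ModelScope and point vLLM at the local path.
model_dir = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")
llm = LLM(model=model_dir)
```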

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-24 14:48:24 +08:00
Bug Hunter Yan 05bdcbeae4
support aclgraph (#426)
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature
provided by the v1 engine.

1. register unifiled_ascend_attention_with_output for piecewise_graph to
split the graph.
2. support NPUGraph to accelerate kernel launch.

### Does this PR introduce _any_ user-facing change?
npugraph is now enabled by default; users can disable the npugraph feature by
configuring enforce_eager (a brief sketch follows below).

This has corresponding requirements for the versions of torch_npu and
CANN, and they need to support graph capture.
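
A minimal sketch of the opt-out described above (the model name is
illustrative):

```python
from vllm import LLM

# npugraph capture is on by default after this PR; pass enforce_eager=True
# to fall back to eager mode, e.g. if the installed torch_npu/CANN cannot
# capture graphs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)
```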

### How was this patch tested?
It is turned on by default.

---------

Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-04-23 20:56:24 +08:00
zzzzwwjj 5c6d05a59e
support deepseek quant & mix-parallel with graphmode (#585)
### What this PR does / why we need it?
1. support deepseek with w8a8 quant;
2. support deepseek with mix-parallel(multi-DP, EP+TP);
3. support deepseek with graphmode.
---------

Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
2025-04-23 16:23:25 +08:00
Pleaplusone e74331a1ed
Add dp initialize patch with hccl backend (#626)
### What this PR does / why we need it?
Add a DP stateless process group initialization path with the HCCL backend as
a vllm-ascend patch.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-23 15:47:51 +08:00
RongRongStudio 848e041a54
Using EvalScope evaluation (#611)
### What this PR does / why we need it?
Use EvalScope to run an evaluation (including eval and stress test):
-
https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
-
https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test locally

---------

Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-23 00:50:09 +08:00
Shanshan Shen 4a0ce3660e
[Misc] Remove some parts of metrics patch (#603)
### What this PR does / why we need it?
Remove some parts of the metrics patch, since the `cuda` hardcoding has been
fixed by https://github.com/vllm-project/vllm/pull/14411.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-22 18:45:21 +08:00
Li Wang cf6ab42ee2
[CI]Add guided decoding test (#422)
### What this PR does / why we need it?
After extensive testing, we are happy to say that guided_decoding is
fully supported on NPU. In this PR, we integrate guided_decoding into
our tests; it mainly does the following (a hedged usage sketch follows the list):
1. test the v0 supported backends, including `"outlines",
"lm-format-enforcer", "xgrammar"`
2. test the v1 supported backends, including `"guidance", "xgrammar"`
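
A hedged usage sketch, assuming vLLM's offline `GuidedDecodingParams` API; the
model name and the choice task are illustrative, and the backend is normally
selected through vLLM's guided-decoding backend setting:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Constrain the output to one of two choices.
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]))

outputs = llm.generate(
    ["Classify the sentiment of: 'What a great day!' ->"], params)
print(outputs[0].outputs[0].text)
```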

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-22 17:50:06 +08:00
wangxiyuan 538a69c145
[Patch] format patch module to make it more clear (#601)
Format the patch module to make it clearer.
Add the patch doc description; new patches must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-22 14:13:00 +08:00
Shuqiao Li ad845bfe82
fix doc to mention env setting for v0.7.3-dev (#602)
### What this PR does / why we need it?
fix doc to mention env setting for v0.7.3-dev

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
2025-04-22 14:11:41 +08:00
Pleaplusone d12a057df8
Add note for deepseek related docs and remove unnecessary comments (#590)
### What this PR does / why we need it?
Add notes for deepseek's patch and remove some of the unnecessary
comments

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-22 09:59:09 +08:00
Mengqing Cao c5850d302d
[Doc] Update installation (#596)
Many users face a failed installation when using `pip install -e .`.
This is mainly introduced by the released `torch-npu` version conflicting
with `torch>=2.5.1`. This conflict mainly exists in the temporary env of the
pyproject build.
This PR updates the installation tutorial to use `python setup.py develop`
as a quick fix.

cc @wangxiyuan

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-22 09:04:20 +08:00
paulyu12 a8d633f629
[Bugfix] fix import error (#600)
### What this PR does / why we need it?
Fix the import error that
https://github.com/vllm-project/vllm-ascend/issues/592 mentioned.

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-04-22 08:57:25 +08:00
wemaster 0ae9ee0f8a
[BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593)
### What this PR does / why we need it?
This PR fixes some bugs about spec decode / MTP and adds an MTP e2e UT
`test_mtp_correctness.py`.

**vllm_ascend/attention/attention.py**
1. add support for `self.attn_mask_cache` having only 1 element, to cover the
scene in which both spec decode and chunked prefill are enabled.

**vllm_ascend/distributed/parallel_state.py**
1. remove 2 asserts because the spec decode worker would call init_worker
twice

**vllm_ascend/models/deepseek_mtp.py**
1. remove unused params;
2. add w8a8 support in `CustomDeepSeekMTP`

**vllm_ascend/quantization/quant_config.py**
1. use `AscendUnquantizedFusedMoEMethod` instead of
`UnquantizedFusedMoEMethod`

**other**
1. replace `from vllm.logger import init_logger` with `from vllm.logger
import logger` throughout the vllm-ascend project



### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-21 19:25:51 +08:00
Shuqiao Li 5442b463fd
add doc for patch_config (#574)
### What this PR does / why we need it?
add doc for patch_config
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No code changed.

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
2025-04-21 10:33:38 +08:00
Yikun Jiang 96d6fa7c90
[Docker] Fix openEuler image suffix (#586)
### What this PR does / why we need it?
There was a bug when we released v0.8.4rc1 (the openEuler image tag was wrongly
set to 0.8.4rc1). According to the docker-meta-action doc, the suffix should be
appended:
```
tags: |
  type=pep440,enable=true,priority=900,prefix=,suffix=,pattern=,value=
```

This patch just fixes the openEuler image suffix to make the pep440 tag rule work.

This patch also removes the cache step because the cache step brings more
than 10 minutes of export time while saving little time on the next trigger.

### Does this PR introduce _any_ user-facing change?
Yes, docker image tag set to right

### How was this patch tested?
I tested in my fork repo by setting the default branch:
- released a tag: v0.7.88rc1 (pep440 tag)
- The log shows `--label
org.opencontainers.image.version=v0.7.88rc1-openeuler`, which is the right rule


https://github.com/Yikun/vllm-ascend/actions/runs/14560411481/job/40842950165#step:9:205

Related: https://github.com/vllm-project/vllm-ascend/pull/489

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-21 08:55:26 +08:00
Yikun Jiang 12cae04db9
[quantization] Support w8a8 quantization (#580)
### What this PR does / why we need it?

Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic
quantization on linear and MoE (W8A8_DYNAMIC). The quantizer is enabled if a
model has a [quantize
field](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27).
If MindIE Turbo is installed, the MindIE Turbo Quantizer is applied;
otherwise VLLMAscendQuantizer is used directly.

- This patch fixes the installation docs to make installation work
- This patch enables norm quantization by patching `RMSNorm.__init__`,
`RMSNorm.forward_oot`, and `NPUModelRunnerBase.load_model`
- Add `AscendW8A8LinearMethod` for W8A8
- Add `AscendW8A8DynamicLinearMethod` and
`AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC
- Add a e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` 

### Does this PR introduce _any_ user-facing change?
Yes, w8a8 quantization is supported. After this patch, users can
use the command below to run w8a8 models:

```
vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B"
```

### How was this patch tested?
0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
1. From @Yikun:
I tested Qwen2.5-0.5B-Instruct-w8a8 for functional testing and all is well; please
refer to
https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613

2. From @dingdingchaomian:
Tested with the qwen2.5-72b-instruct and deepseek-v2-lite-chat models, both
quantized using Ascend's msmodelslim tool:
- Qwen2.5-72b-instruct was tested twice, once for w8a8 static and once
for w8a8 dynamic.
- Deepseek-v2-lite-chat was tested once because its quantization used
both static and dynamic w8a8.

Models were tested using both offline inference and online serving, and
both work well. The inference code is exactly the same as the
examples in
https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with only the
model path and tensor parallel number changed.

---------

Signed-off-by: dingdingchaomian <wangce21@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: dingdingchaomian <wangce21@huawei.com>
Co-authored-by: Angazenn <zengyanjia@huawei.com>
Co-authored-by: liujiaxu <liujiaxu4@huawei.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>
2025-04-20 18:14:05 +08:00
Pleaplusone 1a1f9a6d89
port deepseekv2 and mtp to main branch (#429)
### What this PR does / why we need it?
This PR ports all the deepseek graph mode code and mtp code from v0.7.3
to the main branch
---------

Signed-off-by: SidaoY <1024863041@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-04-19 17:38:18 +08:00
Yikun Jiang 086423dc35
[Docker] Bump Dockerfile version to v0.8.4 (#577)
### What this PR does / why we need it?
Bump Dockerfile version to v0.8.4

### Does this PR introduce _any_ user-facing change?
docker image are using v0.8.4 version vLLM

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/pull/571

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-18 19:15:17 +08:00
Shuqiao Li a127cc83f8
catch ImportError when C code not compiled (#575)
### What this PR does / why we need it?
Found a problem when ImportError raised but not ModuleNotFoundError.


### Does this PR introduce _any_ user-facing change?
No


### How was this patch tested?
CI passed

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
2025-04-18 18:11:49 +08:00
Shanshan Shen 985b0548b0
[Doc] Update v0.8.4 release note, add contents for structured output feature (#576)
### What this PR does / why we need it?
Update v0.8.4 release note:

- Add contents for structured output feature.
- Remove redundant `(` in spec decoding.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Preview

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-18 17:44:16 +08:00
Shanshan Shen 65c1f4579f
[V1][Structured Output] Add `apply_grammar_bitmask()` method to model runner (#555)
### What this PR does / why we need it?
Add `apply_grammar_bitmask()` method to model runner.

This method is necessary for `xgrammar` structured output.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-18 16:47:55 +08:00
Mengqing Cao 2c903bc7ac
[Doc] Update doc for custom ops build (#570)
- update doc about custom ops compile

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-18 15:35:10 +08:00
Mengqing Cao b91f9a5afd
[Doc][Build] Update build doc and faq (#568)
Update build doc and faq about deepseek w8a8

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-18 14:16:41 +08:00
wangxiyuan e66ded5679
[Doc] Add release note for 0.8.4rc1 (#557)
Add release note for 0.8.4rc1, we'll release 0.8.4rc1 now.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-18 13:24:36 +08:00
Shanshan Shen 7eeff60715
[Doc] Update FAQ doc (#561)
### What this PR does / why we need it?
Update the FAQ doc to make `docker pull` clearer.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-18 13:13:13 +08:00
Shuqiao Li 84563fc65d
Add sleep mode feature for Ascend NPU (#513)
### What this PR does / why we need it?
This PR adds the sleep mode feature for vllm-ascend. When the engine sleeps,
we do mainly two things:

- offload model weights
- discard kv cache

RLHF tools(such as https://github.com/volcengine/verl and
https://github.com/OpenRLHF/OpenRLHF) have a strong need of sleep mode
to accelerate the training process.

This PR may solve #375 and #320 .

### Does this PR introduce _any_ user-facing change?
No existing user interfaces changed.
Users will have two new methods(`sleep()` and `wake_up()`) to use.

### How was this patch tested?
This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.

At first, we have free NPU memory M1.

After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)` is
executed, we have free NPU memory M2, with M2 < M1.

Then we call `llm.sleep(level=1)` and have free NPU memory M3.

We have M3 > M2, and M3 is very close to M1.

Plus, we have the same output tokens before sleep and after wake up,
with the config of `SamplingParams(temperature=0, max_tokens=10)` and
with the same input tokens of course.
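
For reference, a minimal sketch of the flow described above (the prompt text is
illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
params = SamplingParams(temperature=0, max_tokens=10)

before = llm.generate(["The capital of France is"], params)

llm.sleep(level=1)   # offload model weights and discard the KV cache
llm.wake_up()        # reload weights before generating again

after = llm.generate(["The capital of France is"], params)
# with temperature=0, the outputs before sleep and after wake_up should match
```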


This PR is utilizing the CMake procedure of #371 , thanks a lot.

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
2025-04-18 13:11:39 +08:00
wangxiyuan 42c7fbb10e
[Misc] Fix import error and address nits to make CI happy (#563)
1. Add a `vllm_version_is` function to check the vLLM version.
2. `ensure_kv_transfer_initialized` and `get_kv_transfer_group` have
been moved to another place in the vLLM main branch via
3408e47159
; this patch fixes the import error.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-18 12:23:32 +08:00
Pleaplusone 66a0837963
adopt rope in vllm-ascend (#530)
### What this PR does / why we need it?
Adopt the custom rotary embedding kernel in actual model inference. The
customized rotary_embedding generates contiguous query and key on
the cpp side to reduce the overhead of two contiguous calls and an index_select
compared with rotary_embedding in torch_npu. For now, rotary_embedding
only supports the scenario of `is_neox = true`; the non-neox version of rope
will be updated soon.
---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-18 08:56:05 +08:00
whx 23f85e3f74
[BugFix] Fix scheduler problems in last PR. (#558)
This PR fixes scheduler problems from the last PR:
1. change the position of the DT test to validate it.
2. fix the copyright format.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-04-18 08:49:48 +08:00
Mengqing Cao 6ee7f5cf71
[SpecDecode] Add spec decode support (#500)
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This supports speculative decoding on Ascend, including speculating with
a draft model, matching n-grams in the prompt, using MLP speculators,
and using EAGLE-based draft models.
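
A hedged sketch of enabling one of these modes offline, using n-gram prompt
lookup; the `speculative_config` keys follow the style of the MTP examples
elsewhere in this log and may differ slightly across vLLM versions (model name
and values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_config={
        "method": "ngram",             # speculate by matching n-grams in the prompt
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 4,
    },
)
print(llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32)))
```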

Backport: https://github.com/vllm-project/vllm-ascend/pull/423
Spec decode MultiStepWorker fully supports TP1DraftModelRunner, supports
running the draft_model_runner with multi-step prepare on the NPU directly,
and supports the draft_model_runner using MLA.

1. Before this PR, `MultiStepWorker` would not step into the branch
using NPU prepare, but only into the branch using CPU prepare (`line 52`
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has
no effect on the correct operation of speculative decoding and the
performance of the two branches is basically the same as in the current
version, this PR supports entering that branch. In general, there
are two main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when the outer MultiStepWorker can work correctly.

2. Before this PR, `TP1DraftModelRunner` only supported Attention on NPU,
but not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don't know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I made corresponding changes. Otherwise, when the attention backend is
MLA, it complains that input_positions cannot be found.

3. I commented out two lines in `draft_model_runner.py` at `line118` to
support the scenario of K>1.
  ```
  # lora_mapping=model_input.lora_mapping,
  # lora_requests=model_input.lora_requests,
  ```
I added comments. In the future, when vllm-ascend supports the LoRA feature,
the changes here can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-04-17 20:16:32 +08:00
Mengqing Cao b71f193cb0
[Model][Doc] Update model support list (#552)
Update model support list
cc @Yikun plz help review, thanks!

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-17 19:32:20 +08:00
whx 20dff4deff
[Scheduler] Add AscendScheduler. (#543)
This PR adds AscendScheduler to vllm v1 engine.
This scheduler currently supports v0-style prefill-first scheduling
strategy.
In the future more schedule methods will be supported by this scheduler.

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-04-17 19:31:50 +08:00
paulyu12 697908f5cd
[Platform][Worker][ModelRunner] Add LoRA & Multi-LoRA support (#521)
### What this PR does / why we need it?
According to this RFC [[RFC]: Join the MultiLora and MultiLora Dynammic
Serving feature develop
#396](https://github.com/vllm-project/vllm-ascend/issues/396) and the
[vLLM Ascend Roadmap Q2 2025
#448](https://github.com/vllm-project/vllm-ascend/issues/448), we submit the
relevant code to support (1) Multi-LoRA and (2) Multi-LoRA
Dynamic Serving.

LoRA reference is here: [LoRA
reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?

The following OpenAI HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py
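
For offline inference, a minimal Multi-LoRA sketch using vLLM's `LoRARequest`
(the base model and adapter path are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Write a haiku about the sea."],
    SamplingParams(max_tokens=64),
    # name, unique int id, and local path of the adapter
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```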

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
2025-04-17 16:48:46 +08:00
hfadzxy 9935d45728
[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
Huazhong Ji c3d1a3782a
Add pyhccl (#503)
This is the first step to support trl vllm serve on Ascend NPU
https://github.com/vllm-project/vllm-ascend/issues/459.
This PR can work properly only when
https://github.com/vllm-project/vllm/pull/16464 is merged into vLLM.

---------

Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>
2025-04-17 14:57:52 +08:00
Li Wang 64fdf4cbef
[Doc]Update faq (#536)
### What this PR does / why we need it?
update performance and accuracy faq

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-17 14:56:51 +08:00
Mengqing Cao 6061f33670
[Bugfix][Model] Fix api in DeepSeek model (#545)
### What this PR does / why we need it?
Fix api in DeepSeekV2, aligning with the latest code of the main branch
in vllm.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Test locally with deepseek-v2-lite, and will add CI by @Potabk.
Plz update the model UT after this pr is merged, thx! cc @Potabk

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-17 11:56:05 +08:00
Li Wang 9859e7313f
[CI]Add global env to runner (#537)
### What this PR does / why we need it?
- add `HF_TOKEN` as a global var to the runner
- add `HF_ENDPOINT` as a global var to the runner
- change the concurrency group to rely on the current PR number

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-17 10:08:00 +08:00
hfadzxy 00de2ee6ad
[Doc] update faq about progress bar display issue (#538)
### What this PR does / why we need it?
update faq about progress bar display issue

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-16 16:07:08 +08:00
Mengqing Cao fe13cd9ea5
[Doc] update faq about w8a8 (#534)
update faq about w8a8

---------

Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-04-16 09:37:21 +08:00
Shanshan Shen 415ed027fa
[V1][Platform] Remove `supports_structured_output()` in platform (#531)
### What this PR does / why we need it?
Remove `supports_structured_output()` in platform. This method is no longer needed because upstream has deleted it.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-16 09:30:33 +08:00
wangxiyuan bbe7ccd366
[MISC] Add patch module (#526)
This PR adds a patch module for vLLM:
1. platform patch: the patch is registered when the platform is loaded
2. worker patch: the patch is registered when the worker is started

The details are:
1. patch_common: patches for both main and the 0.8.4 version
2. patch_main: patches for the main version
3. patch_0_8_4: patches for the 0.8.4 version
2025-04-16 09:28:58 +08:00
wangxiyuan 434749d299
[CI] update 0.8.3 to 0.8.4 (#528)
Update 0.8.3 CI to 0.8.4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-16 09:26:30 +08:00
Li Wang 13480d1238
[CI]Fix workflow (#532)
### What this PR does / why we need it?
Make the linux-npu-4 runner run in parallel for now.


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-15 19:55:41 +08:00
Shanshan Shen bcbc04f92b
[Doc] Add environment variables doc (#519)
### What this PR does / why we need it?
Add environment variables doc.
---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-15 16:09:36 +08:00
eeethenQ 44a8301424
[Feature] Add PD separation feature (#432)
### What this PR does / why we need it?
Adapt Disaggregated Prefill feature onto Ascend device

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

The test usage has been provided along with the PR, in
examples/offline_disaggregated_prefill_npu.py.
To run it, do this:
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```

---------

Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
2025-04-15 15:11:35 +08:00
wangxiyuan c7f6584d75
[V1] clean up V1 code (#505)
Clean up V1 code:
1. remove useless code.
2. format code to be clear.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:24:02 +08:00
wangxiyuan f6af1d2471
[MISC] fix logger (#515)
The logger in vllm-ascend doesn't work. This PR fixes the issue.

Fix: https://github.com/vllm-project/vllm-ascend/issues/431

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:18:05 +08:00
wangxiyuan 5c6d79687c
[Doc] Update FAQ (#518)
Update FAQ

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:17:56 +08:00
wangxiyuan 5fa70b6393
[Build] Update doc (#509)
1. Install torch-npu before vllm-ascend to ensure the custom ops build
succeeds.
2. Set `COMPILE_CUSTOM_KERNELS=0` if users want to disable the custom ops
build.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-14 14:38:50 +08:00
Shanshan Shen 11ecbfdb31
[Doc] Update FAQ doc (#504)
### What this PR does / why we need it?
Update FAQ doc.
---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-14 11:11:40 +08:00
wangxiyuan 9c7428b3d5
[CI] enable custom ops build (#466)
### What this PR does / why we need it?
This PR enables the custom ops build by default.

### Does this PR introduce _any_ user-facing change?

Yes, installing vllm-ascend from source now triggers the custom ops
build step.

### How was this patch tested?
By image build and e2e CI

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-12 10:24:53 +08:00
Icey d05ea17427
Add openEuler based container image for vLLM Ascend (#489)
### What this PR does / why we need it?

Provide users with openEuler-based vLLM images and update the quick
start readme accordingly.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

There is no need to perform any test.

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-04-10 14:30:49 +08:00
Li Wang afdbf77483
[CI] Add new runner and enable QwQ multinpu test (#417)
### What this PR does / why we need it?

- Add a new runner to the continuous integration system and keep the
original CI runner until the new runner runs stably
- Add distributed test cases

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-08 16:52:45 +08:00
jinyuxin 5d6239306b
[DOC] Update multi_node.md (#468)
### What this PR does / why we need it?
- Added instructions for verifying multi-node communication environment.
- Included explanations of Ray-related environment variables for
configuration.
- Provided detailed steps for launching services in a multi-node
environment.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manually tested.

Signed-off-by: jinyuxin <jinyuxin2@huawei.com>
2025-04-08 14:19:57 +08:00
Mengqing Cao f6cf92e7d5
[quant][bugfix] fix deepseek quant bug (#478)
see #465

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-08 09:15:56 +08:00
Yikun Jiang 579d858a20
Set torchvision<0.21.0 to match torch/torch_npu version (#479)
### What this PR does / why we need it?
Set torchvision<0.21.0 to match torch/torch_npu version to resolve
`RuntimeError: operator torchvision::nms does not exist`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/477

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-08 09:15:42 +08:00
Shanshan Shen 1d88dacf9f
[V1][Platform] Add `supports_structured_output()` method to Platform (#475)
### What this PR does / why we need it?
Add `supports_structured_output()` method to Platform, find more details
at https://github.com/vllm-project/vllm/pull/16148.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-07 19:11:51 +08:00
Yikun Jiang adabdeea7f
Set numpy < 2.0.0 to resolve numpy VersionConflict (#476)
### What this PR does / why we need it?
vLLM bumps numpy version to 2.x:
8427f70493
, this will cause a
`pip._vendor.pkg_resources.ContextualVersionConflict: (numpy 2.2.4
(/usr/local/python3.10/lib/python3.10/site-packages),
Requirement.parse('numpy==1.26.4'), {'vllm-ascend'})` failure when installing
vllm-ascend. This PR resolves the issue by:
- Setting numpy < 2.0.0 to resolve the numpy VersionConflict
- Syncing requirements and toml
- Reordering


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/473

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-07 16:07:21 +08:00
Mengqing Cao 344228a5da
[deepseek][bugfix] support deepseek quant (#469)
- support deepseek quant
  - add w8a8_dynamic quant
see #391

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-07 10:56:12 +08:00
Li Wang 3f9752f8ee
[Bugfix]Lazy import vllm config (#462)
### What this PR does / why we need it?
Lazy import vllm config  to avoid circular imports

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-04-03 16:03:08 +08:00
Pleaplusone ce8259975e
[core] Support custom ascendc kernels in vllm-ascend (#233)
This PR adds custom AscendC kernel rotary_embedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also added in this PR.

Related: https://github.com/vllm-project/vllm-ascend/issues/156

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-03 14:52:34 +08:00
Shanshan Shen 14d9a64047
[ModelRunner][V1] Optimize V1 attention mask (#442)
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.

Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.

Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements.
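
A minimal sketch of the idea (the initial size and growth policy are
illustrative; the actual implementation and env var live in vllm-ascend):

```python
import torch

class AttentionMaskCache:
    """Pre-build a lower-triangular mask once and slice views from it."""

    def __init__(self, init_len: int = 2048):
        self._mask = torch.tril(torch.ones(init_len, init_len, dtype=torch.bool))

    def get(self, q_len: int, kv_len: int) -> torch.Tensor:
        needed = max(q_len, kv_len)
        if needed > self._mask.size(0):
            # the slow path the PR tries to avoid: rebuild a bigger mask on the fly
            self._mask = torch.tril(torch.ones(needed, needed, dtype=torch.bool))
        return self._mask[:q_len, :kv_len]
```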

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
2025-04-02 10:33:53 +08:00
hfadzxy 94bf9c379e
[Doc]Add developer guide for using lm-eval (#456)
### What this PR does / why we need it?
Add developer guide for using lm-eval

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
test manually

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-01 23:43:51 +08:00
dependabot[bot] 78083d405e
Bump actions/setup-python from 5.4.0 to 5.5.0 (#440)
Bumps [actions/setup-python](https://github.com/actions/setup-python)
from 5.4.0 to 5.5.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-01 14:34:33 +08:00
Mengqing Cao 2dbd763584
[CI] Fix mypy CI (#443)
### What this PR does / why we need it?
Fix CI by updating mypy and pinning the numpy version

_the modification of model_runner_v1 is just to make CI happy_

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-01 09:25:33 +08:00
Yikun Jiang c42e21a5aa
[Docs] Add install system dependencies in install doc (#438)
### What this PR does / why we need it?
Add install system dependencies in install doc

Resolve:
```
$ pip install vllm==v0.7.3
CMake Error at CMakeLists.txt:14 (project):
  No CMAKE_CXX_COMPILER could be found.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
// ... ...
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for vllm
Failed to build vllm
ERROR: Failed to build installable wheels for some pyproject.toml based projects (vllm)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/439 


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-31 14:17:55 +08:00
hfadzxy 7beb4339dc
[Doc]Add developer guide for using OpenCompass (#368)
### What this PR does / why we need it?
Add developer guide for using OpenCompass

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

test manually

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-31 00:24:25 +08:00
wangxiyuan b6499ed97d
[CI] Use CI pool (#428)
Use the CI pool instead of self-hosted runners for e2e tests to speed up CI.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-29 12:42:59 +08:00
wangxiyuan ca8b1c3e47
[Doc] Add 0.7.3rc2 release note (#419)
Add 0.7.3rc2 release note. We'll release 0.7.3rc2 right now.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-29 09:02:08 +08:00
wangxiyuan 31f29b9f30
[Core] Make V1 work and enable V1 engine test (#389)
1. Make sure the version is a string before parsing in collect_env
2. Add a basic V1 engine test

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-28 19:34:23 +08:00
wuhuikx 57a84bb7be
[Bug Fix] Fix bug of platform for parameter checking (#411)
Fix a bug in platform.py to avoid None values of config parameters.

Signed-off-by: wuhuikx <wuhui_csu@163.com>
2025-03-28 16:31:27 +08:00
Tony b1557abab6
fix multistep bug,remove uselesscodes (#355)
1. Remove useless code in attention.py
2. Multistep now uses StatefulModelInputForNPU and no longer uses
StatefulModelInput

Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
2025-03-28 09:55:35 +08:00
Yikun Jiang 1864c40520
Add vLLM Ascend Weekly meeting link (#400)
### What this PR does / why we need it?
Add vLLM Ascend Weekly meeting link

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-27 09:00:21 +08:00
Zhenyu Zheng 4804b74e95
Update 110-user-story.yml (#402)
Fix a few typos in issue template

Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>
2025-03-27 08:58:57 +08:00
Zhenyu Zheng 0b5a9643fd
Add an example for user stories (#399)
Add an example for user stories and fix some typo

Add a new section, user stories, to the docs to collect user stories of
vllm-ascend; also add an example and the issue template to collect user
stories.

Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>
2025-03-26 16:25:57 +08:00
BAI Fan 122505208f
FastPatch: Optimized Patch Embedding for Qwen2VL (#345)
### What this PR does / why we need it?
We propose the FastPatch method, which optimizes patch embedding
(Conv3D) for Qwen2VL.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We've tested it on the benchmark; it meets our satisfaction and is better
than the original patch_embed layer.


---------

Signed-off-by: baifanxxx <baifanxxx@gmail.com>
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
2025-03-26 14:28:20 +08:00
Mengqing Cao d4accf4ec2
[Doc][Model] update LLaVA 1.6 support (#373)
update LLaVA 1.6 support

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-26 09:07:55 +08:00
Mengqing Cao 6295d2e9bc
[CI/Build][Doc] upgrade torch-npu to 0320 (#392)
### What this PR does / why we need it?
This PR upgrades torch-npu to 0320, so that #321 and
https://github.com/vllm-project/vllm-ascend/issues/267#issuecomment-2745045743
can be fixed; #372 should be reverted after this PR.

### Does this PR introduce _any_ user-facing change?
upgrade torch-npu to 0320

### How was this patch tested?
tested locally with long seq inferencing.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-26 09:04:12 +08:00
Shanshan Shen 3fb3b5cf75
[Doc] Update model support doc (add QwQ-32B) (#388)
### What this PR does / why we need it?

Update model support doc (add QwQ-32B)


Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-03-25 11:40:50 +08:00
Mengqing Cao 8996733307
[CI] fix vllm test (#365)
fix vllm test

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-24 16:09:06 +08:00
Shanshan Shen 89ca63a2c2
[Bugfix] Disable torch.compile() (#370)
### What this PR does / why we need it?
To resolve this
[patch](https://github.com/vllm-project/vllm-ascend/pull/236/files#diff-43b96b39b5a52fe209d86449ad703a7ff5e1349ebaf1aa12ece8d82163ee5b61R24-R49),
we need to set the `torch.compile()` backend to `eager` to disable
compilation, falling back to the default PyTorch eager path.
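
A minimal illustration of what "backend set to eager" means (the wrapped
function is arbitrary):

```python
import torch

def scale(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0

# With backend="eager", torch.compile skips graph compilation and simply runs
# the function on the default PyTorch eager path.
scale_eager = torch.compile(scale, backend="eager")
print(scale_eager(torch.ones(4)))
```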


---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-03-21 15:55:51 +08:00
Li Wang 9a175ca0fc
[Doc]Add benchmark scripts (#74)
### What this PR does / why we need it?
The purpose of this PR is to add benchmark scripts for NPU, so developers
can easily run performance tests on their own machines with one line of
code.


---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-03-21 15:54:34 +08:00
wangxiyuan befbee5883
Update README and add collect_env info (#369)
1. Doc: fix a broken link
2. Doc: make the Chinese version the same as the English one
3. Remove the useless file `test.py`
4. Update `collect_env.py`
5. Fix a v1 import error

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-21 15:43:43 +08:00
Yikun Jiang 243ed4da69
Add vLLM forum info and update readme (#366)
### What this PR does / why we need it?
Add vLLM forum info and update readme

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-21 09:32:42 +08:00
Shanshan Shen c06af8b2e0
[V1][Core] Add support for V1 Engine (#295)
### What this PR does / why we need it?
Add support for V1 Engine.

Please note that this is just the initial version, and there may be some
places that need to be fixed or optimized in the future; feel free to leave
comments for us.

### Does this PR introduce _any_ user-facing change?

To use V1 Engine on NPU device, you need to set the env variable shown
below:

```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

If you are using vllm for offline inferencing, you must add a `__main__`
guard like:

```bash
if __name__ == '__main__':

    llm = vllm.LLM(...)
```

Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).

### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

Query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
2025-03-20 19:34:44 +08:00
wangxiyuan 663dca7578
[CI] fix race condition problem (#353)
fix race condition problem

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-19 17:04:36 +08:00
Shanshan Shen 441a62e937
[Doc] Fix bugs of installation doc and format tool (#330)
### What this PR does / why we need it?
Fix bugs of installation doc and format tool.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-03-14 10:21:35 +08:00
wangxiyuan ac1ba1d8d2
[Build] Fix x86 image build (#327)
Install cpu version of pytorch in x86 to reduce image size

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-14 09:41:57 +08:00
wangxiyuan c25631ec7b
[Doc] Add the release note for 0.7.3rc1 (#285)
Add the release note for 0.7.3rc1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-13 17:57:06 +08:00
Li Wang 41aba1cfc1
[Doc]Fix tutorial doc expression (#319)
Fix tutorial doc expression

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-03-13 15:24:05 +08:00
xiemingda 59ea23d0d3
[Doc] Add Single NPU (Qwen2.5-VL-7B) tutorial (#311)
Run vllm-ascend on Single NPU

### What this PR does / why we need it?
Add a vllm-ascend tutorial doc for the Qwen/Qwen2.5-VL-7B-Instruct model
(Inference/Serving doc).

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
no

Signed-off-by: xiemingda <xiemingda1002@gmail.com>
2025-03-12 20:37:12 +08:00
Angazenn 7330416de3
[BugFix] Fix bugs when using ascend quantization (#275)
### What this PR does / why we need it?
It fixes the following bugs:
1. When searching for a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear is required to
identify the corresponding quant type.
2. The exception is narrowed down to ImportError when importing
MindIETurboQuantizer, to better surface other errors.
3. The API of AscendKVCacheMethod.apply is aligned with that in
AscendAttentionBackendImpl.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/d63804cf-c060-451f-9cb0-d012e06b5333)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-03-12 11:33:21 +08:00
Mengqing Cao 5c7a95b01d
[Attn] Support encoder-only attention with torch sdpa (#290)
### What this PR does / why we need it?
Support encoder-only attention with torch SDPA.
Fixes
https://github.com/vllm-project/vllm-ascend/pull/229#issuecomment-2695942741
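
A small illustration of encoder-only (bidirectional) attention via torch SDPA;
the shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim); encoder-only attention applies no
# causal mask, unlike decoder self-attention.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```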

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Test locally with `pytest
vllm-project/vllm/tests/entrypoints/openai/test_score.py`
**Note**: Since torch compile on NPU is still a work in progress, we need
to comment out the following code to make the UT run:

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L138

result:
```bash
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/xxx/code/vllm-cpu/vllm
configfile: pyproject.toml
plugins: shard-0.1.2, rerunfailures-15.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 8 items                                                                                                                                                                        
Running 8 items in this shard

tests/entrypoints/openai/test_score.py ........                                                                                                                                    [100%]

==================================================================================== warnings summary ====================================================================================
../../../miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8
  /home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================== 8 passed, 1 warning in 131.42s (0:02:11) ========================================================================
```

This UT will be included in CI when the torch compile feature is done.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-12 08:57:29 +08:00
zouyida2002 12aa7115b5
bugfix for qwen2_vl (#301)
### What this PR does / why we need it?
This PR fixes the error encountered when running inference with Qwen2-VL.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
We've tested it with benchmarks; the results meet our expectations and match
GPU outputs.
---------

Signed-off-by: zouyida <zouyida@huawei.com>
2025-03-12 08:39:50 +08:00
wangxiyuan 9450e9811b
[CI] Uninstall triton in dockerfile (#298)
Triton doesn't work on Ascend, so we should make sure it's uninstalled in the
Dockerfile.


Related: https://github.com/vllm-project/vllm-ascend/issues/291

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-12 07:14:57 +08:00
yiz-liu 0db6670bfa
[Feature] Implement EP-compatible fused_moe (#121)
### What this PR does / why we need it?

Enable Expert-Parallel for ascend devices.

### Does this PR introduce _any_ user-facing change?

Enables EP.
Add `enable_expert_parallel=True` to your offline inference scripts,
like this:
```python
llm = LLM(
    model="/path/to/model",
    trust_remote_code=True,
    tensor_parallel_size=4,
    max_model_len=4096,
    enforce_eager=True,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
)
```

### How was this patch tested?

Please use the `main` branch of vLLM.

---------

Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
2025-03-11 21:08:02 +08:00
Tony 4c9d78a035
support multistep decode (#299)
Add multi-step scheduler support for vllm-ascend.

Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
2025-03-11 19:20:06 +08:00
whx feb6bdb12e
[Platform][Model Runner] Add hash of request_ids; Change blocksize back to 128. (#293)
This PR changes the initial value of blocksize back to 128 and adds a hash
of the request-id list in the model runner, used to implement a sampling-param
cache in the sampler.
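
A rough sketch of the caching idea (class and method names are assumptions,
not the actual model-runner code): hash the request-id list and rebuild the
sampling params only when the hash changes.

```python
class SamplingParamCache:
    """Illustrative only: reuse sampling params while the batch is unchanged."""

    def __init__(self):
        self._hash = None
        self._params = None

    def get(self, request_ids, build_params):
        # A cheap hash of the request-id list identifies the current batch.
        h = hash(tuple(request_ids))
        if h != self._hash:
            self._params = build_params()
            self._hash = h
        return self._params
```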

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-03-11 18:50:28 +08:00
Yikun Jiang 007aeaa48b
[Doc] Change distributed_executor_backend to mp (#287)
### What this PR does / why we need it?
Fix `ValueError: Unrecognized distributed executor backend tp. Supported
values are 'ray', 'mp' 'uni', 'external_launcher' or custom ExecutorBase
subclass.`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test on my local node

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-10 11:27:26 +08:00
Yikun Jiang 38334f5daa
[Docs] Re-arch on doc and make QwQ doc work (#271)
### What this PR does / why we need it?
Re-arch the tutorials; move single NPU / multi NPU / multi node to the index.
- Unify the docker run cmd
- Use a dropdown to hide the build-from-source installation doc
- Re-arch tutorials to include Qwen/QwQ/DeepSeek
- Make the QwQ doc work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test



Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-10 09:27:48 +08:00
Yikun Jiang 18bb8d1f52
Adapt vLLM requirements changes to fix main CI (#279)
### What this PR does / why we need it?
Adapt vLLM requirements changes:
206e2577fa (diff-01ec17406c969585ed075609a2bbf2f2f4fe3e3def36946694abe6d4eb60a6f2)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-09 16:07:45 +08:00
Yikun Jiang 268da28961
Pin modelscope<1.23.0 on vLLM v0.7.3 (#272)
### What this PR does / why we need it?
Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve:
https://github.com/vllm-project/vllm/pull/13807

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-09 15:59:42 +08:00
Yikun Jiang be58d5f3d8
Bump torch_npu version to dev20250308.3 (#276)
### What this PR does / why we need it?
Bump torch_npu version to dev20250308.3 to fix a performance regression in
the multi-stream case:
e04c580d07
.


### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-09 15:59:15 +08:00
Mengqing Cao 91f7d8115d
[CI/Build] Bump torch_npu to dev20250307.3 (#265)
Update the torch-npu version to fix torch_npu `exponential_` accuracy.
With this update, the precision issue when setting `temperature > 0` is
fixed.
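
For context, `exponential_` matters because temperature-based random sampling
is commonly implemented with the exponential (Gumbel-style) trick; a rough
sketch, not the vLLM sampler itself:

```python
import torch

# Sample a token proportional to softmax(logits / temperature):
# argmax(probs / q) with q ~ Exp(1) is equivalent to categorical sampling,
# so an inaccurate exponential_ directly skews the sampled distribution.
logits = torch.tensor([2.0, 1.0, 0.5])
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)
q = torch.empty_like(probs).exponential_(1.0)
token_id = torch.argmax(probs / q).item()
print(token_id)
```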

---------

Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-07 20:34:07 +08:00
zouyida2002 faf8cd89cb
register qwen2_vl to rewrite qwen2_vl forward (#241)
Add the Qwen2-VL Ascend implementation.

---------
Signed-off-by: zouyida <zouyida@huawei.com>
2025-03-07 15:41:47 +08:00
Yikun Jiang 35cb7b5234
[CI] Add dispatch job to leverage dynamic devices (#251)
### What this PR does / why we need it?
Add a dispatch job to route jobs to dynamic devices; it consists of the two
stages below.

The dispatch job will spend roughly an extra `10s * parallel number + 30s`
waiting for other jobs to launch their containers and release the lock.

- **Stage 1: Acquire lock**
Add a dispatch job; it uses `lockfile` to acquire a lock and then gets the
device number dynamically.
- **Stage 2.1: Launch container with dynamic device**
Pass the device number via an output and start the container job with the
dynamic device.
- **Stage 2.2: Release lock**
Once the job has started, release the lock.

In the backend, we use multiple paths to set up multiple self-hosted runners
as a load balancer:
```
$ pwd
/home/action
$ ll | grep actions
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-01
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-02
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-03
drwx------   6 action action 4096 Mar  7 08:56 actions-runner-04
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-06
```

```
adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually, triggered 3 jobs in parallel:
- [1st
job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297)
dispatch to /dev/davinci2.
- [2nd
job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250)
dispatch to /dev/davinci3
- [3rd
job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551)
dispatch to /dev/davinci4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-07 09:47:13 +08:00
Angazenn 3217f0d10f
[Feature] Modify description and api for ascend quantization (#243)
### What this PR does / why we need it?
1. It adds more description for classes in quant_config.py
2. It renames AscendQKVQuantAttentionMethod to AscendKVCacheMethod to
align with vLLM naming style.
3. It modifies the process when AscendLinearMethod or
AscendKVCacheMethod calls create_weights.


### Does this PR introduce _any_ user-facing change?
Yes. When creating weights, AscendLinearMethod now uses the get_weight,
get_pertensor_param and get_perchannel_param APIs from the linear quant
implementation, while AscendKVCacheMethod passes the layer into the linear
quant implementation.
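
A rough sketch of the interface shape this describes; the method names follow
the PR text, while the signatures and returned dicts are assumptions for
illustration:

```python
import torch


class DummyW8A8LinearImpl:
    """Illustrative linear quant implementation used by AscendLinearMethod."""

    def get_weight(self, input_size: int, output_size: int) -> dict:
        # Quantized weight placeholder.
        return {"weight": torch.empty(output_size, input_size, dtype=torch.int8)}

    def get_pertensor_param(self) -> dict:
        # Parameters shared across the whole tensor.
        return {"input_scale": torch.ones(1), "input_offset": torch.zeros(1)}

    def get_perchannel_param(self, output_size: int) -> dict:
        # Parameters with one value per output channel.
        return {"weight_scale": torch.ones(output_size),
                "weight_offset": torch.zeros(output_size)}
```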

### How was this patch tested?
By performing offline inference

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-03-06 15:17:25 +08:00
Yikun Jiang cff08f9df8
[Doc] Add initial FAQs (#247)
### What this PR does / why we need it?
Add initial FAQs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-06 10:42:42 +08:00
HongtaoYang dcd0005058
[Fix] Remove npu_group_topk before CANN version update (#242)
Remove npu_group_topk before CANN version update.

Signed-off-by: SidaoY <1024863041@qq.com>
2025-03-06 09:02:46 +08:00
whx 0d3463400a
[Performance] Change the shape of kv_cache to avoid view of k_cache and v_cache. (#204)
This PR changes the shape of the KV cache to avoid creating views of k_cache
and v_cache.
In addition, it caches the metadata of k_cache and v_cache to avoid
duplicate slice operations and improve performance.
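
An illustrative sketch of the layout trade-off (shapes are made up; this is
not the actual cache code): a combined tensor forces a view/slice on every
access, while separate K and V tensors do not, and their metadata can be
computed once and cached.

```python
import torch

num_blocks, block_size, num_heads, head_dim = 4, 16, 8, 64

# Combined layout: k_cache/v_cache must be sliced out (a view) on every step.
kv_cache = torch.zeros(2, num_blocks, block_size, num_heads, head_dim)
k_view, v_view = kv_cache[0], kv_cache[1]

# Separate layout: plain tensors, no per-step view needed.
k_cache = torch.zeros(num_blocks, block_size, num_heads, head_dim)
v_cache = torch.zeros(num_blocks, block_size, num_heads, head_dim)
```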

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
2025-03-05 10:51:07 +08:00
Shanshan Shen 562fa673e5
[Bugfix] Exclude collect_env.py from CODESPELL check in format.sh (#240)
### What this PR does / why we need it?
Exclude `collect_env.py` from the `CODESPELL` check in `format.sh`;
otherwise it reports the error shown below:

```bash
vLLM yapf: Done
vLLM mypy:
Running mypy on vllm_ascend
Success: no issues found in 18 source files
Running mypy on examples
Success: no issues found in 3 source files
Running mypy on tests
Success: no issues found in 3 source files
vLLM mypy: Done
collect_env.py:410: CANN ==> CAN
```

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-03-04 17:14:00 +08:00
Shanshan Shen 503f5045ff
[ModelRunner] Remove redundant profile_run() in model runner (#224)
### What this PR does / why we need it?
Remove redundant `profile_run()` in model runner.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-03-04 16:58:33 +08:00
wangxiyuan ae49bfd13a
[Core] Support pooling (#229)
This PR adds pooling support for vllm-ascend.

Tested with `bge-base-en-v1.5` by encode:
```
from vllm import LLM

# Sample prompts.
prompts = [
  "Hello, my name is",
  "The president of the United States is",
  "The capital of France is",
  "The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 4096 floats
```

Tested by embedding:
```
from vllm import LLM, SamplingParams

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

Related: https://github.com/vllm-project/vllm-ascend/issues/200

## Known issue
The accuracy is not correct since this feature relies on `enc-dec`
support. It'll be done in a following PR by @MengqingCao.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-03-04 15:59:34 +08:00
Shanshan Shen 8fda31cafe
[Doc] Update Feature Support doc (#234)
### What this PR does / why we need it?
Update Feature Support doc.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-03-04 14:18:32 +08:00
Shanshan Shen b9f0e25c16
[Misc] Add collect_env.py scripts for bug reporting (#175)
### What this PR does / why we need it?
Add the `collect_env.py` script from vLLM and remove the `nvidia`, `gpu`,
and `cuda` related code, so that users of vllm-ascend can collect their env
info when reporting bugs.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
Running `python collect_env.py` works.


Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-03-04 14:14:37 +08:00
Yikun Jiang 839dac8d60
Install wget to fix image build (#231)
### What this PR does / why we need it?

Install `wget` to fix image build

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-04 09:01:23 +08:00
Mengqing Cao b64ee7d346
[Dist] Set device as rank (#202)
### What this PR does / why we need it?
The rank returned by `torch.distributed.get_rank(device_group)` is the local
rank, but the rank (i.e. the rank in the process group (PG)) is expected.
Thus we change to use `torch.npu.current_device()` to set the device.

```python
    # difference between `local_rank` and `rank_in_group`:
    # if we have a group of size 4 across two nodes:
    # Process | Node | Rank | Local Rank | Rank in Group
    #   0     |   0  |  0   |     0      |       0
    #   1     |   0  |  1   |     1      |       1
    #   2     |   1  |  2   |     0      |       2
    #   3     |   1  |  3   |     1      |       3
```
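
A minimal sketch of the resulting device selection (assumes torch_npu is
installed and initialized; not the exact vllm-ascend code):

```python
import torch
import torch_npu  # noqa: F401  -- registers the torch.npu namespace

def current_npu_device() -> torch.device:
    # Use the runtime's current device index rather than the rank inside a
    # sub-group, which is not a reliable device index across nodes.
    index = torch.npu.current_device()
    return torch.device(f"npu:{index}")
```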

Tested by @wwfu109 with
`vllm/tests/distributed/test_customops::test_multi_process_tensor_parallel_pipeline_parallel`

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-03-03 09:23:13 +08:00
Yikun Jiang ebe14f20cf
Recover vllm-ascend dev image (#209)
### What this PR does / why we need it?
Recover vllm-ascend dev image

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-03 09:08:41 +08:00
Yikun Jiang 6e358c4bef
Add Document Branch Policy (#217)
### What this PR does / why we need it?
Add Document Branch Policy

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/214

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-03 09:07:39 +08:00
Yikun Jiang 46740958f2
Add ray to docker image (#197)
### What this PR does / why we need it?
Add ray to docker image to make `ray` work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-28 15:23:18 +08:00
dependabot[bot] 81dfaae88b
Bump docker/setup-buildx-action from 2 to 3 (#191)
Bumps
[docker/setup-buildx-action](https://github.com/docker/setup-buildx-action)
from 2 to 3.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-28 09:06:46 +08:00
dependabot[bot] a710a7563a
Bump docker/setup-qemu-action from 2 to 3 (#192)
Bumps
[docker/setup-qemu-action](https://github.com/docker/setup-qemu-action)
from 2 to 3.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-28 09:06:13 +08:00
dependabot[bot] a5564ed5d8
Bump actions/setup-python from 5.3.0 to 5.4.0 (#193)
Bumps [actions/setup-python](https://github.com/actions/setup-python)
from 5.3.0 to 5.4.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 20:05:15 +08:00
whx 14bca9911a
[CI] Fix unsolved bugs caused by pta api change. (#190)
This PR fixes some unresolved bugs caused by the pta API change.

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-02-27 19:52:28 +08:00
Yuanhao Ji 6aed83335c
[CI] Add dependabot support and labeler workflow (#162)
Add dependabot support and labeler workflow

---------

Signed-off-by: Yuanhao Ji <jiyuanhao@apache.org>
2025-02-27 19:46:31 +08:00
Mengqing Cao 03dc5c01fd
[Doc] update multinode doc (#181)
Update the multi-node doc.
Fixes #167 and #168.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-27 19:29:49 +08:00
HongtaoYang 1715230867
[CI] Upgrade to newest pta.(MLA and FusedMoE) (#189)
Upgrade to the newest pta (MLA and FusedMoE).

---------

Signed-off-by: SidaoY <1024863041@qq.com>
2025-02-27 18:50:52 +08:00
Li Wang c131e43e7d
[Worker]Lazy import torch_npu (#184)
### What this PR does / why we need it?
To avoid unnecessary delays, we only import torch_npu when profiling is
enabled.
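
A minimal sketch of the lazy-import pattern (illustrative only):

```python
def maybe_import_torch_npu(profiling_enabled: bool):
    """Pay the torch_npu import cost only when profiling is requested."""
    if not profiling_enabled:
        return None
    import torch_npu  # deferred import; assumes torch_npu is installed
    return torch_npu
```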

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-02-27 16:52:11 +08:00
wangxiyuan 6042c210bc
[CI] upgrade to newest pta (#187)
Upgrade to the newest torch-npu.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-02-27 16:40:23 +08:00
Mengqing Cao fd18ae6494
[MOE] fix #176 (#179)
Fix #176
We need to set `topk_group` and `num_expert_group` to `0` if they are
`None`, as sketched below.
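
The fix boils down to a small normalization step, sketched here with the
parameter names from the PR (not a specific function signature):

```python
def normalize_group_args(topk_group, num_expert_group):
    # Map None to 0 so the grouped top-k path receives integers.
    topk_group = 0 if topk_group is None else topk_group
    num_expert_group = 0 if num_expert_group is None else num_expert_group
    return topk_group, num_expert_group
```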

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-27 14:21:08 +08:00
Shanshan Shen ee43179767
[ModelRunner] Fix cuda hard code in model runner (#155)
### What this PR does / why we need it?
1. Fix CUDA hard-coding in the model runner.
2. Fix the tutorials doc rendering error.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-02-27 14:16:46 +08:00
zouyida2002 94cd66bba7
[CI][UT]enable multimodal ut (#158)
enable multimodal ut

---------

Signed-off-by: zouyida <zouyida@huawei.com>
2025-02-27 14:14:43 +08:00
Mengqing Cao 94483775e1
[CI] fix hf_token (#180)
Fix the bug introduced by #173

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-26 17:29:31 +08:00
Mengqing Cao 1c238b930d
[worker] remove unused assertion (#161)
### What this PR does / why we need it?
Remove unused assertion in `NPUWorker`, as this has been moved to
`Executor` in vLLM:

aabeb2688f/vllm/executor/uniproc_executor.py (L43)

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-26 16:11:36 +08:00
Mengqing Cao 78530c0667
[CI/Build] add HF_TOKEN for model downloading (#173)
### What this PR does / why we need it?
Add `HF_TOKEN` for downloading models that require access rights from the
Hugging Face Hub. This will fix the CI errors in #123 and #76.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-26 15:35:03 +08:00
Mengqing Cao 7776f2e6a4
[ModelRunner] remove padding for vlm inputs (#150)
### What this PR does / why we need it?
Remove padding for VLM inputs.
We don't need to pad inputs now; this padding would break the input
preparation of VLMs.

### Does this PR introduce _any_ user-facing change?
N/A

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-26 10:26:39 +08:00
Mengqing Cao 79fbb20b4d
[ModelRunner] remove unused args (follow vllm changes) (#159)
### What this PR does / why we need it?
The arg list of `Attention.forward()` is changed by
https://github.com/vllm-project/vllm/pull/13555.
The unused args `kv_caches` and `attn_metadata` are removed.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-25 17:51:09 +08:00
wangxiyuan 51ae37b22a
[Doc] update readme (#147)
Fix doc issue in README

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-02-25 11:00:58 +08:00
Mengqing Cao 3a7882208f
[CI] enable test if pytest.ini changes (#151)
enable test if pytest.ini changes

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-24 16:47:05 +08:00
Yaphets24 d0b3cb4fa7
modify:Eliminate redundant operations in the code to improve performance (#137)
### What this PR does / why we need it?
Eliminate redundant operations in the code to improve performance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
---------

Signed-off-by: Yaphets24 <d_mym0618@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-02-22 17:43:42 +08:00
326 changed files with 46100 additions and 3774 deletions

45
.github/Dockerfile.buildwheel vendored Normal file
View File

@ -0,0 +1,45 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.10
FROM quay.io/ascend/manylinux:8.0.0-910b-manylinux_2_28-py${PY_VERSION}
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
WORKDIR /workspace
COPY . /workspace/vllm-ascend/
# Install req
RUN python3 -m pip install -r vllm-ascend/requirements.txt --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip install twine
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
cd vllm-ascend && \
python3 setup.py bdist_wheel && \
ls -l dist
CMD ["/bin/bash"]

View File

@ -0,0 +1,37 @@
name: 📚 User Story
description: Apply for a user story to be displayed on https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html
title: "[User Story]: "
labels: ["user-story"]
body:
- type: textarea
attributes:
label: 📚 Title
description: >
A clear title about what your user story is about.
validations:
required: true
- type: textarea
attributes:
label: About / Introduction
description: >
A brief introduction about the background of your use case, like your scenario, hardware size etc.
- type: textarea
attributes:
label: Business Challenges
description: >
Tell us what kind of challenges you faced in this user story.
- type: textarea
attributes:
label: Solving challenges with vLLM Ascend and benefits
description: >
Tell us how vLLM Ascend helped you overcome the challenges, including details like how you use it, what version you used, hardware info, etc., and what kind of benefits you get from using vLLM Ascend.
- type: textarea
attributes:
label: Extra Info
description: >
Any extra information you want to include in this story
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!

View File

@ -14,9 +14,7 @@ body:
description: |
Please run the following and paste the output below.
```sh
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/collect_env.py
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```

View File

@ -0,0 +1,100 @@
name: Release Checklist
description: Generate a release checklist issue when preparing a new release. (Used by the release team.)
title: "[Release]: Release checklist for v"
body:
- type: textarea
attributes:
description: >
Brief info for the new release.
label: Release Checklist
value: >
**Release Version**:
**Release Branch**:
**Release Date**:
**Release Manager**:
- type: textarea
attributes:
description: >
Release notes.
label: Prepare Release Note
value: >
- [ ] Create a new issue for release feedback
- [ ] Write the release note PR.
- [ ] Update the feedback issue link in docs/source/faqs.md
- [ ] Add release note to docs/source/user_guide/release_notes.md
- [ ] Update version info in docs/source/community/versioning_policy.md
- [ ] Update contributor info in docs/source/community/contributors.md
- [ ] Update package version in docs/conf.py
- type: textarea
attributes:
description: >
Make sure the code is merged.
label: PR need Merge
value: >
- [ ] PR link1
- [ ] PR link2
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the new Feature/Function is tested
label: Functional Test
value: >
- [ ] Feature1
- [ ] Bug1
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the doc is updated.
label: Doc Test
value: >
- [ ] Tutorial is updated.
- [ ] User Guide is updated.
- [ ] Developer Guide is updated.
- type: textarea
attributes:
description: >
Make sure the artifacts are ready.
label: Prepare Artifacts
value: >
- [ ] Docker image is ready.
- [ ] Wheel package is ready.
- type: textarea
attributes:
description: >
Start to release.
label: Release Step
value: >
- [ ] Release note PR is merged.
- [ ] Post the release on GitHub release page.
- [ ] Generate official doc page on https://app.readthedocs.org/dashboard/
- [ ] Wait for the wheel package to be available on https://pypi.org/project/vllm-ascend
- [ ] Wait for the docker image to be available on https://quay.io/ascend/vllm-ascend
- [ ] Upload 310p wheel to Github release page
- [ ] Broadcast the release news (by message, blog, etc.)
- [ ] Close this issue

8
.github/actionlint.yaml vendored Normal file
View File

@ -0,0 +1,8 @@
self-hosted-runner:
# Labels of self-hosted runner in array of strings.
labels:
- linux-arm64-npu-1
- linux-arm64-npu-2
- linux-arm64-npu-4
- linux-arm64-npu-static-8
- ubuntu-24.04-arm

10
.github/dependabot.yml vendored Normal file
View File

@ -0,0 +1,10 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
# Check for updates to GitHub Actions every week
interval: "weekly"
open-pull-requests-limit: 2
reviewers:
- "Yikun"

59
.github/format_pr_body.sh vendored Executable file
View File

@ -0,0 +1,59 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh
#!/bin/bash
set -eux
# ensure 3 arguments are passed
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <pr_number> <vllm_version> <vllm_commit>"
exit 1
fi
PR_NUMBER=$1
VLLM_VERSION=$2
VLLM_COMMIT=$3
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
FINAL=/tmp/final_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove notes in pr description and add vLLM version and commit
sed -i '/<!--/,/-->/d' "${NEW}"
sed -i '/- vLLM .*$/d' "${NEW}"
{
echo ""
echo "- vLLM version: $VLLM_VERSION"
echo "- vLLM main: $VLLM_COMMIT"
} >> "${NEW}"
# Remove redundant empty lines
uniq "${NEW}" > "${FINAL}"
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${FINAL}"; then
echo
echo "Updating PR body:"
echo
cat "${NEW}"
gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
else
echo "No changes needed"
fi

38
.github/labeler.yml vendored Normal file
View File

@ -0,0 +1,38 @@
---
documentation:
- changed-files:
- any-glob-to-any-file:
- 'docs/**'
- '**/*.md'
ci/build:
- changed-files:
- any-glob-to-any-file:
- '.github/actions/*.yml'
- '.github/workflows/*.yml'
'module:tests':
- changed-files:
- any-glob-to-any-file:
- 'tests/**'
'module:tools':
- changed-files:
- any-glob-to-any-file:
- 'tools/**'
'module:ops':
- changed-files:
- any-glob-to-any-file:
- 'vllm_ascend/ops/**'
'module:quantization':
- changed-files:
- any-glob-to-any-file:
- 'vllm_ascend/quantization/**'
'module:core':
- changed-files:
- any-glob-to-any-file:
- 'vllm_ascend/*.py'

405
.github/workflows/accuracy_test.yaml vendored Normal file
View File

@ -0,0 +1,405 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This test will be triggered:
# 1. PR labeled with: '*accuracy-test' (ONLY 1 label valid) & 'ready-for-test'
# 2. workflow_dispatch with models input
# See detail rule in strategy.matrix note
name: Benchmarks / accuracy
on:
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
pull_request:
types: [ labeled ]
workflow_dispatch:
inputs:
vllm-version:
description: 'vllm version:'
required: true
type: choice
# Please also update this when bump matched version
# Current supported vLLM versions
options:
- main
- v0.9.2
- v0.9.1
- v0.7.3
vllm-ascend-version:
description: 'vllm-ascend version:'
required: true
type: choice
options:
- main
- v0.9.1-dev
- v0.7.3-dev
models:
description: 'model:'
required: true
type: choice
options:
- all
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-8B-Base
- Qwen/Qwen3-30B-A3B
default: 'all'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
accuracy_tests:
# test will be triggered when tag '*-accuracy-test' & 'ready-for-test' or workflow_dispatch job
if: >-
${{
(contains(github.event.pull_request.labels.*.name, 'accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test')) &&
contains(github.event.pull_request.labels.*.name, 'ready-for-test') ||
github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
}}
runs-on: >-
${{
(matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-arm64-npu-4') ||
'linux-arm64-npu-2'
}}
strategy:
matrix:
# the accuracy test will run:
# 1. workflow_dispatch with models input
# - all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - specified but not all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# 2. PR labeled with "*-accuracy-test"
# - accuracy-test: Qwen/Qwen3-8B-Base, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-30B-A3B
# - dense-accuracy-test: Qwen/Qwen3-8B-Base
# - vl-accuracy-test: Qwen/Qwen2.5-VL-7B-Instruct
# - moe-accuracy-test: Qwen/Qwen3-30B-A3B
model_name: ${{ fromJSON(
(github.event_name == 'schedule' &&
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'all' &&
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-30B-A3B' &&
'["Qwen/Qwen3-30B-A3B"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
'["Qwen/Qwen3-8B-Base"]') ||
contains(github.event.pull_request.labels.*.name, 'accuracy-test') &&
'["Qwen/Qwen3-8B-Base","Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen3-30B-A3B"]' ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test') &&
'["Qwen/Qwen3-8B-Base"]' ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]' ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') &&
'["Qwen/Qwen3-30B-A3B"]'
) }}
fail-fast: false
name: ${{ matrix.model_name }} accuracy
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
DATASET_SOURCE: ModelScope
VLLM_USE_MODELSCOPE: True
USE_MODELSCOPE_HUB: 1
# 1. If version specified (work_dispatch), do specified branch accuracy test
# 2. If no version (labeled PR), do accuracy test by default ref:
# The branch, tag or SHA to checkout. When checking out the repository that
# triggered a workflow, this defaults to the reference or SHA for that event.
# Otherwise, uses the default branch.
GHA_VLLM_ASCEND_VERSION: ${{ github.event.inputs.vllm-ascend-version }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
# Please also update this when bump matched version
ref: ${{ github.event.inputs.vllm-version || 'v0.9.2' }}
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: VLLM_TARGET_DEVICE=empty pip install -e .
- name: Resolve vllm-ascend version
run: |
VERSION_INPUT="${{ github.event.inputs.vllm-ascend-version }}"
if [[ "$VERSION_INPUT" == "main" ]]; then
TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
LATEST_TAG=$(echo "$TAGS" | head -n1)
if [[ -z "$LATEST_TAG" ]]; then
RESOLVED_VERSION="main"
else
RESOLVED_VERSION="$LATEST_TAG"
fi
else
RESOLVED_VERSION="$VERSION_INPUT"
fi
echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm-ascend
path: ./vllm-ascend
ref: ${{ env.GHA_VLLM_ASCEND_VERSION }}
- name: Install vllm-project/vllm-ascend
working-directory: ./vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Get vLLM commit hash and URL
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
- name: Get vLLM-Ascend commit hash and URL
working-directory: ./vllm-ascend
run: |
VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
- name: Print resolved hashes
run: |
echo "vLLM : ${{ env.VLLM_COMMIT }}"
echo "vLLM-Ascend: ${{ env.VLLM_ASCEND_COMMIT }}"
- name: Install lm-eval, ray, and datasets
run: |
pip install lm-eval==0.4.8
- name: Collect version info
run: |
for dir in /usr/local/Ascend/ascend-toolkit/*; do
dname=$(basename "$dir")
if [ "$dname" != "latest" ]; then
TOOLKIT_DIR="$dname"
break
fi
done
INFO_FILE="/usr/local/Ascend/ascend-toolkit/${TOOLKIT_DIR}/$(uname -i)-linux/ascend_toolkit_install.info"
GHA_CANN_VERSION=$(grep "version=" "$INFO_FILE" \
| head -n1 \
| cut -d'=' -f2 \
| tr -d '"')
{
echo "GHA_CANN_VERSION=$GHA_CANN_VERSION"
pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
} >> "$GITHUB_ENV"
- name: Print versions
run: |
echo "CANN: ${{ env.GHA_CANN_VERSION }}"
echo "Torch NPU: ${{ env.GHA_TORCH_NPU_VERSION }}"
echo "Torch: ${{ env.GHA_TORCH_VERSION }}"
echo "vLLM: ${{ env.GHA_VLLM_VERSION }}"
echo "vLLM Ascend: ${{ env.GHA_VLLM_ASCEND_VERSION }}"
- name: Run Accuracy Test
id: report
working-directory: ./benchmarks
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
run: |
model_base_name=$(basename ${{ matrix.model_name }})
markdown_name="${model_base_name}"
echo "markdown_name=$markdown_name"
echo "markdown_name=$markdown_name" >> $GITHUB_OUTPUT
mkdir -p ./accuracy
python ./scripts/run_accuracy.py \
--model "${{ matrix.model_name }}" \
--output "./accuracy/${markdown_name}.md" \
--vllm_ascend_version "${{ env.GHA_VLLM_ASCEND_VERSION || github.ref }}" \
--cann_version "${{ env.GHA_CANN_VERSION }}" \
--torch_npu_version "${{ env.GHA_TORCH_NPU_VERSION }}" \
--torch_version "${{ env.GHA_TORCH_VERSION }}" \
--vllm_version "${{ env.GHA_VLLM_VERSION }}" \
--vllm_commit "${{ env.VLLM_COMMIT }}" \
--vllm_ascend_commit "${{ env.VLLM_ASCEND_COMMIT }}" \
- name: Generate step summary
if: ${{ always() }}
run: |
cat ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md >> $GITHUB_STEP_SUMMARY
- name: Sanitize version string for artifact naming
run: |
SAFE_VLLM_ASCEND_VERSION="${GHA_VLLM_ASCEND_VERSION//\//-}"
echo "SAFE_VLLM_ASCEND_VERSION=$SAFE_VLLM_ASCEND_VERSION" >> "$GITHUB_ENV"
- name: Check report first line for failure
id: check_report
run: |
REPORT_PATH="./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md"
echo "Scanning $REPORT_PATH for ❌ …"
if grep -q '❌' "$REPORT_PATH"; then
echo "contains_fail=true" >> $GITHUB_OUTPUT
else
echo "contains_fail=false" >> $GITHUB_OUTPUT
fi
- name: Upload Report
if: ${{ github.event_name == 'workflow_dispatch' && steps.check_report.outputs.contains_fail == 'false' }}
uses: actions/upload-artifact@v4
with:
name: "report-${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}"
path: ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md
if-no-files-found: warn
retention-days: 90
overwrite: true
create_pr:
runs-on: ubuntu-latest
needs: accuracy_tests
if: ${{ github.event_name == 'workflow_dispatch' }}
env:
UPSTREAM_REPO: vllm-project/vllm-ascend
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: vllm-ascend-ci/vllm-ascend
token: ${{ secrets.PAT_TOKEN }}
ref: main
- name: Add upstream remote
run: |
git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
git fetch upstream
git remote -v
- name: Set Git user info dynamically
run: |
git config user.name "${{ github.actor }}"
git config user.email "${{ github.actor }}@users.noreply.github.com"
- name: Create or switch to branch
run: |
TIMESTAMP=$(date +%Y%m%d%H%M%S)
BRANCH_NAME="auto-pr/accuracy-report-${TIMESTAMP}"
echo "BRANCH_NAME=${BRANCH_NAME}" >> $GITHUB_ENV
git checkout -B "${BRANCH_NAME}" upstream/${{ github.event.inputs.vllm-ascend-version }}
- name: Download only current run reports
uses: actions/download-artifact@v4
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo ":::{toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: push accuracy report
env:
GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
run: |
git add ./docs/source/developer_guide/evaluation/accuracy_report/*.md
git commit -s -m "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}"
git push -f origin "${{ env.BRANCH_NAME }}"
- name: Create PR in upstream via API
uses: actions/github-script@v7
with:
github-token: ${{ secrets.PAT_TOKEN }}
script: |
const pr = await github.rest.pulls.create({
owner: 'vllm-project',
repo: 'vllm-ascend',
head: `vllm-ascend-ci:${{ env.BRANCH_NAME }}`,
base: '${{ github.event.inputs.vllm-ascend-version }}',
title: `[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}`,
body: `The accuracy results running on NPU Atlas A2 have changed, updating reports for:
${{
github.event.inputs.models == 'all'
&& 'All models (Qwen/Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)'
|| github.event.inputs.models
}}
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`
});
core.info(`Created PR #${pr.data.number}`);

View File

@ -1,59 +0,0 @@
#
# Adapted from vllm-project/vllm/blob/main/.github
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: Lint GitHub Actions workflows
on:
push:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Run actionlint"
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color

63
.github/workflows/format_pr_body.yaml vendored Normal file
View File

@ -0,0 +1,63 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: format / pr body
on:
# The PR updated when PR opened and push new commits
pull_request_target:
types: [opened, synchronize]
branches:
- 'main'
permissions:
pull-requests: write
jobs:
update-description:
name: update vLLM version
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Get vLLM version
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse HEAD)
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV
- name: Checkout repository
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python
uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
- name: Get vLLM release version
run: |
VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
echo "VLLM_VERSION=$VLLM_VERSION" >> $GITHUB_ENV
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
bash .github/format_pr_body.sh "${{ github.event.number }}" "${{ env.VLLM_VERSION }}" "${{ env.VLLM_COMMIT }}"

View File

@ -0,0 +1,117 @@
name: 'image / openEuler / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p-openeuler / vllm-ascend:*-dev-310p-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p-openeuler / vllm-ascend:v1.2.3rc1-310p-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-310p-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p-openeuler
# - pre/post/dev: v0.7.1rc1-310p-openeuler/v0.7.1rc1-310p-openeuler/v0.7.1rc1.dev1-310p-openeuler/v0.7.1.post1-310p-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p-openeuler
type=ref,event=pr,suffix=-310p-openeuler
type=pep440,pattern={{raw}},suffix=-310p-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.310p.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_310p_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p / vllm-ascend:*-dev-310p
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p / vllm-ascend:v1.2.3rc1-310p
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p
# - pre/post/dev: v0.7.1rc1-310p/v0.7.1rc1-310p/v0.7.1rc1.dev1-310p/v0.7.1.post1-310p, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p
type=ref,event=pr,suffix=-310p
type=pep440,pattern={{raw}},suffix=-310p
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.310p
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

117
.github/workflows/image_a3_openeuler.yml vendored Normal file
View File

@ -0,0 +1,117 @@
name: 'image / openEuler / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3-openeuler / vllm-ascend:v1.2.3rc1-a3-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3-openeuler
# - pre/post/dev: v0.7.1rc1-a3-openeuler/v0.7.1rc1-a3-openeuler/v0.7.1rc1.dev1-a3-openeuler/v0.7.1.post1-a3-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3-openeuler
type=ref,event=pr,suffix=-a3-openeuler
type=pep440,pattern={{raw}},suffix=-a3-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.a3.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_a3_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3|vllm-ascend:v1.2.3rc1-a3
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3 is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3
# - pre/post/dev: v0.7.1rc1-a3/v0.7.1rc1-a3/v0.7.1rc1.dev1-a3/v0.7.1.post1-a3, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3
type=ref,event=pr,suffix=-a3
type=pep440,pattern={{raw}},suffix=-a3
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.a3
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false
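
The tag rules documented in the comments above (branch refs, PR refs and pep440 `v*` tags, each with an `-a3` suffix) are implemented by docker/metadata-action. The bash sketch below only illustrates that documented mapping for a few ref shapes; it is not the action's actual logic, and the helper name is hypothetical.

```bash
# Illustration of the documented tag rules; the real mapping is done by
# docker/metadata-action, not by this script.
ref="${1:?usage: tag-for-ref.sh <git ref>}"   # hypothetical helper script
case "$ref" in
  refs/heads/*) tag="${ref#refs/heads/}-a3" ;;                    # main -> main-a3
  refs/pull/*)  pr="${ref#refs/pull/}"; tag="pr-${pr%%/*}-a3" ;;  # pr-123-a3 (build only, never pushed)
  refs/tags/v*) tag="${ref#refs/tags/}-a3" ;;                     # v0.7.1 -> v0.7.1-a3
  *) echo "unsupported ref: $ref" >&2; exit 1 ;;
esac
echo "quay.io/ascend/vllm-ascend:${tag}"
```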

.github/workflows/image_openeuler.yml vendored Normal file

@ -0,0 +1,116 @@
name: 'image / openEuler'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merged into main/*-dev ==> vllm-ascend:main-openeuler / vllm-ascend:*-dev-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler / vllm-ascend:v1.2.3rc1-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-openeuler
# - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-openeuler
type=ref,event=pr,suffix=-openeuler
type=pep440,pattern={{raw}},suffix=-openeuler
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false
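
Once a push-triggered run has published an image, `docker manifest inspect` is a quick way to confirm that both platforms listed above actually made it into the manifest. The tag below (`main-openeuler`, produced by a push to main) is only an example.

```bash
# Confirm which architectures a published tag contains.
docker manifest inspect quay.io/ascend/vllm-ascend:main-openeuler \
  | grep -E '"(architecture|os)"'
```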


@ -1,4 +1,4 @@
name: 'image'
name: 'image / Ubuntu'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
@ -9,16 +9,22 @@ name: 'image'
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3|latest / vllm-ascend:v1.2.3rc1
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image.yml'
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
@ -27,13 +33,13 @@ on:
tags:
- 'v*'
paths:
- '.github/workflows/image.yml'
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
@ -63,6 +69,8 @@ jobs:
type=ref,event=branch
type=ref,event=pr
type=pep440,pattern={{raw}}
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -71,31 +79,35 @@ jobs:
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v2
# TODO(yikun): remove this after https://github.com/docker/setup-qemu-action/issues/198 resolved
with:
image: tonistiigi/binfmt:qemu-v7.0.0-28
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' }}
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile
# only trigger when tag, branch/main push
push: ${{ github.event_name != 'pull_request' }}
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
PIP_INDEX_URL=https://pypi.org/simple
provenance: false
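
The diff above replaces the pinned tonistiigi/binfmt workaround with plain docker/setup-qemu-action@v3. When reproducing the cross-architecture build on a local amd64 machine, the same emulation can be registered by hand; the sketch below uses the upstream binfmt image's standard invocation and an example local tag.

```bash
# Register QEMU binfmt handlers so an amd64 host can build the arm64 variant
# (roughly what docker/setup-qemu-action does inside the runner).
docker run --privileged --rm tonistiigi/binfmt --install all
# A cross-arch build of the base image then works locally:
docker buildx build --platform linux/arm64 -f Dockerfile -t vllm-ascend:dev-arm64 .
```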


@ -0,0 +1,21 @@
name: "Merge Conflict Labeler"
on:
# So that PRs touching the same files as the push are updated
push:
# So that the `dirtyLabel` is removed if conflicts are resolved
# We recommend `pull_request_target` so that github secrets are available.
# In `pull_request` we wouldn't be able to change labels of fork PRs
pull_request_target:
types: [synchronize]
jobs:
main:
runs-on: ubuntu-latest
steps:
- name: check if prs are dirty
uses: eps1lon/actions-label-merge-conflict@v3
with:
dirtyLabel: "merge-conflicts"
removeOnDirtyLabel: "ready"
repoToken: "${{ secrets.GITHUB_TOKEN }}"
commentOnDirty: "This pull request has conflicts, please resolve those before we can evaluate the pull request."

.github/workflows/labeler.yml vendored Normal file

@ -0,0 +1,18 @@
name: Pull Request Labeler
on: pull_request_target
jobs:
label:
name: Label
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- name: Label the PR
uses: actions/labeler@v5
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
configuration-path: .github/labeler.yml
sync-labels: true


@ -1,78 +0,0 @@
#
# Adapted from vllm-project/vllm/blob/main/.github
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: mypy
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- 'main'
- '*-dev'
paths:
- '**/*.py'
- '.github/workflows/mypy.yaml'
- 'tools/mypy.sh'
- 'mypy.ini'
pull_request:
branches:
- 'main'
- '*-dev'
# This workflow is only relevant when one of the following files changes.
# However, we have github configured to expect and require this workflow
# to run and pass before github with auto-merge a pull request. Until github
# allows more flexible auto-merge policy, we can just run this on every PR.
# It doesn't take that long to run, anyway.
paths:
- '**/*.py'
- '.github/workflows/mypy.yaml'
- 'tools/mypy.sh'
- 'mypy.ini'
jobs:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
pip install -r requirements-dev.txt
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: vllm-empty
- name: Install vllm-project/vllm from source
working-directory: vllm-empty
run: |
pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
- name: Mypy
run: |
echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1 ${{ matrix.python-version }}


@ -0,0 +1,207 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: 'Benchmarks / Performance'
# This workflow runs nightly benchmarks for vllm-ascend.
on:
schedule:
# Run benchmarks at 20:00 and 03:00 Beijing time (UTC+8)
- cron: "0 12 * * *"
- cron: "0 19 * * *"
workflow_dispatch:
# Allow manual triggering of the workflow
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only 1 job can runs on static-8-01-cards
concurrency:
group: static-8-01-cards
cancel-in-progress: false
jobs:
test:
if: ${{ contains(github.event.pull_request.labels.*.name, 'performance-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}, use_v1=${{ matrix.vllm_use_v1 }}
runs-on: 'linux-arm64-npu-static-8'
strategy:
matrix:
include:
- vllm_branch: v0.9.2
vllm_ascend_branch: main
vllm_use_v1: 1
max-parallel: 1
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
- /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
# Use self-host cache speed up pip and model download
- /home/action/.cache:/github/home/.cache/
options: >-
--device /dev/davinci0
--device /dev/davinci1
--device /dev/davinci_manager
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
VLLM_USE_MODELSCOPE: True
ES_OM_DOMAIN: ${{ secrets.ES_OM_DOMAIN }}
ES_OM_AUTHORIZATION: ${{ secrets.ES_OM_AUTHORIZATION }}
VLLM_USE_V1: ${{ matrix.vllm_use_v1 }}
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
- name: Install system dependencies
run: |
apt-get update -y
apt-get -y install git jq wget curl lsof gcc g++ cmake libnuma-dev
- name: Config git
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
ref: ${{ matrix.vllm_branch }}
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install "transformers<=4.52.4"
pip install -e .
pip install -r benchmarks/requirements-bench.txt
- name: Run current commit benchmarks
if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
run: |
# Sometimes we only want to run benchmarks on the current commit
# This is useful for debugging or a release benchmark
bash benchmarks/scripts/run-performance-benchmarks.sh
# Convert the benchmark results to markdown format
python3 benchmarks/scripts/convert_json_to_markdown.py
- name: Generate step summary
if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
run: |
cat ./benchmarks/results/benchmark_results.md >> $GITHUB_STEP_SUMMARY
- name: Upload benchmark artifacts
if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
uses: actions/upload-artifact@v4
with:
name: "benchmark-performance-${{ matrix.vllm_branch }}-${{ matrix.vllm_ascend_branch }}-report"
path: ./benchmarks/results/benchmark_results.md
if-no-files-found: warn
retention-days: 90
overwrite: true
- name: Install elastic_tool
if: github.event_name != 'pull_request'
run: |
pip install escli-tool==0.2.3
- name: Collect pr info from vllm-project/vllm-ascend
if: github.event_name != 'pull_request'
run: |
# Only get the pull request which may influences performance
git log --pretty=format:"%H %s" -- '**/*.py' ':!docs/*' ':!tests/*' ':!examples/*' ':!benchmarks/*' > commit_log.txt
escli check commit_log.txt
- name: Prepare benchmark script in advance
if: github.event_name != 'pull_request'
# This is for the benchmark iteration, which will change the benchmark scripts while checking out each commit.
# We need to ensure the benchmark scripts are always available.
run: |
# Prepare the benchmark script in advance
mkdir -p /github/home/benchmarks
cp -r benchmarks/* /github/home/benchmarks/
- name: Run benchmark iteration
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
if: github.event_name != 'pull_request'
run: |
while IFS= read -r line || [[ -n "$line" ]]; do
commit_id=${line%% *}
commit_title=${line#* }
git checkout $commit_id
commit_time=$(git show -s --format=%cd $commit_id --date=iso-strict)
commit_time_no_tz=${commit_time::19}
pip install -e .
echo "------------------------"
echo "commit_id: $commit_id"
echo "commit_title: $commit_title"
echo "commit_time: $commit_time_no_tz"
echo "vllm branch: ${{ matrix.vllm_branch }}"
echo "vllm-ascend branch: ${{ matrix.vllm_ascend_branch }}"
echo "------------------------"
cd /github/home
ERROR_MSG=""
if ! bash benchmarks/scripts/run-performance-benchmarks.sh; then
ERROR_MSG="Benchmark failed to run"
fi
# send the result to es
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results \
--error "$ERROR_MSG" \
--extra_feat '{"VLLM_USE_V1": "${{ matrix.vllm_use_v1 }}"}'
rm -rf ./benchmarks/results
cd -
done < commit_log.txt
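
The job above iterates over a whole commit log. For a one-off local run, the two scripts it calls can be invoked directly; paths are taken from the workflow and the markdown report is written to `./benchmarks/results`.

```bash
# One-off local benchmark run, mirroring the "Run current commit benchmarks" step.
pip install -r benchmarks/requirements-bench.txt
bash benchmarks/scripts/run-performance-benchmarks.sh
# Convert the raw JSON results into the markdown report CI uploads as an artifact.
python3 benchmarks/scripts/convert_json_to_markdown.py
cat ./benchmarks/results/benchmark_results.md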

.github/workflows/pre-commit.yml vendored Normal file

@ -0,0 +1,37 @@
name: pre-commit
on:
workflow_call:
permissions:
contents: read
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: "3.10"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Install vllm
working-directory: vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
- name: Install vllm-ascend dev
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
env:
SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
with:
extra_args: --all-files --hook-stage manual
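
The same hook run can be reproduced locally once vLLM and `requirements-dev.txt` are installed as in the steps above; a minimal sketch:

```bash
# Reproduce the CI lint run locally (assumes vllm and requirements-dev.txt
# are already installed, as in the workflow steps above).
pip install pre-commit
export SHELLCHECK_OPTS="--exclude=SC2046,SC2006,SC2086"
pre-commit run --all-files --hook-stage manual
```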

.github/workflows/release_code.yml vendored Normal file

@ -0,0 +1,75 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: build / sdist
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/release_code.yml'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
tags:
- 'v*'
jobs:
build:
name: release code
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Print
run: |
lscpu
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python3 -m pip install twine setuptools_scm
- name: Generate tar.gz
run: |
python3 setup.py sdist
ls dist
- name: Archive tar.gz
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-src
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')
run: |
python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
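
To build and sanity-check the source distribution locally before a tag is pushed, the same commands can be run by hand; `twine check` is an extra metadata check that is not part of the workflow itself.

```bash
# Build the sdist exactly as the job does, then check its metadata locally.
python3 -m pip install twine setuptools_scm
python3 setup.py sdist
ls dist
python3 -m twine check dist/*   # local sanity check only; upload happens on tags in CI
```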

.github/workflows/release_whl.yml vendored Normal file

@ -0,0 +1,118 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: build / wheel
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/release_whl.yml'
- '.github/Dockerfile.buildwheel'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
tags:
- 'v*'
jobs:
build:
name: build and release wheel
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
# PRs only trigger the latest version
python-version: ${{ fromJSON(
(github.event_name == 'pull_request' && '["3.11"]') ||
'["3.9", "3.10", "3.11"]'
) }}
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Print
run: |
lscpu
- name: Build wheel
run: |
ls
docker build -f ./.github/Dockerfile.buildwheel \
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel:v1 .
docker run --rm \
-u $(id -u):$(id -g) \
-v $(pwd):/outpwd \
wheel:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Set up Python ${{ matrix.python-version }}
if: startsWith(github.ref, 'refs/tags/')
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')
run: |
python3 -m pip install twine
python3 -m twine upload --verbose dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
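
A hedged local smoke test for a repaired wheel is sketched below; the exact wheel filename depends on the platform and Python version, and importing `vllm_ascend` additionally requires the Ascend toolkit and torch-npu on the host.

```bash
# Inspect and smoke-test a repaired wheel (filenames vary by platform/python;
# the import only succeeds on a host with the Ascend toolkit and torch-npu).
for whl in dist/*.whl; do
  auditwheel show "$whl"
done
python3 -m pip install dist/*.whl
python3 -c "import vllm_ascend"
```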


@ -1,59 +0,0 @@
#
# Adapted from vllm-project/vllm/blob/main/.github
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: ruff
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- 'main'
- '*-dev'
paths:
- "**/*.py"
- requirements-lint.txt
- .github/workflows/matchers/ruff.json
- .github/workflows/ruff.yml
pull_request:
branches:
- 'main'
- '*-dev'
jobs:
ruff:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Analysing the code with ruff
run: |
echo "::add-matcher::.github/workflows/matchers/ruff.json"
ruff check --output-format github .
- name: Run isort
run: |
isort . --check-only


@ -1,56 +0,0 @@
#
# Adapted from vllm-project/vllm/blob/main/.github
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: Lint shell scripts
on:
push:
branches:
- 'main'
- '*-dev'
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Check shell scripts"
run: |
tools/shellcheck.sh


@ -0,0 +1,87 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e test / doctest'
on:
workflow_dispatch:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
# If we are changing the doctest we should do a PR test
- '.github/workflows/vllm_ascend_doctest.yaml'
- 'tests/e2e/doctests/**'
- 'tests/e2e/common.sh'
- 'tests/e2e/run_doctests.sh'
schedule:
# Runs every 12 hours
- cron: '0 */12 * * *'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
jobs:
test:
strategy:
# Each version should be tested
fail-fast: false
matrix:
vllm_verison: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
name: vLLM Ascend test
runs-on: linux-arm64-npu-1
container:
image: m.daocloud.io/quay.io/ascend/vllm-ascend:${{ matrix.vllm_verison }}
steps:
- name: Check NPU/CANN and git info
run: |
echo "====> Print NPU/CANN info"
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
echo "====> Print vllm-ascend git info"
cd /vllm-workspace/vllm-ascend
git --no-pager log -1 || true
echo "====> Print vllm git info"
cd /vllm-workspace/vllm
git --no-pager log -1 || true
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Run vllm-ascend/tests/e2e/run_doctests.sh
run: |
# PWD: /__w/vllm-ascend/vllm-ascend
# Make sure e2e tests are latest
echo "Replacing /vllm-workspace/vllm-ascend/tests/e2e ..."
rm -rf /vllm-workspace/vllm-ascend/tests/e2e
mkdir -p /vllm-workspace/vllm-ascend/tests
# Overwrite e2e and examples
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
cp -r examples /vllm-workspace/vllm-ascend/
# Simulate container to enter directory
cd /workspace
# Run real test
echo "Test:"
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
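
Outside the prebuilt container, the same entry point can be run straight from a vllm-ascend checkout, assuming vLLM and vllm-ascend are already installed in the environment:

```bash
# Run the doctest suite from a local checkout
# (assumes vllm and vllm-ascend are already installed).
bash tests/e2e/run_doctests.sh
```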


@ -1,6 +1,5 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -13,29 +12,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e test'
name: 'test'
on:
push:
branches:
- 'main'
- '*-dev'
paths:
- '*.txt'
- '**/*.py'
- '.github/workflows/vllm_ascend_test.yaml'
- '!docs/**'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '*.txt'
- '**/*.py'
- '.github/workflows/vllm_ascend_test.yaml'
- '!docs/**'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -44,26 +33,119 @@ defaults:
run:
shell: bash -el {0}
jobs:
test:
name: vLLM Ascend test (self-host)
runs-on: ascend-arm64 # actionlint-ignore: runner-label
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint:
uses: ./.github/workflows/pre-commit.yml
changes:
runs-on: ubuntu-latest
outputs:
e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
e2e_tracker:
- '.github/workflows/vllm_ascend_test.yaml'
- 'vllm_ascend/**'
- 'csrc/**'
- 'cmake/**'
- 'tests/e2e/**'
- 'CMakeLists.txt'
- 'setup.py'
- 'requirements.txt'
- 'requirements-dev.txt'
- 'requirements-lint.txt'
- 'packages.txt'
ut_tracker:
- 'tests/ut/**'
ut:
needs: [lint, changes]
name: unit test
# only trigger unit test after lint passed and the change is e2e and ut related.
if: ${{ needs.lint.result == 'success' && (needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.ut_tracker == 'true') }}
runs-on: ubuntu-latest
container:
image: quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
- /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
# Use self-host cache speed up pip and model download
- /home/action/actions-runner/_work/cache:/github/home/.cache/
options: >-
--device /dev/davinci6
--device /dev/davinci_manager
--device /dev/devmm_svm
--device /dev/hisi_hdc
image: quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
strategy:
matrix:
vllm_version: [main, v0.9.2]
steps:
- name: Install packages
run: |
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty python3 -m pip install . --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip uninstall -y triton
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install vllm-project/vllm-ascend
run: |
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
python3 -m pip install -r requirements-dev.txt --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip install -v . --extra-index https://download.pytorch.org/whl/cpu/
- name: Run unit test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
TORCH_DEVICE_BACKEND_AUTOLOAD: 0
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut
- name: Upload coverage to Codecov
if: ${{ matrix.vllm_version == 'main' }}
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
flags: unittests
name: vllm-ascend
verbose: true
e2e:
needs: [lint, changes]
# only trigger e2e test after lint passed and the change is e2e related with pull request.
if: ${{ github.event_name == 'pull_request' && needs.lint.result == 'success' && needs.changes.outputs.e2e_tracker == 'true' }}
strategy:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1]
vllm_version: [main, v0.9.2]
name: singlecard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -72,25 +154,25 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get update -y
apt-get -y install `cat packages.txt`
- name: Install dependencies
run: |
pip install -r requirements-dev.txt
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
@ -99,23 +181,106 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -e .
pip install -r requirements-dev.txt
pip install -v -e .
- name: Install pta
- name: Run e2e test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
mkdir pta
cd pta
wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
tar -xvf pytorch_v2.5.1_py310.tar.gz
pip install ./torch_npu-2.5.1.dev20250218-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cd ..
rm -rf pta
pytest -sv tests/e2e/singlecard/test_offline_inference.py
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
pytest -sv tests/e2e/singlecard/test_camem.py
pytest -sv tests/e2e/singlecard/test_embedding.py
pytest -sv tests/e2e/singlecard/ \
--ignore=tests/e2e/singlecard/test_offline_inference.py \
--ignore=tests/e2e/singlecard/test_ilama_lora.py \
--ignore=tests/e2e/singlecard/test_guided_decoding.py \
--ignore=tests/e2e/singlecard/test_camem.py \
--ignore=tests/e2e/singlecard/test_embedding.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
# ------------------------------------ v1 spec decode test ------------------------------------ #
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
# TODO: revert me when test_v1_spec_decode.py::test_ngram_correctness is fixed
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
e2e-4-cards:
needs: [e2e]
if: ${{ needs.e2e.result == 'success' }}
strategy:
max-parallel: 1
matrix:
os: [linux-arm64-npu-4]
vllm_version: [main, v0.9.2]
name: multicard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
pytest -sv tests
- name: Run vllm-project/vllm test
run: |
pytest -sv
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
# Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py will raise error.
# To avoid oom, we need to run the test in a single process.
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W8A8
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_dbo
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeekV3_dbo
pytest -sv tests/e2e/multicard/test_data_parallel.py
pytest -sv tests/e2e/multicard/ --ignore=tests/e2e/multicard/test_ilama_lora_tp2.py \
--ignore=tests/e2e/multicard/test_offline_inference_distributed.py \
--ignore=tests/e2e/multicard/test_data_parallel.py
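
Part of the singlecard split above can be reproduced on a dev machine with a single NPU; the sketch below runs one test file, with the environment variables taken from the workflow.

```bash
# Minimal local reproduction of part of the singlecard e2e job
# (requires one Ascend NPU plus vllm and vllm-ascend installed from source).
export VLLM_USE_MODELSCOPE=True
export VLLM_WORKER_MULTIPROC_METHOD=spawn
pytest -sv tests/e2e/singlecard/test_offline_inference.py
```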


@ -0,0 +1,103 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: 'e2e test / long-term-test'
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
long-term-test:
# long-term-test will be triggered when tag 'long-term-test' & 'ready-for-test' or schedule job
if: ${{ contains(github.event.pull_request.labels.*.name, 'long-term-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
strategy:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1, linux-arm64-npu-4]
vllm_version: [main, v0.9.2]
name: vLLM Ascend long term test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend long term test
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
pytest -sv tests/e2e/long_term/accuracy/accuracy_singlecard.py
else
# accuracy test multi card
pytest -sv tests/e2e/long_term/accuracy/accuracy_multicard.py
fi


@ -0,0 +1,112 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: 'e2e test / pd-disaggregation'
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only 1 job can runs on static-8-01-cards
concurrency:
group: static-8-01-cards
cancel-in-progress: false
jobs:
prefilling-decoding-disaggregation:
# pd-test will be triggered when tag 'pd-test' & 'ready-for-test' or schedule job
if: ${{ contains(github.event.pull_request.labels.*.name, 'pd-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
strategy:
matrix:
vllm_verison: [
# revert me when V1 disaggregation prefill is merged in main
# main,
v0.9.1
]
name: vLLM Ascend prefilling decoding disaggregation test
runs-on: linux-arm64-npu-static-8
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
- /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
# Use self-host cache speed up pip and model download
- /home/action/.cache:/github/home/.cache/
options: >-
--device /dev/davinci0
--device /dev/davinci1
--device /dev/davinci_manager
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_verison }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend PD Disaggregation test
run: |
pytest -sv tests/e2e/pd_disaggreate/test_pd_e2e.py


@ -1,57 +0,0 @@
#
# Adapted from vllm-project/vllm/blob/main/.github
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: yapf
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- 'main'
- '*-dev'
paths:
- "**/*.py"
- .github/workflows/yapf.yml
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- "**/*.py"
- .github/workflows/yapf.yml
jobs:
yapf:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install toml
pip install yapf==0.32.0
- name: Running yapf
run: |
yapf --diff --recursive .

.gitignore vendored

@ -196,3 +196,9 @@ kernel_meta/
# version file generated by setuptools-scm
/vllm_ascend/_version.py
# build info file generated by setup.py
/vllm_ascend/_build_info.py
/vllm_ascend/include/
# generated by CANN
fusion_result.json

.pre-commit-config.yaml Normal file

@ -0,0 +1,141 @@
default_install_hook_types:
- pre-commit
- commit-msg
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'examples/.*' # Exclude examples from all hooks by default
repos:
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
- id: codespell
args: [
--toml, pyproject.toml,
'--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
'-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn'
]
additional_dependencies:
- tomli
- repo: https://github.com/google/yapf
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
# Keep the same list from yapfignore here to avoid yapf failing without any inputs
exclude: '(.github|benchmarks|examples|docs)/.*'
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.7
hooks:
- id: ruff
args: [--output-format, github, --fix]
- id: ruff-format
files: ^(benchmarks|examples)/.*
- repo: https://github.com/crate-ci/typos
rev: v1.32.0
hooks:
- id: typos
- repo: https://github.com/PyCQA/isort
rev: 6.0.1
hooks:
- id: isort
# - repo: https://github.com/pre-commit/mirrors-clang-format
# rev: v20.1.3
# hooks:
# - id: clang-format
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
# - repo: https://github.com/jackdewinter/pymarkdown
# rev: v0.9.29
# hooks:
# - id: pymarkdown
# args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
- id: actionlint
- repo: local
hooks:
# For local development, you can run mypy using tools/mypy.sh script if needed.
# - id: mypy-local
# name: Run mypy for local Python installation
# entry: tools/mypy.sh 0 "local"
# language: system
# types: [python]
# stages: [pre-commit] # Don't run in CI
- id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.9
entry: tools/mypy.sh 1 "3.9"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.10
entry: tools/mypy.sh 1 "3.10"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.11
entry: tools/mypy.sh 1 "3.11"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.12
entry: tools/mypy.sh 1 "3.12"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
# FIXME: enable shellcheck
# - id: shellcheck
# name: Lint shell scripts
# entry: tools/shellcheck.sh
# language: script
# types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" "$(git rev-parse --git-path COMMIT_EDITMSG)"; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> "$(git rev-parse --git-path COMMIT_EDITMSG)"
fi
language: system
verbose: true
stages: [commit-msg]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
args:
- -c
- 'git ls-files | grep " " && echo "Filenames should not contain spaces!" && exit 1 || exit 0'
language: system
always_run: true
pass_filenames: false
- id: enforce-import-regex-instead-of-re
name: Enforce import regex as re
entry: python tools/enforce_regex_import.py
language: python
types: [python]
pass_filenames: false
additional_dependencies: [regex]
# Keep `suggestion` last
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
# Insert new entries above the `suggestion` entry
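
The `signoff-commit` hook above appends a `Signed-off-by` trailer when it is missing; the trailer can also be added directly with git, which avoids relying on the hook:

```bash
# Add the Signed-off-by trailer yourself instead of relying on the commit-msg hook.
git commit -s -m "your change summary"
# Or amend the previous commit if the trailer was forgotten:
git commit --amend --signoff --no-edit
```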

CMakeLists.txt Normal file

@ -0,0 +1,98 @@
cmake_minimum_required(VERSION 3.16)
project(vllm_ascend_C)
# include(CheckCXXcompilerFlag)
# check_cxx_compiler_flag("-std=c++17", COMPILER_SUPPORTS_CXX17)
set(CMAKE_CXX_STANDARD 17)
include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
# Suppress potential warnings about unused manually-specified variables
set(ignoreMe "${VLLM_PYTHON_PATH}")
# TODO: Add 3.12 back when torch-npu support 3.12
set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11")
find_package(pybind11 REQUIRED)
append_cmake_prefix_path("torch" "torch.utils.cmake_prefix_path")
set(VLLM_ASCEND_INSTALL_PATH "${CMAKE_INSTALL_PREFIX}")
find_package(Torch REQUIRED)
set(RUN_MODE "npu" CACHE STRING "cpu/sim/npu")
set(SOC_VERSION ${SOC_VERSION})
message(STATUS "Detected SOC version: ${SOC_VERSION}")
if (NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE "Release" CACHE STRINGS "Build type Release/Debug (default Release)" FORCE)
endif()
if (CMAKE_INSTALL_PREFIX STREQUAL /usr/local)
set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_LIST_DIR}/out" CACHE STRINGS "path to install()")
endif()
set(ASCEND_CANN_PACKAGE_PATH ${ASCEND_HOME_PATH})
if(EXISTS ${ASCEND_HOME_PATH}/tools/tikcpp/ascendc_kernel_cmake)
set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/tools/tikcpp/ascendc_kernel_cmake)
elseif(EXISTS ${ASCEND_HOME_PATH}/compiler/tikcpp/ascendc_kernel_cmake)
set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/compiler/tikcpp/ascendc_kernel_cmake)
elseif(EXISTS ${ASCEND_HOME_PATH}/ascendc_devkit/tikcpp/samples/cmake)
set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/ascendc_devkit/tikcpp/samples/cmake)
else()
message(FATAL_ERROR "ascendc_kernel_cmake does not exist, please check whether the cann package is installed.")
endif()
include(${ASCENDC_CMAKE_DIR}/ascendc.cmake)
file(GLOB KERNEL_FILES
${CMAKE_CURRENT_SOURCE_DIR}/csrc/kernels/*.cpp)
ascendc_library(vllm_ascend_kernels SHARED
${KERNEL_FILES}
)
message("TORCH_NPU_PATH is ${TORCH_NPU_PATH}")
file(GLOB VLLM_ASCEND_SRC
${CMAKE_CURRENT_SOURCE_DIR}/csrc/*.cpp)
include_directories(
${pybind11_INCLUDE_DIRS}
${PYTHON_INCLUDE_PATH}
${TORCH_INCLUDE_DIRS}
${TORCH_NPU_PATH}/include
${ASCEND_HOME_PATH}/include
${ASCEND_HOME_PATH}/aarch64-linux/include/experiment/platform
${ASCEND_HOME_PATH}/x86_64-linux/include/experiment/platform
)
set(
INCLUDES
${TORCH_INCLUDE_DIRS}
${TORCH_NPU_INCLUDE_DIRS}
${ASCEND_HOME_PATH}/include
${ASCEND_HOME_PATH}/aarch64-linux/include/experiment/platform
)
pybind11_add_module(vllm_ascend_C ${VLLM_ASCEND_SRC})
target_link_directories(
vllm_ascend_C
PRIVATE
${TORCH_NPU_PATH}/lib/
${ASCEND_HOME_PATH}/lib64
)
target_link_libraries(
vllm_ascend_C
PUBLIC
${TORCH_LIBRARIES}
libtorch_npu.so
vllm_ascend_kernels
ascendcl
platform
)
target_link_options(vllm_ascend_C PRIVATE "-Wl,-rpath,$ORIGIN:$ORIGIN/lib")
install(TARGETS vllm_ascend_C vllm_ascend_kernels DESTINATION ${VLLM_ASCEND_INSTALL_PATH})
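
In normal use `setup.py` drives this CMake project and injects variables such as `TORCH_NPU_PATH`. The manual configure below is only an illustrative sketch: it assumes `ASCEND_HOME_PATH` is exported by the CANN environment script, and the `SOC_VERSION` value is an example that must match the actual device.

```bash
# Illustrative manual configure/build; in practice setup.py drives CMake and
# supplies additional variables (e.g. TORCH_NPU_PATH) not shown here.
source /usr/local/Ascend/ascend-toolkit/set_env.sh   # assumed to export ASCEND_HOME_PATH
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DSOC_VERSION=ASCEND310P3 \
  -DASCEND_HOME_PATH="${ASCEND_HOME_PATH}"
cmake --build build -j"$(nproc)"
```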

CONTRIBUTING.md Normal file

@ -0,0 +1,3 @@
# Contributing to vLLM Ascend
You may find information about contributing to vLLM Ascend on [Developer Guide - Contributing](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html), including step-by-step guide to help you setup development environment, contribute first PR and test locally.


@ -1,6 +1,5 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -13,35 +12,49 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
FROM quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /workspace/vllm-ascend/
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM main
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 $VLLM_REPO /workspace/vllm
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install /workspace/vllm/
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend main
RUN python3 -m pip install /workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope
RUN python3 -m pip install modelscope
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.310p Normal file

@ -0,0 +1,61 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# On x86, triton is installed by vllm, but it doesn't work correctly on Ascend, so we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.310p.openEuler Normal file

@ -0,0 +1,58 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# On x86, triton is installed by vllm, but it doesn't work correctly on Ascend, so we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3 Normal file

@ -0,0 +1,60 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# On x86, triton is installed by vllm, but it doesn't work correctly on Ascend, so we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3.openEuler Normal file

@ -0,0 +1,57 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# On x86, triton is installed by vllm, but it doesn't work correctly on Ascend, so we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.openEuler Normal file

@ -0,0 +1,57 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# On x86, triton is installed by vllm, but it doesn't work correctly on Ascend, so we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

README.md

@ -10,7 +10,7 @@ vLLM Ascend Plugin
</h3>
<p align="center">
| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack (#sig-ascend)</b></a> |
| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>#sig-ascend</b></a> | <a href="https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support"><b>Users Forum</b></a> | <a href="https://tinyurl.com/vllm-ascend-meeting"><b>Weekly Meeting</b></a> |
</p>
<p align="center">
@ -20,79 +20,69 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! Every contribution deserves to be recorded; thanks to all contributors.
- [2025/05] We've released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] vLLM community officially created [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
---
## Overview
vLLM Ascend plugin (`vllm-ascend`) is a backend plugin for running vLLM on the Ascend NPU.
vLLM Ascend (`vllm-ascend`) is a community-maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.
This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
It is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
By using the vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts, embedding, and multi-modal LLMs, can run seamlessly on the Ascend NPU.
## Prerequisites
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series
- OS: Linux
- Software:
* Python >= 3.9
* CANN >= 8.0.RC2
* PyTorch >= 2.4.0, torch-npu >= 2.4.0
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (the same version as vllm-ascend)
Find out how to set up your environment step by step [here](docs/source/installation.md).
## Getting Started
> [!NOTE]
> Currently, we are actively collaborating with the vLLM community to support the Ascend backend plugin. Once supported, you can complete the installation with a single command: `pip install vllm vllm-ascend`.
Please use the following recommended versions to get started quickly:
Installation from source code:
```bash
# Install the vllm main branch according to:
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#build-wheel-from-source
git clone --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
VLLM_TARGET_DEVICE=empty pip install .
# Install vllm-ascend main branch
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```
Run the following command to start the vLLM server with the [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
```bash
# export VLLM_USE_MODELSCOPE=true to speed up download
vllm serve Qwen/Qwen2.5-0.5B-Instruct
curl http://localhost:8000/v1/models
```
Please refer to [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1|Latest release candidate|[QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1|Latest stable version|[QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
## Contributing
See [CONTRIBUTING](docs/source/developer_guide/contributing.md) for more details, which is a step-by-step guide to help you set up development environment, build and test.
See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) for more details, which is a step-by-step guide to help you set up development environment, build and test.
We welcome and value any contributions and collaborations:
- Please feel free to comment [here](https://github.com/vllm-project/vllm-ascend/issues/19) about your usage of vLLM Ascend Plugin.
- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues)
- Please use [User forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.
## Branch
vllm-ascend has a main branch and development branches.
- **main**: the main branch, which corresponds to the vLLM main branch and is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branch, created alongside some new vLLM releases. For example, `v0.7.1-dev` is the dev branch for the vLLM `v0.7.1` version.
- **vX.Y.Z-dev**: development branch, created alongside some new vLLM releases. For example, `v0.7.3-dev` is the dev branch for the vLLM `v0.7.3` version.
Below are the maintained branches:
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.x branch |
| v0.7.1-dev | Unmaintained | Only doc fixed is allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version; only bug fixes are allowed and no new release tags will be published. |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
Please refer to [Versioning policy](docs/source/developer_guide/versioning_policy.md) for more details.
Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Weekly Meeting
- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
- Wednesday, 15:00 - 16:00 (UTC+8, [Convert to your timezone](https://dateful.com/convert/gmt8?t=15))
## License

README.zh.md

@ -10,7 +10,7 @@ vLLM Ascend Plugin
</h3>
<p align="center">
| <a href="https://www.hiascend.com/en/"><b>关于昇腾</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>官方文档</b></a> | <a href="https://slack.vllm.ai"><b>开发者 Slack (#sig-ascend)</b></a> |
| <a href="https://www.hiascend.com/en/"><b>关于昇腾</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>官方文档</b></a> | <a href="https://slack.vllm.ai"><b>#sig-ascend</b></a> | <a href="https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support"><b>用户论坛</b></a> | <a href="https://tinyurl.com/vllm-ascend-meeting"><b>社区例会</b></a> |
</p>
<p align="center">
@ -20,11 +20,16 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] The [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It showcases cases such as LLaMA-Factory/verl/TRL/GPUStack, demonstrating how vLLM Ascend helps Ascend users improve their experience in fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] The [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! Every contribution deserves to be recorded; thanks to all contributors.
- [2025/05] We released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/CGDuMoB301Uytnrkc2oyjg) with the vLLM team! You can find the slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
---
## Overview
The vLLM Ascend plugin (`vllm-ascend`) is a backend plugin for running vLLM seamlessly on the Ascend NPU.
The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained backend plugin for running vLLM seamlessly on the Ascend NPU.
This plugin is the recommended way to support the Ascend backend within the vLLM community. It adheres to the principles outlined in [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing support for the Ascend NPU in vLLM in a decoupled manner.
@ -33,67 +38,50 @@ The vLLM Ascend plugin (`vllm-ascend`) is a backend plugin for running vLLM seamlessly on the Ascend NPU
## Prerequisites
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series
- OS: Linux
- Software:
* Python >= 3.9
* CANN >= 8.0.RC2
* PyTorch >= 2.4.0, torch-npu >= 2.4.0
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (same version as vllm-ascend)
See [here](docs/source/installation.md) for step-by-step instructions on preparing your environment.
## Getting Started
> [!NOTE]
> Currently, we are actively collaborating with the vLLM community to support the Ascend backend plugin. Once supported, you can complete the installation with a single command: `pip install vllm vllm-ascend`.
We recommend the following versions to get started quickly:
Install from source:
```bash
# Install the vllm main branch. Reference documentation:
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#build-wheel-from-source
git clone --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
VLLM_TARGET_DEVICE=empty pip install .
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1| Latest release candidate |See [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1| Latest stable version |See [QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
# Install the vllm-ascend main branch
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```
## Contributing
See the [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) guide for more information on setting up the development environment, testing, and PR submission conventions.
Run the following command to start the vLLM server with the [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
```bash
# Set the environment variable VLLM_USE_MODELSCOPE=true to speed up download
vllm serve Qwen/Qwen2.5-0.5B-Instruct
curl http://localhost:8000/v1/models
```
See [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
## Branch
We welcome and value any form of contribution and collaboration:
- Please let us know about any bugs you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
- Please use the [Users Forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.
## Branch Policy
vllm-ascend has a main branch and development branches.
- **main**: the main branch, which corresponds to the vLLM main branch and is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branch, created alongside some new vLLM releases. For example, `v0.7.1-dev` is the vllm-ascend development branch for the vLLM `v0.7.1` version.
- **vX.Y.Z-dev**: development branch, created alongside some new vLLM releases. For example, `v0.7.3-dev` is the vllm-ascend development branch for the vLLM `v0.7.3` version.
The maintained branches are listed below:
| Branch | Status | Note |
|------------|------------|---------------------|
| main | Maintained | CI commitment for vLLM main branch |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version; only bug fixes are allowed and no new release tags will be published |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
See the [Versioning policy](docs/source/developer_guide/versioning_policy.zh.md) for more details.
See the [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Contributing
See [CONTRIBUTING](docs/source/developer_guide/contributing.zh.md) for more details; it is a step-by-step guide to help you set up the development environment, build, and test.
## Weekly Meeting
We welcome and value any form of contribution and collaboration:
- You can share feedback about your experience [here](https://github.com/vllm-project/vllm-ascend/issues/19).
- Please let us know about any bugs you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
- vLLM Ascend Weekly Community Meeting: https://tinyurl.com/vllm-ascend-meeting
- Wednesdays, 15:00 - 16:00 (UTC+8, [convert to your timezone](https://dateful.com/convert/gmt8?t=15))
## License
Apache License 2.0, as shown in the [LICENSE](./LICENSE) file.
Apache License 2.0, as shown in the [LICENSE](./LICENSE) file.

benchmarks/README.md Normal file

@ -0,0 +1,166 @@
# Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas 800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
- Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
- Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
**Benchmarking Duration**: about 800 seconds per model.
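The QPS values above translate into request arrival times via a seeded Poisson process. As a minimal illustrative sketch (not part of the benchmark scripts; the helper name and the choice of seed are assumptions), the arrival schedule for a given average QPS could be generated like this:
```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative send times (seconds) for num_requests requests at an average rate of qps.

    With qps == inf, every request is sent at t = 0 (the stress-test case)."""
    if np.isinf(qps):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible arrival pattern
    # A Poisson process has exponentially distributed inter-arrival gaps.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

print(poisson_arrival_times(5, qps=4))             # five increasing send times
print(poisson_arrival_times(3, qps=float("inf")))  # [0. 0. 0.]
```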
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:
```
pip install -r benchmarks/requirements-bench.txt
```
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it constructs random weights based on the given model without downloading them from the internet, which can greatly reduce the benchmark time.
- If you want to customize the benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files. Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"trust_remote_code": "",
"max_model_len": 16384
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
}
]
```
This JSON is parsed into server parameters and client parameters by the benchmark script. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more details on the parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- disable_log_stats: disables logging of performance statistics.
- disable_log_requests: disables logging of individual requests.
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Dataset Source: Hugging Face (hf)
- Dataset Split: train
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Number of Prompts: 200 (the total number of prompts used during the test)
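To make the parsing step concrete, here is a rough Python sketch of how such a config entry can be flattened into CLI flags, mirroring what the `json2args` helper in the benchmark shell script does. The file path and the exact handling of empty values are illustrative assumptions, not the script's actual implementation:
```python
import json

def params_to_args(params: dict) -> str:
    """Turn {"tensor_parallel_size": 1, "trust_remote_code": ""} into
    "--tensor-parallel-size 1 --trust-remote-code": underscores become dashes
    and empty values become bare flags."""
    parts = []
    for key, value in params.items():
        parts.append(f"--{key.replace('_', '-')} {value}".strip())
    return " ".join(parts)

# Assumed location of the serving test cases shown above.
with open("benchmarks/tests/serving-tests.json") as f:
    case = json.load(f)[0]

print("server args:", params_to_args(case["server_parameters"]))
print("client args:", params_to_args(case["client_parameters"]))
```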
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
```
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
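As a quick-analysis sketch (assuming the default `benchmarks/results` output folder and the serving-result keys used by the report script, such as `request_throughput` and `median_ttft_ms`; adjust the keys if your vLLM version emits different ones), the serving JSONs can be summarized with pandas:
```python
import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path("benchmarks/results").glob("serving_*.json"):
    data = json.loads(path.read_text())
    rows.append({
        "test": path.stem,                       # e.g. serving_qwen2_5_7B_tp1_qps_4
        "tput (req/s)": data.get("request_throughput"),
        "ttft (ms)": data.get("median_ttft_ms"),
        "itl (ms)": data.get("median_itl_ms"),
    })

print(pd.DataFrame(rows).to_markdown(index=False))
```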
### Use benchmark cli
For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:
```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```
2. Run performance tests using the CLI:
```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
--hf-split train --endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 200 \
--request-rate 16
```
#### Offline
- **Throughput**
```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```
- **Latency**
```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
--load-format dummy --num-iters-warmup 5 --num-iters 15
```


@ -0,0 +1,158 @@
from typing import Tuple
import numpy as np
import pytest
import torch
import torch_npu # noqa: F401
import vllm # noqa: F401
import vllm_ascend.platform # noqa: F401
def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
"""
Benchmark function for NPU operations
Args:
fn: Function to benchmark
num_iterations: Number of timing iterations
num_warmup_iterations: Number of warmup iterations
Returns:
float: Minimum elapsed time in seconds
"""
start = torch.npu.Event(enable_timing=True)
end = torch.npu.Event(enable_timing=True)
times = np.zeros(num_iterations + num_warmup_iterations)
# Run iterations
for i in range(num_warmup_iterations + num_iterations):
with torch.no_grad():
start.record()
fn() # Execute the function
end.record()
torch.npu.synchronize()
times[i] = start.elapsed_time(end)
# Remove warmup iterations and convert to seconds
times = times[num_warmup_iterations:]
elapsed_time = np.amin(times) / 1000
return elapsed_time
def get_masked_input_and_mask_ref(
input_: torch.Tensor,
org_vocab_start_index: int,
org_vocab_end_index: int,
num_org_vocab_padding: int,
added_vocab_start_index: int,
added_vocab_end_index: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Reference implementation for verification"""
org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
added_vocab_mask = (input_ >= added_vocab_start_index) & (
input_ < added_vocab_end_index
)
added_offset = (
added_vocab_start_index
- (org_vocab_end_index - org_vocab_start_index)
- num_org_vocab_padding
)
valid_offset = (org_vocab_start_index * org_vocab_mask) + (
added_offset * added_vocab_mask
)
vocab_mask = org_vocab_mask | added_vocab_mask
masked_input = vocab_mask * (input_ - valid_offset)
return masked_input, ~vocab_mask
DTYPES = [torch.int32]
SHAPES = [(3, 4, 5)]
DEVICES = [f"npu:{0}"]
SEEDS = [0]
@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("device", DEVICES)
@pytest.mark.parametrize("seed", SEEDS)
@torch.inference_mode()
def test_get_masked_input_and_mask(
shape: Tuple[int, ...],
dtype: torch.dtype,
device: str,
seed: int,
) -> None:
# Set random seed and device
torch.manual_seed(seed)
torch.set_default_device(device)
# Generate random input tensor
input_tensor = torch.randint(0, 1000, shape, dtype=dtype)
# Test parameters
test_case = {
"org_start": 100,
"org_end": 200,
"padding": 0,
"added_start": 300,
"added_end": 400,
}
# Define reference function
def ref_fn():
return get_masked_input_and_mask_ref(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Define custom function
def custom_fn():
return torch.ops._C.get_masked_input_and_mask(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Get results for correctness testing
ref_masked_input, ref_mask = ref_fn()
custom_masked_input, custom_mask = custom_fn()
# Benchmark both implementations
ref_time = benchmark_npu(ref_fn)
custom_time = benchmark_npu(custom_fn)
# Print performance results
print("\nPerformance Results:")
print(f"Reference implementation: {ref_time * 1000:.3f} ms")
print(f"Custom implementation: {custom_time * 1000:.3f} ms")
print(f"Speedup: {ref_time / custom_time:.2f}x")
# Compare results for correctness
ref_masked_input = ref_masked_input.to(dtype)
print("\nResults comparison:")
print("custom_masked_input:", custom_masked_input)
print("ref_masked_input:", ref_masked_input)
print("custom_mask:", custom_mask)
print("ref_mask:", ref_mask)
torch.testing.assert_close(
custom_masked_input,
ref_masked_input,
rtol=1e-5,
atol=1e-5,
msg=f"Masked input mismatch for case: {test_case}",
)
torch.testing.assert_close(
custom_mask,
ref_mask,
rtol=1e-5,
atol=1e-5,
msg=f"Mask mismatch for case: {test_case}",
)


@ -0,0 +1,4 @@
pandas
datasets
modelscope
tabulate


@ -0,0 +1,188 @@
import argparse
import json
import os
from pathlib import Path
import pandas as pd
from tabulate import tabulate
CUR_PATH = Path(__file__).parent.resolve()
# latency results and the keys that will be printed into markdown
latency_results = []
latency_column_mapping = {
"test_name": "Test name",
"avg_latency": "Mean latency (ms)",
"P50": "Median latency (ms)",
"P99": "P99 latency (ms)",
}
# throughput tests and the keys that will be printed into markdown
throughput_results = []
throughput_results_column_mapping = {
"test_name": "Test name",
"num_requests": "Num of reqs",
"total_num_tokens": "Total num of tokens",
"elapsed_time": "Elapsed time (s)",
"requests_per_second": "Tput (req/s)",
"tokens_per_second": "Tput (tok/s)",
}
# serving results and the keys that will be printed into markdown
serving_results = []
serving_column_mapping = {
"test_name": "Test name",
"request_rate": "Request rate (req/s)",
"request_throughput": "Tput (req/s)",
"output_throughput": "Output Tput (tok/s)",
"median_ttft_ms": "TTFT (ms)",
"median_tpot_ms": "TPOT (ms)",
"median_itl_ms": "ITL (ms)",
}
def read_markdown(file):
if os.path.exists(file):
with open(file) as f:
return f.read() + "\n"
else:
return f"{file} not found.\n"
def results_to_json(latency, throughput, serving):
return json.dumps(
{
"latency": latency.to_dict(),
"throughput": throughput.to_dict(),
"serving": serving.to_dict(),
}
)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process the results of the benchmark tests."
)
parser.add_argument(
"--results_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--output_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--markdown_template",
type=str,
default="./perf_result_template.md",
help="The template file for the markdown report.",
)
parser.add_argument(
"--tag", default="main", help="Tag to be used for release message."
)
parser.add_argument(
"--commit_id", default="", help="Commit ID to be used for release message."
)
args = parser.parse_args()
results_folder = (CUR_PATH / args.results_folder).resolve()
output_folder = (CUR_PATH / args.output_folder).resolve()
markdown_template = (CUR_PATH / args.markdown_template).resolve()
# collect results
for test_file in results_folder.glob("*.json"):
with open(test_file) as f:
raw_result = json.loads(f.read())
if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`
# update the test name of this result
raw_result.update({"test_name": test_file.stem})
# add the result to raw_result
serving_results.append(raw_result)
continue
elif "latency" in f.name:
# this result is generated via `benchmark_latency.py`
# update the test name of this result
raw_result.update({"test_name": test_file.stem})
# get different percentiles
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000
# add the result to raw_result
latency_results.append(raw_result)
continue
elif "throughput" in f.name:
# this result is generated via `benchmark_throughput.py`
# update the test name of this result
raw_result.update({"test_name": test_file.stem})
# add the result to raw_result
throughput_results.append(raw_result)
continue
print(f"Skipping {test_file}")
serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))
latency_results = pd.DataFrame.from_dict(latency_results)
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)
raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
if not serving_results.empty:
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
if not throughput_results.empty:
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)
processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# get markdown tables
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)
# document the result
print(output_folder)
with open(output_folder / "benchmark_results.md", "w") as f:
results = read_markdown(markdown_template)
results = results.format(
latency_tests_markdown_table=latency_md_table,
throughput_tests_markdown_table=throughput_md_table,
serving_tests_markdown_table=serving_md_table,
benchmarking_results_in_json_string=processed_results_json,
)
f.write(results)


@ -0,0 +1,31 @@
## Online serving tests
- Input length: randomly sample 200 prompts from [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main)(multi-modal) dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), TPOT (median time per output token).
{serving_tests_markdown_table}
## Offline tests
### Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: end-to-end latency.
{latency_tests_markdown_table}
### Throughput tests
- Input length: randomly sample 200 prompts from [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main)(multi-modal) dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput.
{throughput_tests_markdown_table}


@ -0,0 +1,321 @@
#!/bin/bash
set -e
check_npus() {
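# Count the NPUs reported by npu-smi, abort if none are found, and read the NPU type.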
# shellcheck disable=SC2155
declare -g npu_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | tr -d ' ')
if [[ -z "$npu_count" || "$npu_count" -eq 0 ]]; then
echo "Need at least 1 NPU to run benchmarking."
exit 1
else
echo "found NPU conut: $npu_count"
fi
npu_type=$(npu-smi info | grep -E "^\| [0-9]+" | awk -F '|' '{print $2}' | awk '{$1=$1;print}' | awk '{print $2}')
echo "NPU type is: $npu_type"
}
ensure_sharegpt_downloaded() {
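# Download the ShareGPT dataset from hf-mirror into the local cache if it is not already there.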
local FILE="/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
local DIR
DIR=$(dirname "$FILE")
if [ ! -f "$FILE" ]; then
echo "$FILE not found, downloading from hf-mirror ..."
mkdir -p "$DIR"
wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
if [ $? -ne 0 ]; then
echo "Download failed!" >&2
return 1
fi
echo "Download completed and saved to $FILE"
else
echo "$FILE already exists."
fi
}
json2args() {
# transforms the JSON string to command line args, and '_' is replaced to '-'
# example:
# input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
# output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
local json_string=$1
local args
args=$(
echo "$json_string" | jq -r '
to_entries |
map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
join(" ")
'
)
echo "$args"
}
wait_for_server() {
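# Poll the local vllm /health endpoint for up to 20 minutes and return once the server responds.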
local waited=0
local timeout_sec=1200
while (( waited < timeout_sec )); do
if curl -s -X GET localhost:8000/health > /dev/null; then
return 0
fi
echo "Waiting for vllm server to start..."
sleep 1
((waited++))
done
echo "Timeout waiting for server"
return 1
}
get_cur_npu_id() {
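# Print the ID of the first NPU listed by npu-smi.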
npu-smi info -l | awk -F ':' '/NPU ID/ {print $2+0; exit}'
}
kill_npu_processes() {
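# Free port 8000, stop any leftover python workers, and clear the local vllm config cache between test cases.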
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pgrep python3 | xargs -r kill -9
sleep 4
rm -rf ~/.config/vllm
}
update_json_field() {
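# Overwrite a single top-level field in a JSON result file in place ($1: file, $2: field name, $3: value, stored as a string).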
local json_file="$1"
local field_name="$2"
local field_value="$3"
jq --arg value "$field_value" \
--arg key "$field_name" \
'.[$key] = $value' "$json_file" > "${json_file}.tmp" && \
mv "${json_file}.tmp" "$json_file"
}
run_latency_tests() {
# run latency tests using `benchmark_latency.py`
# $1: a json file specifying latency test cases
local latency_test_file
latency_test_file=$1
# Iterate over latency tests
jq -c '.[]' "$latency_test_file" | while read -r params; do
# get the test name, and append the NPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
if [[ ! "$test_name" =~ ^latency_ ]]; then
echo "In latency-test.json, test_name must start with \"latency_\"."
exit 1
fi
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# get arguments
latency_params=$(echo "$params" | jq -r '.parameters')
latency_args=$(json2args "$latency_params")
latency_command="vllm bench latency \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"
echo "Running test case $test_name"
echo "Latency command: $latency_command"
# run the benchmark
eval "$latency_command"
# echo model_name to result file
model_name=$(echo "$latency_params" | jq -r '.model')
update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
kill_npu_processes
done
}
run_throughput_tests() {
# run throughput tests using `benchmark_throughput.py`
# $1: a json file specifying throughput test cases
local throughput_test_file
throughput_test_file=$1
# Iterate over throughput tests
jq -c '.[]' "$throughput_test_file" | while read -r params; do
# get the test name, and append the NPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
if [[ ! "$test_name" =~ ^throughput_ ]]; then
echo "In throughput-test.json, test_name must start with \"throughput_\"."
exit 1
fi
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# get arguments
throughput_params=$(echo "$params" | jq -r '.parameters')
throughput_args=$(json2args "$throughput_params")
throughput_command="vllm bench throughput \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"
echo "Running test case $test_name"
echo "Throughput command: $throughput_command"
# run the benchmark
eval "$throughput_command"
# echo model_name to result file
model_name=$(echo "$throughput_params" | jq -r '.model')
update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
kill_npu_processes
done
}
run_serving_tests() {
# run serving tests using `benchmark_serving.py`
# $1: a json file specifying serving test cases
local serving_test_file
serving_test_file=$1
# Iterate over serving tests
jq -c '.[]' "$serving_test_file" | while read -r params; do
# get the test name, and append the NPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
if [[ ! "$test_name" =~ ^serving_ ]]; then
echo "In serving-test.json, test_name must start with \"serving_\"."
exit 1
fi
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# get client and server arguments
server_params=$(echo "$params" | jq -r '.server_parameters')
client_params=$(echo "$params" | jq -r '.client_parameters')
server_args=$(json2args "$server_params")
client_args=$(json2args "$client_params")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"
# check if server model and client model is aligned
server_model=$(echo "$server_params" | jq -r '.model')
client_model=$(echo "$client_params" | jq -r '.model')
if [[ $server_model != "$client_model" ]]; then
echo "Server model and client model must be the same. Skip testcase $test_name."
continue
fi
server_command="python3 \
-m vllm.entrypoints.openai.api_server \
$server_args"
# run the server
echo "Running test case $test_name"
echo "Server command: $server_command"
bash -c "$server_command" &
server_pid=$!
# wait until the server is alive
if wait_for_server; then
echo ""
echo "vllm server is up and running."
else
echo ""
echo "vllm failed to start within the timeout period."
fi
# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps="inf"
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
client_command="vllm bench serve \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
--request-rate $qps \
$client_args"
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
bash -c "$client_command"
done
# clean up
kill -9 $server_pid
kill_npu_processes
done
}
cleanup() {
rm -rf ./vllm_benchmarks
}
cleanup_on_error() {
echo "An error occurred. Cleaning up results folder..."
rm -rf $RESULTS_FOLDER
}
main() {
START_TIME=$(date +%s)
check_npus
# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
(which lsof) || (apt-get update && apt-get install -y lsof)
# get the current IP address, required by benchmark_serving.py
# shellcheck disable=SC2155
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# turn off the reporting of the status of each request, to clean up the terminal output
export VLLM_LOG_LEVEL="WARNING"
# set env
export VLLM_USE_MODELSCOPE=True
# prepare for benchmarking
cd benchmarks || exit 1
trap cleanup EXIT
QUICK_BENCHMARK_ROOT=./
declare -g RESULTS_FOLDER=results
mkdir -p $RESULTS_FOLDER
trap cleanup_on_error ERR
ensure_sharegpt_downloaded
# benchmarks
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/serving-tests.json
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json
END_TIME=$(date +%s)
ELAPSED_TIME=$((END_TIME - START_TIME))
echo "Total execution time: $ELAPSED_TIME seconds"
}
main "$@"


@ -0,0 +1,313 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
import argparse
import gc
import json
import multiprocessing
import sys
import time
from multiprocessing import Queue
import lm_eval
import torch
# URLs for version information in Markdown report
VLLM_URL = "https://github.com/vllm-project/vllm/commit/"
VLLM_ASCEND_URL = "https://github.com/vllm-project/vllm-ascend/commit/"
# Model and task configurations
UNIMODAL_MODEL_NAME = ["Qwen/Qwen3-8B-Base", "Qwen/Qwen3-30B-A3B"]
UNIMODAL_TASK = ["ceval-valid", "gsm8k"]
MULTIMODAL_NAME = ["Qwen/Qwen2.5-VL-7B-Instruct"]
MULTIMODAL_TASK = ["mmmu_val"]
# Batch size configurations per task
BATCH_SIZE = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
# Model type mapping (vllm for text, vllm-vlm for vision-language)
MODEL_TYPE = {
"Qwen/Qwen3-8B-Base": "vllm",
"Qwen/Qwen3-30B-A3B": "vllm",
"Qwen/Qwen2.5-VL-7B-Instruct": "vllm-vlm",
}
# Command templates for running evaluations
MODEL_RUN_INFO = {
"Qwen/Qwen3-30B-A3B": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen3-8B-Base": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen2.5-VL-7B-Instruct": (
"export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2'\n"
"lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --batch_size 1"
),
}
# Evaluation metric filters per task
FILTER = {
"gsm8k": "exact_match,flexible-extract",
"ceval-valid": "acc,none",
"mmmu_val": "acc,none",
}
# Expected accuracy values for models
EXPECTED_VALUE = {
"Qwen/Qwen3-30B-A3B": {"ceval-valid": 0.83, "gsm8k": 0.85},
"Qwen/Qwen3-8B-Base": {"ceval-valid": 0.82, "gsm8k": 0.83},
"Qwen/Qwen2.5-VL-7B-Instruct": {"mmmu_val": 0.51},
}
PARALLEL_MODE = {
"Qwen/Qwen3-8B-Base": "TP",
"Qwen/Qwen2.5-VL-7B-Instruct": "TP",
"Qwen/Qwen3-30B-A3B": "EP",
}
# Execution backend configuration
EXECUTION_MODE = {
"Qwen/Qwen3-8B-Base": "ACLGraph",
"Qwen/Qwen2.5-VL-7B-Instruct": "ACLGraph",
"Qwen/Qwen3-30B-A3B": "ACLGraph",
}
# Model arguments for evaluation
MODEL_ARGS = {
"Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6",
"Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2",
"Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True",
}
# Whether to apply chat template formatting
APPLY_CHAT_TEMPLATE = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Few-shot examples handling as multi-turn dialogues.
FEWSHOT_AS_MULTITURN = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Relative tolerance for accuracy checks
RTOL = 0.03
ACCURACY_FLAG = {}
def run_accuracy_test(queue, model, dataset):
"""Run accuracy evaluation for a model on a dataset in separate process"""
try:
eval_params = {
"model": MODEL_TYPE[model],
"model_args": MODEL_ARGS[model],
"tasks": dataset,
"apply_chat_template": APPLY_CHAT_TEMPLATE[model],
"fewshot_as_multiturn": FEWSHOT_AS_MULTITURN[model],
"batch_size": BATCH_SIZE[dataset],
}
if MODEL_TYPE[model] == "vllm":
eval_params["num_fewshot"] = 5
results = lm_eval.simple_evaluate(**eval_params)
print(f"Success: {model} on {dataset} ")
measured_value = results["results"]
queue.put(measured_value)
except Exception as e:
print(f"Error in run_accuracy_test: {e}")
queue.put(e)
sys.exit(1)
finally:
if "results" in locals():
del results
gc.collect()
torch.npu.empty_cache()
time.sleep(5)
def generate_md(model_name, tasks_list, args, datasets):
"""Generate Markdown report with evaluation results"""
# Format the run command
run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name, datasets=datasets)
model = model_name.split("/")[1]
# Version information section
version_info = (
f"**vLLM Version**: vLLM: {args.vllm_version} "
f"([{args.vllm_commit}]({VLLM_URL + args.vllm_commit})), "
f"vLLM Ascend: {args.vllm_ascend_version} "
f"([{args.vllm_ascend_commit}]({VLLM_ASCEND_URL + args.vllm_ascend_commit})) "
)
# Report header with system info
preamble = f"""# {model}
{version_info}
**Software Environment**: CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version}
**Hardware Environment**: Atlas A2 Series
**Datasets**: {datasets}
**Parallel Mode**: {PARALLEL_MODE[model_name]}
**Execution Mode**: {EXECUTION_MODE[model_name]}
**Command**:
```bash
{run_cmd}
```
"""
header = (
"| Task | Filter | n-shot | Metric | Value | Stderr |\n"
"|-----------------------|-------:|-------:|----------|--------:|-------:|"
)
rows = []
rows_sub = []
# Process results for each task
for task_dict in tasks_list:
for key, stats in task_dict.items():
alias = stats.get("alias", key)
task_name = alias.strip()
if "exact_match,flexible-extract" in stats:
metric_key = "exact_match,flexible-extract"
else:
metric_key = None
for k in stats:
if "," in k and not k.startswith("acc_stderr"):
metric_key = k
break
if metric_key is None:
continue
metric, flt = metric_key.split(",", 1)
value = stats[metric_key]
stderr = stats.get(f"{metric}_stderr,{flt}", 0)
if model_name in UNIMODAL_MODEL_NAME:
n_shot = "5"
else:
n_shot = "0"
flag = ACCURACY_FLAG.get(task_name, "")
row = (
f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| {flag}{value:>5.4f} "
f"| ± {stderr:>5.4f} |"
)
if not task_name.startswith("-"):
rows.append(row)
rows_sub.append(
"<details>"
+ "\n"
+ "<summary>"
+ task_name
+ " details"
+ "</summary>"
+ "\n" * 2
+ header
)
rows_sub.append(row)
rows_sub.append("</details>")
# Combine all Markdown sections
md = (
preamble
+ "\n"
+ header
+ "\n"
+ "\n".join(rows)
+ "\n"
+ "\n".join(rows_sub)
+ "\n"
)
print(md)
return md
def safe_md(args, accuracy, datasets):
"""
Safely generate and save Markdown report from accuracy results.
"""
data = json.loads(json.dumps(accuracy))
for model_key, tasks_list in data.items():
md_content = generate_md(model_key, tasks_list, args, datasets)
with open(args.output, "w", encoding="utf-8") as f:
f.write(md_content)
print(f"create Markdown file:{args.output}")
def main(args):
"""Main evaluation workflow"""
accuracy = {}
accuracy[args.model] = []
result_queue: Queue[float] = multiprocessing.Queue()
if args.model in UNIMODAL_MODEL_NAME:
datasets = UNIMODAL_TASK
else:
datasets = MULTIMODAL_TASK
datasets_str = ",".join(datasets)
# Evaluate model on each dataset
for dataset in datasets:
accuracy_expected = EXPECTED_VALUE[args.model][dataset]
p = multiprocessing.Process(
target=run_accuracy_test, args=(result_queue, args.model, dataset)
)
p.start()
p.join()
if p.is_alive():
p.terminate()
p.join()
gc.collect()
torch.npu.empty_cache()
time.sleep(10)
result = result_queue.get()
print(result)
if (
accuracy_expected - RTOL
< result[dataset][FILTER[dataset]]
< accuracy_expected + RTOL
):
            ACCURACY_FLAG[dataset] = "✅"
        else:
            ACCURACY_FLAG[dataset] = "❌"
accuracy[args.model].append(result)
print(accuracy)
safe_md(args, accuracy, datasets_str)
if __name__ == "__main__":
multiprocessing.set_start_method("spawn", force=True)
# Initialize argument parser
parser = argparse.ArgumentParser(
description="Run model accuracy evaluation and generate report"
)
parser.add_argument("--output", type=str, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--vllm_ascend_version", type=str, required=False)
parser.add_argument("--torch_version", type=str, required=False)
parser.add_argument("--torch_npu_version", type=str, required=False)
parser.add_argument("--vllm_version", type=str, required=False)
parser.add_argument("--cann_version", type=str, required=False)
parser.add_argument("--vllm_commit", type=str, required=False)
parser.add_argument("--vllm_ascend_commit", type=str, required=False)
args = parser.parse_args()
main(args)
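
The pass/fail decision in main() boils down to a relative-tolerance window around EXPECTED_VALUE, keyed by the FILTER metric name. A minimal standalone sketch of that check, using a mocked lm_eval-style payload shaped like results["results"] (the metric value is made up):

```python
RTOL = 0.03
EXPECTED_VALUE = {"gsm8k": 0.83}
FILTER = {"gsm8k": "exact_match,flexible-extract"}

# Mocked payload shaped like results["results"] from lm_eval.simple_evaluate().
mock_results = {"gsm8k": {"alias": "gsm8k", "exact_match,flexible-extract": 0.845}}


def within_tolerance(dataset: str, results: dict) -> bool:
    """True when the measured metric lies inside expected +/- RTOL."""
    measured = results[dataset][FILTER[dataset]]
    expected = EXPECTED_VALUE[dataset]
    return expected - RTOL < measured < expected + RTOL


print(within_tolerance("gsm8k", mock_results))  # True: 0.845 is inside 0.83 +/- 0.03
```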


@ -0,0 +1,23 @@
[
{
"test_name": "latency_qwen3_8B_tp1",
"parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"load_format": "dummy",
"max_model_len": 16384,
"num_iters_warmup": 5,
"num_iters": 15
}
},
{
"test_name": "latency_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
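
Each entry above is a named dictionary of engine parameters; the snake_case keys correspond one-to-one to CLI flags of the matching benchmark script. A hedged sketch of that conversion (the params_to_flags helper and the exact flag spellings are illustrative, not something defined in the repo):

```python
import json


def params_to_flags(params: dict) -> list[str]:
    """Turn a 'parameters' dict into CLI-style flags (snake_case -> --kebab-case)."""
    flags: list[str] = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if value == "":            # bare switches, e.g. "disable_log_stats": ""
            flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags


entry = json.loads("""
{
  "test_name": "latency_qwen3_8B_tp1",
  "parameters": {"model": "Qwen/Qwen3-8B", "tensor_parallel_size": 1,
                 "load_format": "dummy", "max_model_len": 16384,
                 "num_iters_warmup": 5, "num_iters": 15}
}
""")
print(entry["test_name"], " ".join(params_to_flags(entry["parameters"])))
```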


@ -0,0 +1,77 @@
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"trust_remote_code": "",
"max_model_len": 16384
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"endpoint_type": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen3_8B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "Qwen/Qwen3-8B",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen2_5_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]
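
The serving entries split server launch parameters from client load parameters and sweep every value in qps_list, where "inf" means an unthrottled request rate. An illustrative walk over one entry (nothing is launched here; the entry is inlined to keep the snippet self-contained):

```python
import json

entry = json.loads("""
{
  "test_name": "serving_qwen3_8B_tp1",
  "qps_list": [1, 4, 16, "inf"],
  "server_parameters": {"model": "Qwen/Qwen3-8B", "tensor_parallel_size": 1},
  "client_parameters": {"model": "Qwen/Qwen3-8B", "num_prompts": 200}
}
""")

for qps in entry["qps_list"]:
    # "inf" is the conventional spelling for an unthrottled client.
    request_rate = float("inf") if qps == "inf" else float(qps)
    print(f"{entry['test_name']}: client at request_rate={request_rate}, "
          f"server args={entry['server_parameters']}")
```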


@ -0,0 +1,38 @@
[
{
"test_name": "throughput_qwen3_8B_tp1",
"parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
},
{
"test_name": "throughput_qwen2_5vl_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"backend": "vllm-chat",
"dataset_name": "hf",
"hf_split": "train",
"max_model_len": 16384,
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
},
{
"test_name": "throughput_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]
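
Taken together, the latency, serving, and throughput files are uniform lists of named tests, which makes it easy for a CI job to enumerate them. A small sketch that scans a hypothetical tests/ directory for suite files and prints what each one defines:

```python
import glob
import json

# The tests/ path is an assumption for illustration; point it at wherever the
# suite JSON files actually live.
for path in sorted(glob.glob("tests/*.json")):
    with open(path, encoding="utf-8") as f:
        names = [test["test_name"] for test in json.load(f)]
    print(f"{path}: {', '.join(names)}")
```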

133
cmake/utils.cmake Normal file

@ -0,0 +1,133 @@
#
# Attempt to find the python package that uses the same python executable as
# `EXECUTABLE` and is one of the `SUPPORTED_VERSIONS`.
#
macro (find_python_from_executable EXECUTABLE SUPPORTED_VERSIONS)
file(REAL_PATH ${EXECUTABLE} EXECUTABLE)
set(Python_EXECUTABLE ${EXECUTABLE})
find_package(Python COMPONENTS Interpreter Development.Module Development.SABIModule)
if (NOT Python_FOUND)
message(FATAL_ERROR "Unable to find python matching: ${EXECUTABLE}.")
endif()
set(_VER "${Python_VERSION_MAJOR}.${Python_VERSION_MINOR}")
set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN})
if (NOT _VER IN_LIST _SUPPORTED_VERSIONS_LIST)
message(FATAL_ERROR
"Python version (${_VER}) is not one of the supported versions: "
"${_SUPPORTED_VERSIONS_LIST}.")
endif()
message(STATUS "Found python matching: ${EXECUTABLE}.")
endmacro()
#
# Run `EXPR` in python. The standard output of python is stored in `OUT` and
# has trailing whitespace stripped. If an error is encountered when running
# python, a fatal message `ERR_MSG` is issued.
#
function (run_python OUT EXPR ERR_MSG)
execute_process(
COMMAND
"${PYTHON_EXECUTABLE}" "-c" "${EXPR}"
OUTPUT_VARIABLE PYTHON_OUT
RESULT_VARIABLE PYTHON_ERROR_CODE
ERROR_VARIABLE PYTHON_STDERR
OUTPUT_STRIP_TRAILING_WHITESPACE)
if(NOT PYTHON_ERROR_CODE EQUAL 0)
message(FATAL_ERROR "${ERR_MSG}: ${PYTHON_STDERR}")
endif()
set(${OUT} ${PYTHON_OUT} PARENT_SCOPE)
endfunction()
# Run `EXPR` in python after importing `PKG`. Use the result of this to extend
# `CMAKE_PREFIX_PATH` so the torch cmake configuration can be imported.
macro (append_cmake_prefix_path PKG EXPR)
run_python(_PREFIX_PATH
"import ${PKG}; print(${EXPR})" "Failed to locate ${PKG} path")
list(APPEND CMAKE_PREFIX_PATH ${_PREFIX_PATH})
endmacro()
# This cmake function is adapted from vLLM's cmake/utils.cmake.
# Define a target named `GPU_MOD_NAME` for a single extension. The
# arguments are:
#
# DESTINATION <dest> - Module destination directory.
# LANGUAGE <lang> - The GPU language for this module, e.g CUDA, HIP,
# etc.
# SOURCES <sources> - List of source files relative to CMakeLists.txt
# directory.
#
# Optional arguments:
#
# ARCHITECTURES <arches> - A list of target GPU architectures in cmake
# format.
# Refer `CMAKE_CUDA_ARCHITECTURES` documentation
# and `CMAKE_HIP_ARCHITECTURES` for more info.
# ARCHITECTURES will use cmake's defaults if
# not provided.
# COMPILE_FLAGS <flags> - Extra compiler flags passed to NVCC/hip.
# INCLUDE_DIRECTORIES <dirs> - Extra include directories.
# LIBRARIES <libraries> - Extra link libraries.
# WITH_SOABI - Generate library with python SOABI suffix name.
# USE_SABI <version> - Use python stable api <version>
#
# Note: optimization level/debug info is set via cmake build type.
#
function (define_gpu_extension_target GPU_MOD_NAME)
cmake_parse_arguments(PARSE_ARGV 1
GPU
"WITH_SOABI"
"DESTINATION;LANGUAGE;USE_SABI"
"SOURCES;ARCHITECTURES;COMPILE_FLAGS;INCLUDE_DIRECTORIES;LIBRARIES")
# Add hipify preprocessing step when building with HIP/ROCm.
if (GPU_LANGUAGE STREQUAL "HIP")
hipify_sources_target(GPU_SOURCES ${GPU_MOD_NAME} "${GPU_SOURCES}")
endif()
if (GPU_WITH_SOABI)
set(GPU_WITH_SOABI WITH_SOABI)
else()
set(GPU_WITH_SOABI)
endif()
if (GPU_USE_SABI)
Python_add_library(${GPU_MOD_NAME} MODULE USE_SABI ${GPU_USE_SABI} ${GPU_WITH_SOABI} "${GPU_SOURCES}")
else()
Python_add_library(${GPU_MOD_NAME} MODULE ${GPU_WITH_SOABI} "${GPU_SOURCES}")
endif()
if (GPU_LANGUAGE STREQUAL "HIP")
# Make this target dependent on the hipify preprocessor step.
add_dependencies(${GPU_MOD_NAME} hipify${GPU_MOD_NAME})
endif()
if (GPU_ARCHITECTURES)
set_target_properties(${GPU_MOD_NAME} PROPERTIES
${GPU_LANGUAGE}_ARCHITECTURES "${GPU_ARCHITECTURES}")
endif()
set_property(TARGET ${GPU_MOD_NAME} PROPERTY CXX_STANDARD 17)
target_compile_options(${GPU_MOD_NAME} PRIVATE
$<$<COMPILE_LANGUAGE:${GPU_LANGUAGE}>:${GPU_COMPILE_FLAGS}>)
target_compile_definitions(${GPU_MOD_NAME} PRIVATE
"-DTORCH_EXTENSION_NAME=${GPU_MOD_NAME}")
target_include_directories(${GPU_MOD_NAME} PRIVATE csrc
${GPU_INCLUDE_DIRECTORIES})
target_link_libraries(${GPU_MOD_NAME} PRIVATE torch ${GPU_LIBRARIES})
# Don't use `TORCH_LIBRARIES` for CUDA since it pulls in a bunch of
# dependencies that are not necessary and may not be installed.
if (GPU_LANGUAGE STREQUAL "CUDA")
target_link_libraries(${GPU_MOD_NAME} PRIVATE CUDA::cudart CUDA::cuda_driver)
else()
target_link_libraries(${GPU_MOD_NAME} PRIVATE ${TORCH_LIBRARIES})
endif()
install(TARGETS ${GPU_MOD_NAME} LIBRARY DESTINATION ${GPU_DESTINATION} COMPONENT ${GPU_MOD_NAME})
endfunction()
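
append_cmake_prefix_path works by evaluating a one-line Python expression in the build interpreter via run_python. The exact expression the build passes in is not shown here, but a typical probe for torch looks like the following; torch.utils.cmake_prefix_path points at the CMake config directory bundled with the installed wheel:

```python
# Standalone version of the kind of probe append_cmake_prefix_path("torch", ...)
# would run: print the wheel's CMake config directory so CMake can append it to
# CMAKE_PREFIX_PATH and find_package(Torch) succeeds.
import torch

print(torch.utils.cmake_prefix_path)
# e.g. .../site-packages/torch/share/cmake
```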

30
codecov.yml Normal file

@ -0,0 +1,30 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
coverage:
status:
# non-voting, new code must be fully tested
patch:
default:
target: 100%
# non-voting
informational: true
# non-voting
project:
default:
# non-voting
informational: true

489
collect_env.py Normal file

@ -0,0 +1,489 @@
#
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Adapted from https://github.com/vllm-project/vllm/blob/main/collect_env.py
#
import datetime
import locale
import os
import re
import subprocess
import sys
from collections import namedtuple
from vllm.envs import environment_variables
try:
import torch
TORCH_AVAILABLE = True
except (ImportError, NameError, AttributeError, OSError):
TORCH_AVAILABLE = False
# System Environment Information
SystemEnv = namedtuple(
'SystemEnv',
[
'torch_version',
'is_debug_build',
'gcc_version',
'clang_version',
'cmake_version',
'os',
'libc_version',
'python_version',
'python_platform',
'pip_version', # 'pip' or 'pip3'
'pip_packages',
'conda_packages',
'cpu_info',
'vllm_version', # vllm specific field
'vllm_ascend_version', # vllm ascend specific field
'env_vars',
'npu_info', # ascend specific field
'cann_info', # ascend specific field
])
DEFAULT_CONDA_PATTERNS = {
"torch",
"numpy",
"soumith",
"mkl",
"magma",
"optree",
"transformers",
"zmq",
"pynvml",
}
DEFAULT_PIP_PATTERNS = {
"torch",
"numpy",
"mypy",
"flake8",
"optree",
"onnx",
"transformers",
"zmq",
"pynvml",
}
def run(command):
"""Return (return-code, stdout, stderr)."""
shell = True if type(command) is str else False
p = subprocess.Popen(command,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
shell=shell)
raw_output, raw_err = p.communicate()
rc = p.returncode
if get_platform() == 'win32':
enc = 'oem'
else:
enc = locale.getpreferredencoding()
output = raw_output.decode(enc)
err = raw_err.decode(enc)
return rc, output.strip(), err.strip()
def run_and_read_all(run_lambda, command):
"""Run command using run_lambda; reads and returns entire output if rc is 0."""
rc, out, _ = run_lambda(command)
if rc != 0:
return None
return out
def run_and_parse_first_match(run_lambda, command, regex):
"""Run command using run_lambda, returns the first regex match if it exists."""
rc, out, _ = run_lambda(command)
if rc != 0:
return None
match = re.search(regex, out)
if match is None:
return None
return match.group(1)
def run_and_return_first_line(run_lambda, command):
"""Run command using run_lambda and returns first line if output is not empty."""
rc, out, _ = run_lambda(command)
if rc != 0:
return None
return out.split('\n')[0]
def get_conda_packages(run_lambda, patterns=None):
if patterns is None:
patterns = DEFAULT_CONDA_PATTERNS
conda = os.environ.get('CONDA_EXE', 'conda')
out = run_and_read_all(run_lambda, "{} list".format(conda))
if out is None:
return out
return "\n".join(line for line in out.splitlines()
if not line.startswith("#") and any(name in line
for name in patterns))
def get_gcc_version(run_lambda):
return run_and_parse_first_match(run_lambda, 'gcc --version', r'gcc (.*)')
def get_clang_version(run_lambda):
return run_and_parse_first_match(run_lambda, 'clang --version',
r'clang version (.*)')
def get_cmake_version(run_lambda):
return run_and_parse_first_match(run_lambda, 'cmake --version',
r'cmake (.*)')
def _parse_version(version, version_tuple):
version_str = version_tuple[-1]
if isinstance(version_str, str) and version_str.startswith('g'):
if '.' in version_str:
git_sha = version_str.split('.')[0][1:]
date = version_str.split('.')[-1][1:]
return f"{version} (git sha: {git_sha}, date: {date})"
else:
git_sha = version_str[1:] # type: ignore
return f"{version} (git sha: {git_sha})"
return version
def get_vllm_version():
from vllm import __version__, __version_tuple__
return _parse_version(__version__, __version_tuple__)
def get_vllm_ascend_version():
from vllm_ascend._version import __version__, __version_tuple__
return _parse_version(__version__, __version_tuple__)
def get_cpu_info(run_lambda):
rc, out, err = 0, '', ''
if get_platform() == 'linux':
rc, out, err = run_lambda('lscpu')
elif get_platform() == 'win32':
rc, out, err = run_lambda(
'wmic cpu get Name,Manufacturer,Family,Architecture,ProcessorType,DeviceID, \
CurrentClockSpeed,MaxClockSpeed,L2CacheSize,L2CacheSpeed,Revision /VALUE'
)
elif get_platform() == 'darwin':
rc, out, err = run_lambda("sysctl -n machdep.cpu.brand_string")
cpu_info = 'None'
if rc == 0:
cpu_info = out
else:
cpu_info = err
return cpu_info
def get_platform():
if sys.platform.startswith('linux'):
return 'linux'
elif sys.platform.startswith('win32'):
return 'win32'
elif sys.platform.startswith('cygwin'):
return 'cygwin'
elif sys.platform.startswith('darwin'):
return 'darwin'
else:
return sys.platform
def get_mac_version(run_lambda):
return run_and_parse_first_match(run_lambda, 'sw_vers -productVersion',
r'(.*)')
def get_windows_version(run_lambda):
system_root = os.environ.get('SYSTEMROOT', 'C:\\Windows')
wmic_cmd = os.path.join(system_root, 'System32', 'Wbem', 'wmic')
findstr_cmd = os.path.join(system_root, 'System32', 'findstr')
return run_and_read_all(
run_lambda,
'{} os get Caption | {} /v Caption'.format(wmic_cmd, findstr_cmd))
def get_lsb_version(run_lambda):
return run_and_parse_first_match(run_lambda, 'lsb_release -a',
r'Description:\t(.*)')
def check_release_file(run_lambda):
return run_and_parse_first_match(run_lambda, 'cat /etc/*-release',
r'PRETTY_NAME="(.*)"')
def get_os(run_lambda):
from platform import machine
platform = get_platform()
if platform == 'win32' or platform == 'cygwin':
return get_windows_version(run_lambda)
if platform == 'darwin':
version = get_mac_version(run_lambda)
if version is None:
return None
return 'macOS {} ({})'.format(version, machine())
if platform == 'linux':
# Ubuntu/Debian based
desc = get_lsb_version(run_lambda)
if desc is not None:
return '{} ({})'.format(desc, machine())
# Try reading /etc/*-release
desc = check_release_file(run_lambda)
if desc is not None:
return '{} ({})'.format(desc, machine())
return '{} ({})'.format(platform, machine())
# Unknown platform
return platform
def get_python_platform():
import platform
return platform.platform()
def get_libc_version():
import platform
if get_platform() != 'linux':
return 'N/A'
return '-'.join(platform.libc_ver())
def get_pip_packages(run_lambda, patterns=None):
"""Return `pip list` output. Note: will also find conda-installed pytorch and numpy packages."""
if patterns is None:
patterns = DEFAULT_PIP_PATTERNS
# People generally have `pip` as `pip` or `pip3`
# But here it is invoked as `python -mpip`
def run_with_pip(pip):
out = run_and_read_all(run_lambda, pip + ["list", "--format=freeze"])
return "\n".join(line for line in out.splitlines()
if any(name in line for name in patterns))
pip_version = 'pip3' if sys.version[0] == '3' else 'pip'
out = run_with_pip([sys.executable, '-mpip'])
return pip_version, out
def get_npu_info(run_lambda):
return run_and_read_all(run_lambda, 'npu-smi info')
def get_cann_info(run_lambda):
out = run_and_read_all(run_lambda, 'lscpu | grep Architecture:')
cpu_arch = str(out).split()[-1]
return run_and_read_all(
run_lambda,
'cat /usr/local/Ascend/ascend-toolkit/latest/{}-linux/ascend_toolkit_install.info'
.format(cpu_arch))
def get_env_vars():
env_vars = ''
secret_terms = ('secret', 'token', 'api', 'access', 'password')
report_prefix = ("TORCH", "PYTORCH", "ASCEND_", "ATB_")
for k, v in os.environ.items():
if any(term in k.lower() for term in secret_terms):
continue
if k in environment_variables:
env_vars = env_vars + "{}={}".format(k, v) + "\n"
if k.startswith(report_prefix):
env_vars = env_vars + "{}={}".format(k, v) + "\n"
return env_vars
def get_env_info():
run_lambda = run
pip_version, pip_list_output = get_pip_packages(run_lambda)
if TORCH_AVAILABLE:
version_str = torch.__version__
debug_mode_str = str(torch.version.debug)
else:
version_str = debug_mode_str = 'N/A'
sys_version = sys.version.replace("\n", " ")
conda_packages = get_conda_packages(run_lambda)
return SystemEnv(
torch_version=version_str,
is_debug_build=debug_mode_str,
python_version='{} ({}-bit runtime)'.format(
sys_version,
sys.maxsize.bit_length() + 1),
python_platform=get_python_platform(),
pip_version=pip_version,
pip_packages=pip_list_output,
conda_packages=conda_packages,
os=get_os(run_lambda),
libc_version=get_libc_version(),
gcc_version=get_gcc_version(run_lambda),
clang_version=get_clang_version(run_lambda),
cmake_version=get_cmake_version(run_lambda),
cpu_info=get_cpu_info(run_lambda),
vllm_version=get_vllm_version(),
vllm_ascend_version=get_vllm_ascend_version(),
env_vars=get_env_vars(),
npu_info=get_npu_info(run_lambda),
cann_info=get_cann_info(run_lambda),
)
env_info_fmt = """
PyTorch version: {torch_version}
Is debug build: {is_debug_build}
OS: {os}
GCC version: {gcc_version}
Clang version: {clang_version}
CMake version: {cmake_version}
Libc version: {libc_version}
Python version: {python_version}
Python platform: {python_platform}
CPU:
{cpu_info}
Versions of relevant libraries:
{pip_packages}
{conda_packages}
""".strip()
# both the above code and the following code use `strip()` to
# remove leading/trailing whitespaces, so we need to add a newline
# in between to separate the two sections
env_info_fmt += "\n"
env_info_fmt += """
vLLM Version: {vllm_version}
vLLM Ascend Version: {vllm_ascend_version}
ENV Variables:
{env_vars}
NPU:
{npu_info}
CANN:
{cann_info}
""".strip()
def pretty_str(envinfo):
def replace_nones(dct, replacement='Could not collect'):
for key in dct.keys():
if dct[key] is not None:
continue
dct[key] = replacement
return dct
def replace_bools(dct, true='Yes', false='No'):
for key in dct.keys():
if dct[key] is True:
dct[key] = true
elif dct[key] is False:
dct[key] = false
return dct
def prepend(text, tag='[prepend]'):
lines = text.split('\n')
updated_lines = [tag + line for line in lines]
return '\n'.join(updated_lines)
def replace_if_empty(text, replacement='No relevant packages'):
if text is not None and len(text) == 0:
return replacement
return text
def maybe_start_on_next_line(string):
# If `string` is multiline, prepend a \n to it.
if string is not None and len(string.split('\n')) > 1:
return '\n{}\n'.format(string)
return string
mutable_dict = envinfo._asdict()
# Replace True with Yes, False with No
mutable_dict = replace_bools(mutable_dict)
# Replace all None objects with 'Could not collect'
mutable_dict = replace_nones(mutable_dict)
# If either of these are '', replace with 'No relevant packages'
mutable_dict['pip_packages'] = replace_if_empty(
mutable_dict['pip_packages'])
mutable_dict['conda_packages'] = replace_if_empty(
mutable_dict['conda_packages'])
# Tag conda and pip packages with a prefix
# If they were previously None, they'll show up as ie '[conda] Could not collect'
if mutable_dict['pip_packages']:
mutable_dict['pip_packages'] = prepend(
mutable_dict['pip_packages'], '[{}] '.format(envinfo.pip_version))
if mutable_dict['conda_packages']:
mutable_dict['conda_packages'] = prepend(
mutable_dict['conda_packages'], '[conda] ')
mutable_dict['cpu_info'] = envinfo.cpu_info
mutable_dict['npu_info'] = envinfo.npu_info
mutable_dict['cann_info'] = envinfo.cann_info
return env_info_fmt.format(**mutable_dict)
def get_pretty_env_info():
return pretty_str(get_env_info())
def main():
print("Collecting environment information...")
output = get_pretty_env_info()
print(output)
if TORCH_AVAILABLE and hasattr(torch, 'utils') and hasattr(
torch.utils, '_crash_handler'):
minidump_dir = torch.utils._crash_handler.DEFAULT_MINIDUMP_DIR
if sys.platform == "linux" and os.path.exists(minidump_dir):
dumps = [
os.path.join(minidump_dir, dump)
for dump in os.listdir(minidump_dir)
]
latest = max(dumps, key=os.path.getctime)
ctime = os.path.getctime(latest)
creation_time = datetime.datetime.fromtimestamp(ctime).strftime(
'%Y-%m-%d %H:%M:%S')
msg = "\n*** Detected a minidump at {} created on {}, ".format(latest, creation_time) + \
"if this is related to your bug please include it when you file a report ***"
print(msg, file=sys.stderr)
if __name__ == '__main__':
main()
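
The vLLM and vLLM Ascend version strings in the report come from _parse_version, which unpacks a setuptools-scm style version tuple whose last element encodes the git sha and, optionally, a date. A self-contained illustration with made-up sample values:

```python
def _parse_version(version, version_tuple):
    # Same logic as the helper above, repeated here so the snippet runs on its own.
    version_str = version_tuple[-1]
    if isinstance(version_str, str) and version_str.startswith('g'):
        if '.' in version_str:
            git_sha = version_str.split('.')[0][1:]
            date = version_str.split('.')[-1][1:]
            return f"{version} (git sha: {git_sha}, date: {date})"
        git_sha = version_str[1:]
        return f"{version} (git sha: {git_sha})"
    return version


# Made-up inputs shaped like (__version__, __version_tuple__):
print(_parse_version("0.9.2", ("0.9.2", "g1a2b3c4.d20250719")))
# -> 0.9.2 (git sha: 1a2b3c4, date: 20250719)
```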

338
csrc/camem_allocator.cpp Normal file

@ -0,0 +1,338 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <iostream>
extern "C" {
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <sys/types.h>
#include "acl/acl.h"
// Global references to Python callables
// NOTE: these are borrowed references, so we don't need to DECREF them.
// This brings the limitation that the allocator needs to be a singleton.
static PyObject* g_python_malloc_callback = nullptr;
static PyObject* g_python_free_callback = nullptr;
// ---------------------------------------------------------------------------
// Helper functions:
void ensure_context(unsigned long long device) {
aclrtContext pctx;
aclrtGetCurrentContext(&pctx);
if (!pctx) {
// Ensure device context.
aclrtCreateContext(&pctx, device);
aclrtSetCurrentContext(pctx);
}
}
void create_and_map(unsigned long long device, ssize_t size, void* d_mem,
aclrtDrvMemHandle* p_memHandle) {
ensure_context(device);
// Define memory allocation properties
aclrtPhysicalMemProp prop = {};
prop.handleType = ACL_MEM_HANDLE_TYPE_NONE ;
prop.allocationType = ACL_MEM_ALLOCATION_TYPE_PINNED;
prop.memAttr = ACL_HBM_MEM_HUGE;
prop.location.id = device;
prop.location.type = ACL_MEM_LOCATION_TYPE_DEVICE;
prop.reserve = 0;
// Allocate memory using aclrtMallocPhysical
aclError error_code = aclrtMallocPhysical(p_memHandle, size, &prop, 0);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return;
}
error_code = aclrtMapMem(d_mem, size, 0, *p_memHandle, 0);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return;
}
}
void unmap_and_release(unsigned long long device, ssize_t size,
void* d_mem,
aclrtDrvMemHandle* p_memHandle) {
// std::cout << "unmap_and_release: device=" << device << ", size=" << size <<
// ", d_mem=" << d_mem << ", p_memHandle=" << p_memHandle << std::endl;
ensure_context(device);
aclError error_code = aclrtUnmapMem(d_mem);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return;
}
error_code = aclrtFreePhysical(*p_memHandle);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return;
}
}
PyObject* create_tuple_from_c_integers(unsigned long long a,
unsigned long long b,
unsigned long long c,
unsigned long long d) {
// Create a new tuple of size 4
PyObject* tuple = PyTuple_New(4);
if (!tuple) {
return NULL; // Return NULL on failure
}
// Convert integers to Python objects and set them in the tuple
PyTuple_SetItem(
tuple, 0,
PyLong_FromUnsignedLongLong(a)); // Steals reference to the PyLong
PyTuple_SetItem(tuple, 1, PyLong_FromUnsignedLongLong(b));
PyTuple_SetItem(tuple, 2, PyLong_FromUnsignedLongLong(c));
PyTuple_SetItem(tuple, 3, PyLong_FromUnsignedLongLong(d));
// Note: PyTuple_SetItem "steals" a reference to each object,
// so we do not need to Py_DECREF the PyLong objects explicitly.
return tuple; // Return the created tuple
}
// ---------------------------------------------------------------------------
// Our exported C functions that call Python:
__attribute__ ((visibility("default"))) void* my_malloc(ssize_t size, int device, aclrtStream stream) {
ensure_context(device);
  // First allocation: align the size, reserve a device address, and also allocate
  // an aclrtDrvMemHandle.
// Define memory allocation properties
aclrtPhysicalMemProp prop = {};
prop.handleType = ACL_MEM_HANDLE_TYPE_NONE ;
prop.allocationType = ACL_MEM_ALLOCATION_TYPE_PINNED;
prop.memAttr = ACL_HBM_MEM_HUGE;
prop.location.id = device;
prop.location.type = ACL_MEM_LOCATION_TYPE_DEVICE;
prop.reserve = 0;
// Check if the allocation is supported
size_t granularity;
aclError error_code = aclrtMemGetAllocationGranularity(&prop,
ACL_RT_MEM_ALLOC_GRANULARITY_MINIMUM,
&granularity);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return nullptr;
}
size_t alignedSize = ((size + granularity - 1) / granularity) * granularity;
void *d_mem;
error_code = aclrtReserveMemAddress(&d_mem, alignedSize, 0, nullptr, 0);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return nullptr;
}
// allocate the aclrtDrvMemHandle
aclrtDrvMemHandle* p_memHandle =
(aclrtDrvMemHandle*)malloc(sizeof(aclrtDrvMemHandle));
if (!g_python_malloc_callback) {
std::cerr << "ERROR: g_python_malloc_callback not set.\n";
return nullptr;
}
// Acquire GIL (not in stable ABI officially, but often works)
PyGILState_STATE gstate = PyGILState_Ensure();
PyObject* arg_tuple = create_tuple_from_c_integers(
(unsigned long long)device, (unsigned long long)alignedSize,
(unsigned long long)d_mem, (unsigned long long)p_memHandle);
// Call g_python_malloc_callback
PyObject* py_result =
PyObject_CallFunctionObjArgs(g_python_malloc_callback, arg_tuple, NULL);
Py_DECREF(arg_tuple);
if (!py_result) {
PyErr_Print();
PyGILState_Release(gstate);
return nullptr;
}
PyGILState_Release(gstate);
// do the final mapping
create_and_map(device, alignedSize, d_mem, p_memHandle);
return (void*)d_mem;
}
__attribute__ ((visibility("default"))) void my_free(void* ptr, ssize_t size, int device, aclrtStream stream) {
// get memory handle from the pointer
if (!g_python_free_callback) {
std::cerr << "ERROR: g_python_free_callback not set.\n";
return;
}
// Acquire GIL (not in stable ABI officially, but often works)
PyGILState_STATE gstate = PyGILState_Ensure();
PyObject* py_ptr =
PyLong_FromUnsignedLongLong(reinterpret_cast<unsigned long long>(ptr));
PyObject* py_result =
PyObject_CallFunctionObjArgs(g_python_free_callback, py_ptr, NULL);
if (!py_result || !PyTuple_Check(py_result) || PyTuple_Size(py_result) != 4) {
PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
return;
}
unsigned long long recv_device, recv_size;
unsigned long long recv_d_mem, recv_p_memHandle;
// Unpack the tuple into four C integers
if (!PyArg_ParseTuple(py_result, "KKKK", &recv_device, &recv_size,
&recv_d_mem, &recv_p_memHandle)) {
// PyArg_ParseTuple sets an error if it fails
return;
}
PyGILState_Release(gstate);
// recv_size == size
// recv_device == device
// Free memory
void *d_mem = (void*)recv_d_mem;
// allocate the aclrtDrvMemHandle
aclrtDrvMemHandle* p_memHandle =
(aclrtDrvMemHandle*)recv_p_memHandle;
unmap_and_release(device, size, d_mem, p_memHandle);
// free address and the handle
aclError error_code = aclrtReleaseMemAddress(d_mem);
if (error_code != 0) {
std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
<< __LINE__ << std::endl;
return;
}
free(p_memHandle);
}
// ---------------------------------------------------------------------------
// Python extension boilerplate:
// Python-exposed function: init_module(python_malloc, python_free)
static PyObject* py_init_module(PyObject* self, PyObject* args) {
PyObject* malloc_callback = nullptr;
PyObject* free_callback = nullptr;
if (!PyArg_ParseTuple(args, "OO", &malloc_callback, &free_callback)) {
return nullptr;
}
if (!PyCallable_Check(malloc_callback) || !PyCallable_Check(free_callback)) {
PyErr_SetString(PyExc_TypeError, "Both arguments must be callables");
return nullptr;
}
// Save the Python callables
// This module does not handle GC of these objects, so they must be kept alive
// outside of this module.
g_python_malloc_callback = malloc_callback;
g_python_free_callback = free_callback;
Py_RETURN_NONE;
}
static PyObject* python_unmap_and_release(PyObject* self, PyObject* args) {
if (!args || !PyTuple_Check(args) || PyTuple_Size(args) != 4) {
PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
return nullptr;
}
unsigned long long recv_device, recv_size;
unsigned long long recv_d_mem, recv_p_memHandle;
// Unpack the tuple into four C integers
if (!PyArg_ParseTuple(args, "KKKK", &recv_device, &recv_size, &recv_d_mem,
&recv_p_memHandle)) {
// PyArg_ParseTuple sets an error if it fails
return nullptr;
}
void *d_mem_ptr = (void*)recv_d_mem;
aclrtDrvMemHandle* p_memHandle =
(aclrtDrvMemHandle*)recv_p_memHandle;
unmap_and_release(recv_device, recv_size, d_mem_ptr, p_memHandle);
Py_RETURN_NONE;
}
static PyObject* python_create_and_map(PyObject* self, PyObject* args) {
if (!args || !PyTuple_Check(args) || PyTuple_Size(args) != 4) {
PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
return nullptr;
}
unsigned long long recv_device, recv_size;
unsigned long long recv_d_mem, recv_p_memHandle;
// Unpack the tuple into four C integers
if (!PyArg_ParseTuple(args, "KKKK", &recv_device, &recv_size, &recv_d_mem,
&recv_p_memHandle)) {
// PyArg_ParseTuple sets an error if it fails
return nullptr;
}
void *d_mem_ptr = (void*)recv_d_mem;
aclrtDrvMemHandle* p_memHandle =
(aclrtDrvMemHandle*)recv_p_memHandle;
create_and_map(recv_device, recv_size, d_mem_ptr, p_memHandle);
Py_RETURN_NONE;
}
static PyMethodDef module_methods[] = {
{"init_module", (PyCFunction)py_init_module, METH_VARARGS,
"Initialize module with python_malloc and python_free callables."},
{"python_create_and_map", (PyCFunction)python_create_and_map, METH_VARARGS,
"Create and map memory on the device."},
{"python_unmap_and_release", (PyCFunction)python_unmap_and_release,
METH_VARARGS, "Unmap and release memory on the device."},
{NULL, NULL, 0, NULL} // sentinel
};
static struct PyModuleDef camem_allocator_module = {
PyModuleDef_HEAD_INIT, "camem_allocator",
"CANN-mem-based allocator for NPUPluggableAllocator", -1, module_methods};
PyMODINIT_FUNC PyInit_vllm_ascend_C(void) {
// Initialize the module
PyObject* module = PyModule_Create(&camem_allocator_module);
if (!module) {
return NULL;
}
return module;
}
} // extern "C"
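
On the Python side, the contract established by init_module is small: the malloc callback is handed a 4-tuple (device, aligned_size, d_mem, p_memHandle) to record, and the free callback must return that same 4-tuple when given the device pointer. A hedged sketch of those callbacks (the dictionary bookkeeping and the commented-out registration are illustrative; the real wiring belongs to the NPU pluggable-allocator integration):

```python
# Records one entry per outstanding allocation, keyed by the device pointer.
allocations: dict[int, tuple[int, int, int, int]] = {}


def python_malloc_callback(args: tuple[int, int, int, int]) -> None:
    device, aligned_size, d_mem, p_mem_handle = args
    allocations[d_mem] = args  # keep the handle so my_free can release it later


def python_free_callback(ptr: int) -> tuple[int, int, int, int]:
    # Must return (device, aligned_size, d_mem, p_memHandle) for the C side.
    return allocations.pop(ptr)


# Registration would look roughly like this; keep both callables alive for the
# whole process, since the C side stores only borrowed references:
# camem_allocator.init_module(python_malloc_callback, python_free_callback)
```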


@ -0,0 +1,378 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*/
#include "kernel_operator.h"
#include "kernel_tensor_impl.h"
#include "kernel_type.h"
#include "types.h"
#include "utils.h"
using vllm_ascend::AccType;
template<typename scalar_t>
class GetMaskedInputAndMask {
public:
__aicore__ inline GetMaskedInputAndMask() {}
__aicore__ inline ~GetMaskedInputAndMask() {
pipe.Reset();
}
__aicore__ inline void Init(
__gm__ scalar_t* input,
__gm__ scalar_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size)
{
// Initialize basic parameters
input_ = input;
masked_input_ = masked_input;
mask_out_ = mask_out;
org_vocab_start_index_ = org_vocab_start_index;
org_vocab_end_index_ = org_vocab_end_index;
size_ = ((size + 31) / 32) * 32;
added_offset_ = added_vocab_start_index -
(org_vocab_end_index - org_vocab_start_index) -
num_org_vocab_padding;
added_vocab_start_index_ = added_vocab_start_index;
added_vocab_end_index_ = added_vocab_end_index;
// Initialize global tensors
inputGlobal.SetGlobalBuffer(input);
maskedOutputGlobal.SetGlobalBuffer(masked_input);
maskOutGlobal.SetGlobalBuffer(mask_out);
// Initialize queues
pipe.InitBuffer(inQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(outQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(maskQueue, 1, size_ * sizeof(bool));
// Initialize calculation buffers
// NOTE: calc_buf_1 and calc_buf_2 are also used for int16 casting on older archs.
pipe.InitBuffer(calc_buf_1, size_ * sizeof(float));
pipe.InitBuffer(calc_buf_2, size_ * sizeof(float));
// Initialize result queues
pipe.InitBuffer(result_ge_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_le_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_org_mask_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_add_mask_que, BUFFER_NUM, size_ * sizeof(float));
// Initialize temporary buffers
pipe.InitBuffer(start_buf, size_ * sizeof(float));
pipe.InitBuffer(end_buf, size_ * sizeof(float));
pipe.InitBuffer(inputFloat_buf, size_ * sizeof(float)); // Also used for half intermediate in casting
pipe.InitBuffer(validOffset_buf, size_ * sizeof(float));
pipe.InitBuffer(vocabMask_buf_, size_ * sizeof(int8_t));
pipe.InitBuffer(ones_buf_, size_ * sizeof(float));
}
__aicore__ inline void Process()
{
CopyIn();
Compute();
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.AllocTensor<scalar_t>();
AscendC::DataCopy(inputLocal, inputGlobal, size_);
inQueue.EnQue(inputLocal);
}
__aicore__ inline void CompareWithValue(
AscendC::LocalTensor<int8_t>& result,
const AscendC::LocalTensor<float>& input,
const AscendC::LocalTensor<float>& compare_value,
bool is_greater_equal) {
AscendC::LocalTensor<float> compute_buf = calc_buf_1.Get<float>();
if (is_greater_equal) {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compare_value, compute_buf, size_);
} else {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compute_buf, compare_value, size_);
}
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::Mins(compute_buf, compute_buf, MIN_ACCURACY_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_2_FP32, size_);
AscendC::Adds(compute_buf, compute_buf, NEGATIVE_ONE_FP32, size_);
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::LocalTensor<half> compute_buf_fp16 = calc_buf_2.Get<half>();
AscendC::Cast(compute_buf_fp16, compute_buf, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(result, compute_buf_fp16, AscendC::RoundMode::CAST_NONE, size_);
}
__aicore__ inline void ComputeRangeMask(
AscendC::LocalTensor<int8_t>& range_mask,
const AscendC::LocalTensor<float>& input,
const float start_value,
const float end_value) {
AscendC::LocalTensor<float> start_value_tensor = start_buf.Get<float>();
AscendC::LocalTensor<float> end_value_tensor = end_buf.Get<float>();
AscendC::Duplicate(start_value_tensor, start_value, size_);
AscendC::Duplicate(end_value_tensor, end_value, size_);
AscendC::LocalTensor<int8_t> ge_result = result_ge_que.AllocTensor<int8_t>();
AscendC::LocalTensor<int8_t> lt_result = result_le_que.AllocTensor<int8_t>();
CompareWithValue(ge_result, start_value_tensor, input, true);
CompareWithValue(lt_result, input, end_value_tensor, false);
#if (__CCE_AICORE__ >= 220)
AscendC::And(range_mask, ge_result, lt_result, size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 And. Use int16 And.
AscendC::LocalTensor<int16_t> ge_result_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> lt_result_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> range_mask_i16 = ge_result_i16;
// Use a temporary buffer for half type
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, ge_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(ge_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, lt_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(lt_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform And on int16_t tensors
AscendC::And(range_mask_i16, ge_result_i16, lt_result_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, range_mask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(range_mask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
}
__aicore__ inline void Compute() {
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.DeQue<scalar_t>();
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.AllocTensor<scalar_t>();
AscendC::LocalTensor<int8_t> maskLocal = maskQueue.AllocTensor<int8_t>();
AscendC::LocalTensor<float> inputFloat = inputFloat_buf.Get<float>();
AscendC::Cast(inputFloat, inputLocal, AscendC::RoundMode::CAST_NONE, size_);
AscendC::LocalTensor<int8_t> orgVocabMask = result_org_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(orgVocabMask,
inputFloat,
static_cast<float>(org_vocab_start_index_),
static_cast<float>(org_vocab_end_index_));
AscendC::LocalTensor<int8_t> addedVocabMask = result_add_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(addedVocabMask,
inputFloat,
static_cast<float>(added_vocab_start_index_),
static_cast<float>(added_vocab_end_index_));
AscendC::LocalTensor<float> validOffset = validOffset_buf.Get<float>();
AscendC::LocalTensor<float> constOrgStartIndex = start_buf.Get<float>();
AscendC::Duplicate(constOrgStartIndex, float(org_vocab_start_index_), size_);
AscendC::LocalTensor<half> orgVocabMask_fp16;
AscendC::LocalTensor<float> orgVocabMask_fp32;
AscendC::Cast(orgVocabMask_fp16, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_fp32, orgVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(validOffset, constOrgStartIndex, orgVocabMask_fp32, size_);
AscendC::LocalTensor<float> addedOffset;
AscendC::LocalTensor<float> addedOffsetTensor = end_buf.Get<float>();
AscendC::Duplicate(addedOffsetTensor, float(added_offset_), size_);
AscendC::LocalTensor<half> addedVocabMask_fp16;
AscendC::LocalTensor<float> addedVocabMask_fp32;
AscendC::Cast(addedVocabMask_fp16, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_fp32, addedVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(addedOffset, addedOffsetTensor, addedVocabMask_fp32, size_);
AscendC::Add(validOffset, validOffset, addedOffset, size_);
AscendC::LocalTensor<int8_t> vocabMask = vocabMask_buf_.Get<int8_t>();
#if (__CCE_AICORE__ >= 220)
AscendC::Or(vocabMask,
orgVocabMask,
addedVocabMask,
size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 Or. Use int16 Or.
AscendC::LocalTensor<int16_t> orgVocabMask_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> addedVocabMask_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> vocabMask_i16 = orgVocabMask_i16;
// Use a temporary buffer for half type. inputFloat_buf is free now.
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform Or on int16_t tensors
AscendC::Or(vocabMask_i16, orgVocabMask_i16, addedVocabMask_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, vocabMask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
AscendC::Sub(inputFloat, inputFloat, validOffset, size_);
AscendC::LocalTensor<half> vocabMask_fp16;
AscendC::LocalTensor<float> vocabMask_fp32;
AscendC::Cast(vocabMask_fp16, vocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask_fp32, vocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(inputFloat, inputFloat, vocabMask_fp32, size_);
AscendC::Cast(maskedLocal, inputFloat, AscendC::RoundMode::CAST_CEIL, size_);
outQueue.EnQue(maskedLocal);
AscendC::LocalTensor<float> ones_tensor = ones_buf_.Get<float>();
AscendC::Duplicate(ones_tensor, (float)1, size_);
AscendC::LocalTensor<float> maskLocal_fp32;
AscendC::Sub(maskLocal_fp32, ones_tensor, vocabMask_fp32, size_);
AscendC::LocalTensor<half> maskLocal_fp16;
AscendC::Cast(maskLocal_fp16, maskLocal_fp32, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(maskLocal, maskLocal_fp16, AscendC::RoundMode::CAST_NONE, size_);
maskQueue.EnQue(maskLocal);
inQueue.FreeTensor(inputLocal);
}
__aicore__ inline void CopyOut()
{
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.DeQue<scalar_t>();
AscendC::LocalTensor<bool> maskLocal = maskQueue.DeQue<bool>();
AscendC::DataCopy(maskedOutputGlobal, maskedLocal, size_);
AscendC::DataCopy(maskOutGlobal, maskLocal, size_);
outQueue.FreeTensor(maskedLocal);
maskQueue.FreeTensor(maskLocal);
}
private:
static constexpr int32_t BUFFER_NUM = 2;
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueue;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue, maskQueue;
AscendC::GlobalTensor<scalar_t> inputGlobal, maskedOutputGlobal;
AscendC::GlobalTensor<bool> maskOutGlobal;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_1;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_2;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_ge_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_le_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_org_mask_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_add_mask_que;
// Temporary buffers
AscendC::TBuf<AscendC::TPosition::VECCALC> start_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> end_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> inputFloat_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> validOffset_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> vocabMask_buf_;
AscendC::TBuf<AscendC::TPosition::VECCALC> ones_buf_;
__gm__ scalar_t *input_, *masked_input_;
__gm__ bool *mask_out_;
int64_t size_;
int64_t org_vocab_start_index_, org_vocab_end_index_;
int64_t added_vocab_start_index_, added_vocab_end_index_;
int64_t added_offset_;
static constexpr float MIN_ACCURACY_FP32 = 1.1754943508222875e-38;
static constexpr float MAX_MUL_1_FP32 = 1125899906842624;
static constexpr float MAX_MUL_2_FP32 = 67108864;
static constexpr float NEGATIVE_ONE_FP32 = -1.0f;
};
extern "C" __global__ __aicore__ void get_masked_input_and_mask_kernel(
__gm__ int32_t* input,
__gm__ int32_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
{
GetMaskedInputAndMask<int32_t> op{};
for (int64_t i = AscendC::GetBlockIdx(); i < loop_cnt; i += aiv_num) {
op.Init(input + i * size/loop_cnt,
masked_input + i * size/loop_cnt,
mask_out + i * size/loop_cnt,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index, size/loop_cnt);
op.Process();
}
} // op destructor called here
}
namespace vllm_ascend {
void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
get_masked_input_and_mask_kernel<<<aiv_num, nullptr, stream>>>(
static_cast<int32_t*>(input),
static_cast<int32_t*>(masked_input),
static_cast<bool*>(mask_out),
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aiv_num);
}
} // namespace vllm_ascend
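
Stripped of the AscendC tiling and the older-arch cast workarounds, the math the kernel computes is the usual vocab-parallel masking: map in-range token ids to local indices, zero everything else, and return a mask of the invalid positions. A host-side PyTorch sketch with made-up vocab ranges:

```python
import torch


def get_masked_input_and_mask_ref(input_ids, org_start, org_end,
                                  num_org_vocab_padding, added_start, added_end):
    """Reference formulation of the kernel's masking math on the host."""
    org_mask = (input_ids >= org_start) & (input_ids < org_end)
    added_mask = (input_ids >= added_start) & (input_ids < added_end)
    added_offset = added_start - (org_end - org_start) - num_org_vocab_padding
    valid_offset = org_start * org_mask + added_offset * added_mask
    vocab_mask = org_mask | added_mask
    masked_input = (input_ids - valid_offset) * vocab_mask
    return masked_input, ~vocab_mask  # mask is True where the id is out of range


ids = torch.tensor([3, 150, 1005, 999])
masked, invalid = get_masked_input_and_mask_ref(ids, org_start=100, org_end=200,
                                                num_org_vocab_padding=0,
                                                added_start=1000, added_end=1010)
print(masked)   # tensor([  0,  50, 105,   0])
print(invalid)  # tensor([ True, False, False,  True])
```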


@ -0,0 +1,377 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "kernel_operator.h"
#include "kernel_tpipe_impl.h"
#include "kernel_tensor_impl.h"
#include "kernel_type.h"
#include "kernel_operator_intf.h"
#include "inner_interface/inner_kernel_operator_intf.h"
#include <stdio.h>
#include "types.h"
#include "utils.h"
using vllm_ascend::AccType;
using vllm_ascend::local_mem_copy;
template <typename scalar_t, bool isNeox> class RotaryEmbedding {
    // NOTE(ganyi): we use 512B as the load stride for the pipe; we need to find another
    // way to retrieve this size from the runtime to support more SoCs.
#if (__CCE_AICORE__ >= 220)
static int constexpr loadSize = 512;
#else
static int constexpr loadSize = 1024 * 4;
#endif
using dst_t = scalar_t;
using acc_t = typename AccType<scalar_t>::type;
    // only half tensors have a cast instruction to int8, so acc_dst_t is hard-coded as half
using local_scalar_t = AscendC::LocalTensor<scalar_t>;
using local_acc_t = AscendC::LocalTensor<acc_t>;
using local_dst_t = AscendC::LocalTensor<dst_t>;
public:
__aicore__ inline RotaryEmbedding()
{
}
    // Allocate buffers for the input/output queues and the temporary buffers used during the
    // kernel compute process; this initialization happens inside the kernel, on a single vector core.
__aicore__ inline void init(__gm__ int64_t *positions, __gm__ void *queryDst, __gm__ void *keyDst,
__gm__ scalar_t *query, __gm__ scalar_t *key, __gm__ scalar_t *cosSinCache,
const int rotDim, const int64_t dstQueryStride,
const int64_t dstKeyStride, const int64_t queryStride, const int64_t keyStride,
const int numHeads, const int numKvHeads, const int headSize, AscendC::TPipe *pipe)
{
pipe_ = pipe;
rotDim_ = rotDim;
// query stride and key stride is used to handle the strided tensor which is not contiguous on num_tokens dim
queryStride_ = queryStride;
keyStride_ = keyStride;
dstQueryStride_ = dstQueryStride;
dstKeyStride_ = dstKeyStride;
numHeads_ = numHeads;
numKvHeads_ = numKvHeads;
headSize_ = headSize;
embedDim_ = rotDim / 2;
pipe_->InitBuffer(inQue_, 1 /* buffer_num */, loadSize /* buffer_size */);
pipe_->InitBuffer(inQueSinCos_, 1 /* buffer_num */, rotDim_ * sizeof(scalar_t) /* buffer_size */);
pipe_->InitBuffer(outQue_, 1 /* buffer_num */, loadSize /* buffer_size */);
// 2 temporary calculation buffer
calcTmpBufferOffset_ = 0;
// 1 upcast buffer for bf16 (headSize)
upcastInputBufferOffset_ = calcTmpBufferOffset_ + sizeof(acc_t) * embedDim_ * 2;
// 1 upcast temp buffer for bf16 (2 * embed_dim)
upcastTempBufferOffset_ = upcastInputBufferOffset_ + sizeof(acc_t) * headSize_;
// 2 sin cos upcast buffer for bf16
cosSinUpcastBufferOffset_ = upcastTempBufferOffset_ + sizeof(acc_t) * 2 * embedDim_;
// 2. bf16 path: needs 2 cos sin upcast buffer size
// 3. fp16 path: needs 2 temporary calculation buffer size
tempBufferSize_ = cosSinUpcastBufferOffset_ + 2 * embedDim_ * sizeof(acc_t);
// need to consider upcast the bf16 to fp32, so we might need 4 buffer just in case
// 2 temporary buffer, 2 input buffer, 1 cos buffer, 1 sin buffer, 2 scale buffer (headSize), 2 zp
// buffer(headSize int8), 1 dst_temp buffer(headSize, int32)
pipe_->InitBuffer(calcBuf_, tempBufferSize_ /* buffer_size */);
if constexpr (!std::is_same_v<scalar_t, acc_t>) {
pipe_->InitBuffer(copyBuf_, loadSize);
}
}
__aicore__ inline void update_mem_offset(__gm__ int64_t *positions, __gm__ void *queryDst, __gm__ void *keyDst,
__gm__ scalar_t *query, __gm__ scalar_t *key, __gm__ scalar_t *cosSinCache,
const int rotDim, const int64_t dstQueryStride, const int64_t dstKeyStride,
const int64_t queryStride, const int64_t keyStride, const int numHeads,
const int numKvHeads, const int headSize, const int64_t idx)
{
int64_t pos = positions[idx];
cosSin_.SetGlobalBuffer(cosSinCache + pos * rotDim_, rotDim_);
query_.SetGlobalBuffer(query + queryStride * idx, headSize * numHeads_);
key_.SetGlobalBuffer(key + keyStride * idx, headSize * numKvHeads_);
queryDst_.SetGlobalBuffer(reinterpret_cast<__gm__ dst_t *>(queryDst) + dstQueryStride * idx,
headSize * numHeads_);
keyDst_.SetGlobalBuffer(reinterpret_cast<__gm__ dst_t *>(keyDst) + dstKeyStride * idx, headSize * numKvHeads_);
}
// compute per head for neox on bf16
template <typename acc_t_, typename std::enable_if<!std::is_same_v<acc_t_, scalar_t>, void>::type * = nullptr>
__aicore__ inline void
neox_compute(local_scalar_t src, local_dst_t dst, AscendC::LocalTensor<acc_t_> sin, AscendC::LocalTensor<acc_t_> cos,
AscendC::LocalTensor<acc_t_> upcastInputBuffer, AscendC::LocalTensor<acc_t_> calcTmpBuffer)
{
// slice dst
local_dst_t dstX = dst;
local_dst_t dstY = dst[embedDim_];
// slice src
local_scalar_t srcX = src;
local_scalar_t srcY = src[embedDim_];
// slice temp buffer
local_acc_t calcTmpBufferX = calcTmpBuffer;
local_acc_t calcTmpBufferY = calcTmpBuffer[embedDim_];
// slice upcast input buffer
local_acc_t upcastBufferX = upcastInputBuffer;
local_acc_t upcastBufferY = upcastBufferX[embedDim_];
// dst x calc
Cast(upcastInputBuffer, src, AscendC::RoundMode::CAST_NONE, headSize_);
Mul(calcTmpBufferX, upcastBufferX, cos, embedDim_);
Mul(calcTmpBufferY, upcastBufferY, sin, embedDim_);
Sub(calcTmpBufferX, calcTmpBufferX, calcTmpBufferY, embedDim_);
Cast(dstX, calcTmpBufferX, AscendC::RoundMode::CAST_TRUNC, embedDim_);
// dst y calc
Mul(calcTmpBufferX, upcastBufferX, sin, embedDim_);
Mul(calcTmpBufferY, upcastBufferY, cos, embedDim_);
Add(calcTmpBufferX, calcTmpBufferX, calcTmpBufferY, embedDim_);
Cast(dstY, calcTmpBufferX, AscendC::RoundMode::CAST_TRUNC, embedDim_);
}
// compute per head output for neox
template <typename acc_t_, typename std::enable_if<std::is_same_v<acc_t_, scalar_t>, void>::type * = nullptr>
__aicore__ inline void
neox_compute(local_scalar_t src, local_dst_t dst, AscendC::LocalTensor<acc_t_> sin, AscendC::LocalTensor<acc_t_> cos,
AscendC::LocalTensor<acc_t_> upcastInputBuffer, AscendC::LocalTensor<acc_t_> calcTmpBuffer)
{
// slice dst buffer
local_dst_t dstX = dst;
local_dst_t dstY = dst[embedDim_];
// slice src buffer
local_scalar_t srcX = src;
local_scalar_t srcY = src[embedDim_];
// slice temp buffer
local_acc_t calcTmpBufferX = calcTmpBuffer;
local_acc_t calcTmpBufferY = calcTmpBuffer[embedDim_];
// dst x calc
Mul(calcTmpBufferX, srcX, cos, embedDim_);
Mul(calcTmpBufferY, srcY, sin, embedDim_);
Sub(dstX, calcTmpBufferX, calcTmpBufferY, embedDim_);
// dst y calc
Mul(calcTmpBufferX, srcX, sin, embedDim_);
Mul(calcTmpBufferY, srcY, cos, embedDim_);
Add(dstY, calcTmpBufferX, calcTmpBufferY, embedDim_);
}
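    // In plain math: for each (x, y) pair split at embedDim_ (neox layout), both
    // overloads above compute
    //   x' = x * cos(theta) - y * sin(theta)
    //   y' = x * sin(theta) + y * cos(theta)
    // i.e. a standard 2-D rotation; the bf16 overload additionally upcasts the
    // inputs to fp32 (acc_t) before the multiplies and casts back at the end.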
__aicore__ inline void compute_qk(AscendC::GlobalTensor<scalar_t> srcG, AscendC::GlobalTensor<dst_t> dstG,
local_acc_t localCos, local_acc_t localSin, local_acc_t upcastInputBuffer,
local_acc_t calcTmpBuffer, int loopCnt, int tailHeads, int loadStride,
int headNumPerLoad)
{
for (int loopNum = 0; loopNum < loopCnt; ++loopNum) {
local_scalar_t src = inQue_.AllocTensor<scalar_t>();
local_dst_t dst = outQue_.AllocTensor<dst_t>();
AscendC::DataCopy(src, srcG[loopNum * loadStride], loadStride);
inQue_.EnQue(src);
local_scalar_t srcDeque = inQue_.DeQue<scalar_t>();
if constexpr (!std::is_same_v<scalar_t, acc_t>) {
int elem_num = loadStride / sizeof(scalar_t);
AscendC::LocalTensor<acc_t> upBuffer = copyBuf_.GetWithOffset<acc_t>(elem_num, 0);
Cast(upBuffer, srcDeque, AscendC::RoundMode::CAST_TRUNC, elem_num);
Cast(dst, upBuffer, AscendC::RoundMode::CAST_TRUNC, elem_num);
} else {
local_mem_copy(dst, srcDeque, loadStride);
}
for (int i = 0; i < headNumPerLoad; ++i) {
neox_compute(srcDeque[i * headSize_], dst[i * headSize_], localSin, localCos, upcastInputBuffer,
calcTmpBuffer);
}
outQue_.EnQue(dst);
local_dst_t dstDeque = outQue_.DeQue<dst_t>();
AscendC::DataCopy(dstG[loopNum * loadStride], dstDeque, loadStride);
outQue_.FreeTensor(dstDeque);
inQue_.FreeTensor(srcDeque);
}
// process tail
{
local_scalar_t src = inQue_.AllocTensor<scalar_t>();
local_dst_t dst = outQue_.AllocTensor<dst_t>();
AscendC::DataCopy(src, srcG[loopCnt * loadStride], tailHeads * headSize_);
inQue_.EnQue(src);
local_scalar_t srcDeque = inQue_.DeQue<scalar_t>();
if constexpr (!std::is_same_v<scalar_t, acc_t>) {
int elem_num = tailHeads * headSize_ / sizeof(scalar_t);
AscendC::LocalTensor<acc_t> upBuffer = copyBuf_.GetWithOffset<acc_t>(elem_num, 0);
Cast(upBuffer, srcDeque, AscendC::RoundMode::CAST_TRUNC, elem_num);
Cast(dst, upBuffer, AscendC::RoundMode::CAST_TRUNC, elem_num);
} else {
local_mem_copy(dst, srcDeque, tailHeads * headSize_);
}
for (int i = 0; i < tailHeads; ++i) {
neox_compute(srcDeque[i * headSize_], dst[i * headSize_], localSin, localCos, upcastInputBuffer,
calcTmpBuffer);
}
outQue_.EnQue(dst);
local_dst_t dstDeque = outQue_.DeQue<dst_t>();
AscendC::DataCopy(dstG[loopCnt * loadStride], dstDeque, tailHeads * headSize_);
outQue_.FreeTensor(dstDeque);
inQue_.FreeTensor(srcDeque);
}
}
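// compute_function: loads the cos/sin cache entry for this token, upcasts it if needed,
// then rotates all query heads and all key heads by calling compute_qk on each global
// tensor. Heads are processed in chunks of headNumPerLoad, with a tail pass for the remainder.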
__aicore__ inline void compute_function()
{
local_scalar_t cosSinLocal = inQueSinCos_.AllocTensor<scalar_t>();
AscendC::DataCopy(cosSinLocal, cosSin_, embedDim_ * 2);
inQueSinCos_.EnQue(cosSinLocal);
local_scalar_t localSinCosDeque = inQueSinCos_.DeQue<scalar_t>();
local_scalar_t localCos = localSinCosDeque;
local_scalar_t localSin = localSinCosDeque[embedDim_];
local_acc_t calcTmpBuffer;
local_acc_t upcastInputBuffer;
local_acc_t upcastTempBuffer;
local_acc_t cosSinUpcastBuffer;
local_acc_t scaleBuffer;
local_acc_t offsetBuffer;
calcTmpBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, calcTmpBufferOffset_);
upcastInputBuffer = calcBuf_.GetWithOffset<acc_t>(headSize_, upcastInputBufferOffset_);
upcastTempBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, upcastTempBufferOffset_);
cosSinUpcastBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, cosSinUpcastBufferOffset_);
local_acc_t cosAccBuffer;
local_acc_t sinAccBuffer;
if constexpr (!std::is_same_v<scalar_t, acc_t>) {
Cast(cosSinUpcastBuffer, localSinCosDeque, AscendC::RoundMode::CAST_NONE, 2 * embedDim_);
cosAccBuffer = cosSinUpcastBuffer;
sinAccBuffer = cosSinUpcastBuffer[embedDim_];
} else {
cosAccBuffer = localCos;
sinAccBuffer = localSin;
}
constexpr const int loadSizeByElem = loadSize / sizeof(scalar_t);
int64_t headNumPerLoad = loadSizeByElem / headSize_;
int64_t loopCnt = numHeads_ / headNumPerLoad;
int64_t tailHeads = numHeads_ - loopCnt * headNumPerLoad;
int64_t loadStride = headNumPerLoad * headSize_;
int64_t loopCntKv = numKvHeads_ / headNumPerLoad;
int64_t tailHeadsKv = numKvHeads_ - loopCntKv * headNumPerLoad;
compute_qk(query_, queryDst_, cosAccBuffer, sinAccBuffer, upcastInputBuffer,
calcTmpBuffer, loopCnt, tailHeads, loadStride, headNumPerLoad);
compute_qk(key_, keyDst_, cosAccBuffer, sinAccBuffer, upcastInputBuffer, calcTmpBuffer,
loopCntKv, tailHeadsKv, loadStride, headNumPerLoad);
inQueSinCos_.FreeTensor(localSinCosDeque);
}
private:
AscendC::TPipe *pipe_;
AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQue_, inQueSinCos_;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQue_;
AscendC::TBuf<AscendC::TPosition::VECCALC> calcBuf_;
AscendC::TBuf<AscendC::TPosition::VECCALC> copyBuf_;
AscendC::GlobalTensor<dst_t> queryDst_;
AscendC::GlobalTensor<dst_t> keyDst_;
AscendC::GlobalTensor<scalar_t> query_;
AscendC::GlobalTensor<scalar_t> key_;
AscendC::GlobalTensor<scalar_t> cosSin_;
int rotDim_;
int embedDim_;
int64_t queryStride_;
int64_t keyStride_;
int64_t dstQueryStride_;
int64_t dstKeyStride_;
int numHeads_;
int numKvHeads_;
int headSize_;
int calcTmpBufferOffset_;
int upcastInputBufferOffset_;
int upcastTempBufferOffset_;
int cosSinUpcastBufferOffset_;
int tempBufferSize_;
};
// Note: We need to use macros to instantiate all the target functions here, because the current build system does not support template calls in .cpp files.
// We use C-style symbols here for kernel compilation; C++-style kernel entries may lead to compilation failures.
#define ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, NEOX) \
extern "C" __global__ __aicore__ void rope_custom_##NEOX##_##TYPE( \
__gm__ int64_t* positions, __gm__ void* queryDst, __gm__ void* keyDst, __gm__ TYPE* query, __gm__ TYPE* key, \
__gm__ TYPE* cosSinCache, const int rotDim, const int64_t queryStride, const int64_t keyStride, \
const int64_t dstQueryStride, const int64_t dstKeyStride, const int numHeads, const int numKvHeads, \
const int headSize, const int64_t numTokens, const int loopNum, const int coreNum) \
{ \
AscendC::TPipe pipe; \
RotaryEmbedding<TYPE, NEOX> op{}; \
op.init(positions, queryDst, keyDst, query, key, cosSinCache, rotDim, dstQueryStride, dstKeyStride, \
queryStride, keyStride, numHeads, numKvHeads, headSize, &pipe); \
for (int64_t i = AscendC::GetBlockIdx(); i < numTokens; i += coreNum) { \
op.update_mem_offset(positions, queryDst, keyDst, query, key, cosSinCache, rotDim, dstQueryStride, dstKeyStride, \
queryStride, keyStride, numHeads, numKvHeads, headSize, i); \
op.compute_function(); \
} \
}
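// Each expanded kernel walks the tokens in a grid-stride loop: the block with index i
// handles tokens i, i + coreNum, i + 2 * coreNum, ..., so launching blockDim <= numTokens
// blocks covers every token exactly once.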
#define ROPE_CUSTOM_KERNEL_DECLARE(TYPE) \
ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, true); \
ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, false);
// Declare all the kernel entry here
ROPE_CUSTOM_KERNEL_DECLARE(half)
#if (__CCE_AICORE__ >= 220)
ROPE_CUSTOM_KERNEL_DECLARE(bfloat16_t)
#endif
namespace vllm_ascend {
#define ROTARY_EMBEDDING_KERNEL_CALL(TYPE) \
if (isNeox) \
rope_custom_true_##TYPE<<<blockDim, nullptr, stream>>>( \
positions, queryDst, keyDst, reinterpret_cast<TYPE *>(query), reinterpret_cast<TYPE *>(key), \
reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim); \
else \
rope_custom_false_##TYPE<<<blockDim, nullptr, stream>>>( \
positions, queryDst, keyDst, reinterpret_cast<TYPE *>(query), reinterpret_cast<TYPE *>(key), \
reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim);
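// ROTARY_EMBEDDING_KERNEL_CALL dispatches at runtime to the NeoX or non-NeoX kernel
// instantiation for the given element type, based on isNeox.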
// Maximum block number the runtime allows when launching an AscendC kernel.
// We use it to cap the block dimension of the kernel launch.
static const int64_t maxParallelSize = 65535;
extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, int64_t *positions, void *queryDst,
void *keyDst, void *query, void *key, void *cosSinCache, const int rotDim,
const int64_t queryStride, const int64_t keyStride, const int64_t dstQueryStride,
const int64_t dstKeyStride, const int numHeads, const int numKvHeads,
const int headSize, const int64_t numTokens, const uint32_t loopCnt,
uint32_t aivNum)
{
int blockDim = maxParallelSize > numTokens ? numTokens : maxParallelSize;
if (type == AscendType::FP16) {
ROTARY_EMBEDDING_KERNEL_CALL(half);
}
#if (__CCE_AICORE__ >= 220)
else if (type == AscendType::BF16) {
ROTARY_EMBEDDING_KERNEL_CALL(bfloat16_t);
}
#endif
else {
return;
}
}
} // namespace vllm_ascend

25
csrc/kernels/types.h Normal file
View File

@ -0,0 +1,25 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
namespace vllm_ascend {
enum struct AscendType {
FP16 = 0,
BF16 = 1,
FP32 = 2,
};
}

51
csrc/kernels/utils.h Normal file
View File

@ -0,0 +1,51 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "kernel_type.h"
namespace vllm_ascend {
template <typename scalar_t> struct AccType;
#if (__CCE_AICORE__ >= 220)
template <> struct AccType<bfloat16_t> {
using type = float;
};
#endif
template <> struct AccType<half> {
using type = half;
};
template <> struct AccType<float> {
using type = float;
};
template <> struct AccType<int8_t> {
using type = int;
};
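// local_mem_copy: vectorized copy between two local (unified buffer) tensors.
// It issues full 256-byte Copy repeats first, then copies the remaining tail
// elements in a single extra repeat.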
template <typename scalar_t>
__aicore__ inline void local_mem_copy(AscendC::LocalTensor<scalar_t> dst, AscendC::LocalTensor<scalar_t> src, int size)
{
constexpr int loadSize = 256 / sizeof(scalar_t);
int loopCnt = size / loadSize;
int tailSize = size % loadSize;
if (loopCnt)
AscendC::Copy(dst, src, loadSize, loopCnt, {1, 1, 8, 8});
AscendC::Copy(dst[loopCnt * loadSize], src[loopCnt * loadSize], tailSize, 1, {1, 1, 8, 8});
}
} // namespace vllm_ascend

63
csrc/ops.h Normal file
View File

@ -0,0 +1,63 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include <optional>
#include <torch/library.h>
#include <vector>
#include "kernels/types.h"
#include "torch_npu/csrc/aten/common/from_blob.h"
namespace vllm_ascend {
extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, int64_t *positions, void *queryDst,
void *keyDst, void *query, void *key, void *cosSinCache, const int rotDim,
const int64_t queryStride, const int64_t keyStride, const int64_t dstQueryStride,
const int64_t dstKeyStride, const int numHeads, const int numKvHeads,
const int headSize, const int64_t numTokens, const uint32_t loopCnt,
uint32_t aivNum);
extern void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num);
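// weak_ref_tensor: wrap an existing NPU tensor's storage in a new tensor via from_blob
// without taking ownership, so the returned tensor aliases the same device memory and
// does not extend its lifetime.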
torch::Tensor weak_ref_tensor(torch::Tensor& tensor) {
if (!tensor.is_privateuseone()) {
throw std::runtime_error("Tensor must be on NPU device");
}
// Get the raw data pointer
void* data_ptr = tensor.data_ptr();
// Get tensor sizes and strides
std::vector<int64_t> sizes = tensor.sizes().vec();
std::vector<int64_t> strides = tensor.strides().vec();
// Get tensor options (dtype, device)
auto options = tensor.options();
// Create a new tensor from the raw data pointer
auto new_tensor = at_npu::native::from_blob(data_ptr, sizes, strides, options);
return new_tensor;
}
}

233
csrc/torch_binding.cpp Normal file
View File

@ -0,0 +1,233 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <torch/extension.h>
#include <torch/library.h>
#include <torch/version.h>
#include <torch_npu/csrc/core/npu/NPUStream.h>
#include <torch_npu/csrc/framework/OpCommand.h>
#include <torch_npu/csrc/npu/Module.h>
#include <pybind11/pybind11.h>
#include "acl/acl.h"
#include "tiling/platform/platform_ascendc.h"
#include "aclnn/opdev/platform.h"
#include "ops.h"
#include "utils.h"
namespace vllm_ascend {
std::tuple<at::Tensor, at::Tensor> rotary_embedding(at::Tensor &positions, at::Tensor &query, at::Tensor &key,
int64_t head_size, at::Tensor &cos_sin_cache, bool is_neox)
{
int32_t deviceId = 0;
int64_t num_tokens = positions.numel();
int positions_ndim = positions.dim();
TORCH_CHECK(
positions_ndim == 1 || positions_ndim == 2,
"positions must have shape [num_tokens] or [batch_size, seq_len]");
if (positions_ndim == 1) {
TORCH_CHECK(
query.size(0) == positions.size(0) && key.size(0) == positions.size(0),
"query, key and positions must have the same number of tokens");
}
if (positions_ndim == 2) {
TORCH_CHECK(
query.size(0) == positions.size(0) &&
key.size(0) == positions.size(0) &&
query.size(1) == positions.size(1) &&
key.size(1) == positions.size(1),
"query, key and positions must have the same batch_size and seq_len");
}
TORCH_CHECK(head_size % 32 == 0, "rotary_embedding: headSize should be divisible by 32");
int query_hidden_size = query.numel() / num_tokens;
int key_hidden_size = key.numel() / num_tokens;
TORCH_CHECK(query_hidden_size % head_size == 0);
TORCH_CHECK(key_hidden_size % head_size == 0);
TORCH_CHECK(is_neox == true, "rotary_embedding: neox=false is not supported as custom kernel in vllm-ascend");
// Make sure query and key have consistent number of heads
int num_heads = query_hidden_size / head_size;
int num_kv_heads = key_hidden_size / head_size;
TORCH_CHECK(num_heads % num_kv_heads == 0);
at::Tensor query_dst = at::empty({num_tokens, num_heads, head_size}, query.options());
at::Tensor key_dst = at::empty({num_tokens, num_kv_heads, head_size}, key.options());
int rot_dim = cos_sin_cache.size(1);
int seq_dim_idx = positions_ndim - 1;
int64_t *position_ids_ptr = positions.data_ptr<int64_t>();
void *query_dst_ptr = query_dst.data_ptr();
void *key_dst_ptr = key_dst.data_ptr();
void *query_ptr = query.data_ptr();
void *key_ptr = key.data_ptr();
void *cos_sin_cache_ptr = cos_sin_cache.data_ptr();
int64_t query_stride = query.stride(seq_dim_idx);
int64_t key_stride = key.stride(seq_dim_idx);
int64_t dst_query_stride = query_dst.stride(0);
int64_t dst_key_stride = key_dst.stride(0);
at::ScalarType scalar_type = query.scalar_type();
aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
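// Enqueue the AscendC kernel on the current NPU stream through OpCommand's custom-handler
// path; the lambda below computes the launch parameters and calls the kernel launcher when
// the command is executed.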
at_npu::native::OpCommand cmd;
cmd.Name("rotary_embedding");
cmd.SetCustomHandler([scalar_type, is_neox, num_tokens, stream, position_ids_ptr, query_dst_ptr, key_dst_ptr,
query_ptr, key_ptr, cos_sin_cache_ptr, rot_dim, query_stride, key_stride,
dst_query_stride, dst_key_stride, num_heads, num_kv_heads, head_size]() -> int {
auto dtype_num = get_dtype_from_torch(scalar_type);
fe::PlatFormInfos platform_infos;
int device_id = 0;
fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
uint32_t loop_cnt = (num_tokens + aivNum - 1) / aivNum;
rotary_embedding_impl(dtype_num, is_neox, stream, position_ids_ptr, query_dst_ptr, key_dst_ptr, query_ptr,
key_ptr, cos_sin_cache_ptr, rot_dim, query_stride, key_stride, dst_query_stride,
dst_key_stride, num_heads, num_kv_heads, head_size, num_tokens, loop_cnt, aivNum);
return 0;
});
cmd.Run();
return {query_dst, key_dst};
}
std::tuple<at::Tensor, at::Tensor> get_masked_input_and_mask(
at::Tensor &input,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index)
/*
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L161-L198
Embedding parallelized in the vocabulary dimension.
Adapted from torch.nn.Embedding, note that we pad the vocabulary size to
make sure it is divisible by the number of model parallel GPUs.
In order to support various loading methods, we ensure that LoRA-added
embeddings are always at the end of TP-sharded tensors. In other words,
we shard base embeddings and LoRA embeddings separately (both padded),
and place them in the same tensor.
In this example, we will have the original vocab size = 1010,
added vocab size = 16 and padding to 64. Therefore, the total
vocab size with padding will be 1088 (because we first pad 1010 to
1024, add 16, and then pad to 1088).
Therefore, the tensor format looks like the following:
TP1, rank 0 (no sharding):
|< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
corresponding token_id: | 0 | 1 | ... | 1009 | -1 | ... | -1 | 1010 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |
TP2, rank 0:
|< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
corresponding token_id: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 1000 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 527 | 528 | ... | 543 |
TP2, rank 1:
|< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1 | ... | -1 | -1 | ... | -1 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 519 | 520 | ... | 543 |
Parameters:
org_vocab_start_index //base embeddings start
org_vocab_end_index //base embeddings end
num_org_vocab_padding //base embeddings padding
added_vocab_start_index //LoRA embeddings start
added_vocab_end_index //LoRA embeddings end
*/
{
// Input validation
TORCH_CHECK(input.dim() >= 1, "input must have at least 1 dimension");
TORCH_CHECK(org_vocab_start_index >= 0, "org_vocab_start_index must be non-negative");
TORCH_CHECK(org_vocab_end_index >= org_vocab_start_index, "org_vocab_end_index must be greater than or equal to org_vocab_start_index");
TORCH_CHECK(num_org_vocab_padding >= 0, "num_org_vocab_padding must be non-negative");
TORCH_CHECK(added_vocab_start_index >= org_vocab_end_index, "added_vocab_start_index must be greater than or equal to org_vocab_end_index");
TORCH_CHECK(added_vocab_end_index >= added_vocab_start_index, "added_vocab_end_index must be greater than or equal to added_vocab_start_index");
// Get total number of elements
int64_t size = input.numel();
// Create output tensors
at::Tensor masked_input = at::empty_like(input);
at::Tensor mask = at::empty_like(input).to(at::kBool);
// Get data pointers
void *input_ptr = input.data_ptr();
void *masked_input_ptr = masked_input.data_ptr();
void *mask_ptr = mask.data_ptr();
// Get current stream
aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
// Get scalar type
at::ScalarType scalar_type = input.scalar_type();
// Create and configure OpCommand
at_npu::native::OpCommand cmd;
cmd.Name("get_masked_input_and_mask");
cmd.SetCustomHandler([scalar_type, size, stream,
input_ptr, masked_input_ptr, mask_ptr,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index]() -> int {
// Get platform info
fe::PlatFormInfos platform_infos;
int device_id = 0;
fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
uint32_t loop_cnt = (size + aivNum - 1) / aivNum;
// Call implementation
get_masked_input_and_mask_impl(
stream,
input_ptr,
masked_input_ptr,
mask_ptr,
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aivNum);
return 0;
});
cmd.Run();
return {masked_input, mask};
}
} // namespace vllm_ascend
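// Register the custom ops with torch.library so that, once the _C extension module is
// imported, they are callable from Python as torch.ops._C.rotary_embedding(...) and
// torch.ops._C.get_masked_input_and_mask(...).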
TORCH_LIBRARY_EXPAND(_C, ops)
{
// vLLM-Ascend custom ops
ops.def("weak_ref_tensor(Tensor input) -> Tensor");
ops.impl("weak_ref_tensor", torch::kPrivateUse1, &vllm_ascend::weak_ref_tensor);
// Rotary embedding
// Apply GPT-NeoX style rotary embedding to query and key.
ops.def(
"rotary_embedding(Tensor positions, Tensor! query,"
" Tensor! key, int head_size,"
" Tensor cos_sin_cache, bool is_neox) -> (Tensor query, Tensor key)");
ops.impl("rotary_embedding", torch::kPrivateUse1, &vllm_ascend::rotary_embedding);
ops.def(
"get_masked_input_and_mask(Tensor input, "
" int org_vocab_start_index, "
" int org_vocab_end_index, "
" int num_org_vocab_padding, "
" int added_vocab_start_index, "
" int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)");
ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask);
}
REGISTER_EXTENSION(_C)

43
csrc/utils.h Normal file
View File

@ -0,0 +1,43 @@
#pragma once
#include "kernels/types.h"
#include <c10/core/ScalarType.h>
#include <Python.h>
#define _CONCAT(A, B) A##B
#define CONCAT(A, B) _CONCAT(A, B)
#define _STRINGIFY(A) #A
#define STRINGIFY(A) _STRINGIFY(A)
// A version of the TORCH_LIBRARY macro that expands the NAME, i.e. so NAME
// could be a macro instead of a literal token.
#define TORCH_LIBRARY_EXPAND(NAME, MODULE) TORCH_LIBRARY(NAME, MODULE)
// A version of the TORCH_LIBRARY_IMPL macro that expands the NAME, i.e. so NAME
// could be a macro instead of a literal token.
#define TORCH_LIBRARY_IMPL_EXPAND(NAME, DEVICE, MODULE) \
TORCH_LIBRARY_IMPL(NAME, DEVICE, MODULE)
// REGISTER_EXTENSION allows the shared library to be loaded and initialized
// via python's import statement.
#define REGISTER_EXTENSION(NAME) \
PyMODINIT_FUNC CONCAT(PyInit_, NAME)() { \
static struct PyModuleDef module = {PyModuleDef_HEAD_INIT, \
STRINGIFY(NAME), nullptr, 0, nullptr}; \
return PyModule_Create(&module); \
}
namespace vllm_ascend {
AscendType get_dtype_from_torch(at::ScalarType scalarType)
{
if (scalarType == at::ScalarType::Float) {
return AscendType::FP32;
} else if (scalarType == at::ScalarType::BFloat16) {
return AscendType::BF16;
} else {
return AscendType::FP16;
}
}
} // namespace vllm_ascend

View File

@ -16,7 +16,7 @@ make html
## Open the docs with your browser
```bash
python -m http.server -d build/html/
python -m http.server -d _build/html/
```
Launch your browser and open http://localhost:8000/.

View File

@ -1,2 +1,2 @@
pytest-asyncio
pytest-mock

View File

@ -0,0 +1,58 @@
<!--
**********************************************************************
* Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
* Copyright 2023 The vLLM team.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* This file is a part of the vllm-ascend project.
* Adapted from https://github.com/vllm-project/vllm/blob/main/docs/source/_templates/sections/header.html
**********************************************************************
-->
<style>
.notification-bar {
width: 100vw;
display: flex;
justify-content: center;
align-items: center;
font-size: 16px;
}
.notification-bar p {
margin: 0;
}
.notification-bar a {
font-weight: bold;
text-decoration: none;
}
/* Light mode styles (default) */
.notification-bar {
background-color: #fff3cd;
color: #856404;
}
.notification-bar a {
color: #d97706;
}
/* Dark mode styles */
html[data-theme=dark] .notification-bar {
background-color: #333;
color: #ddd;
}
html[data-theme=dark] .notification-bar a {
color: #ffa500; /* Brighter color for visibility */
}
</style>
<div class="notification-bar">
<p>You are viewing the latest developer preview docs. <a href="https://vllm-ascend.readthedocs.io/en/v0.7.3-dev">Click here</a> to view docs for the latest stable release (v0.7.3.post1).</p>
</div>

Binary file not shown.

After

Width:  |  Height:  |  Size: 115 KiB

View File

@ -0,0 +1,102 @@
# Maintainers and contributors
## Maintainers
| Name | Github ID | Date |
|:-----------:|:-----:|:-----:|
| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
| Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
| Shoujian Zheng| [@jianzs](https://github.com/jianzs) | 2025/06 |
## Contributors
Every vLLM Ascend release would not have been possible without the following contributors:
Updated on 2025-06-10:
| Number | Contributor | Date | Commit ID |
|:------:|:-----------:|:-----:|:---------:|
| 83 | [@ZhengWG](https://github.com/) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
| 82 | [@wm901115nwpu](https://github.com/) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
| 81 | [@Agonixiaoxiao](https://github.com/) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
| 80 | [@zhanghw0354](https://github.com/zhanghw0354) | 2025/7/2 | [d3df9a5](https://github.com/vllm-project/vllm-ascend/commit/9fb3d558e5b57a3c97ee5e11b9f5dba6ad3df9a5) |
| 79 | [@GDzhu01](https://github.com/GDzhu01) | 2025/6/28 | [de256ac](https://github.com/vllm-project/vllm-ascend/commit/b308a7a25897b88d4a23a9e3d583f4ec6de256ac) |
| 78 | [@leo-pony](https://github.com/leo-pony) | 2025/6/26 | [3f2a5f2](https://github.com/vllm-project/vllm-ascend/commit/10253449120307e3b45f99d82218ba53e3f2a5f2) |
| 77 | [@zeshengzong](https://github.com/zeshengzong) | 2025/6/26 | [3ee25aa](https://github.com/vllm-project/vllm-ascend/commit/192dbbcc6e244a8471d3c00033dc637233ee25aa) |
| 76 | [@sharonyunyun](https://github.com/sharonyunyun) | 2025/6/25 | [2dd8666](https://github.com/vllm-project/vllm-ascend/commit/941269a6c5bbc79f6c1b6abd4680dc5802dd8666) |
| 75 | [@Pr0Wh1teGivee](https://github.com/Pr0Wh1teGivee) | 2025/6/25 | [c65dd40](https://github.com/vllm-project/vllm-ascend/commit/2fda60464c287fe456b4a2f27e63996edc65dd40) |
| 74 | [@xleoken](https://github.com/xleoken) | 2025/6/23 | [c604de0](https://github.com/vllm-project/vllm-ascend/commit/4447e53d7ad5edcda978ca6b0a3a26a73c604de0) |
| 73 | [@lyj-jjj](https://github.com/lyj-jjj) | 2025/6/23 | [5cbd74e](https://github.com/vllm-project/vllm-ascend/commit/5177bef87a21331dcca11159d3d1438075cbd74e) |
| 72 | [@farawayboat](https://github.com/farawayboat) | 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392) |
| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94) |
| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f) |
| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12| [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
| 68 | [@chenwaner](https://github.com/chenwaner) | 2025/6/11 | [c696169](https://github.com/vllm-project/vllm-ascend/commit/e46dc142bf1180453c64226d76854fc1ec696169) |
| 67 | [@yzim](https://github.com/yzim) | 2025/6/11 | [aaf701b](https://github.com/vllm-project/vllm-ascend/commit/4153a5091b698c2270d160409e7fee73baaf701b) |
| 66 | [@Yuxiao-Xu](https://github.com/Yuxiao-Xu) | 2025/6/9 | [6b853f1](https://github.com/vllm-project/vllm-ascend/commit/6b853f15fe69ba335d2745ebcf14a164d0bcc505) |
| 65 | [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU) | 2025/6/7 | [20dedba](https://github.com/vllm-project/vllm-ascend/commit/20dedba5d1fc84b7ae8b49f9ce3e3649389e2193) |
| 64 | [@zxdukki](https://github.com/zxdukki) | 2025/6/7 | [87ebaef](https://github.com/vllm-project/vllm-ascend/commit/87ebaef4e4e519988f27a6aa378f614642202ecf) |
| 63 | [@sdmyzlp](https://github.com/sdmyzlp) | 2025/6/7 | [3640c60](https://github.com/vllm-project/vllm-ascend/commit/3640c60b0eb4d4cb104e20bfa406d3f1d17920a7) |
| 62 | [@weijinqian0](https://github.com/weijinqian0) | 2025/6/7 | [e9ada68](https://github.com/vllm-project/vllm-ascend/commit/e9ada685ece798f9fe0d4a287e3f5246a8a7207b) |
| 61 | [@hahazhky](https://github.com/hahazhky) | 2025/6/6 | [0b12c2a](https://github.com/vllm-project/vllm-ascend/commit/0b12c2acf7d9fd192beebebf662298067d9a5435) |
| 60 | [@depeng1994](https://github.com/depeng1994) | 2025/6/6 | [6b094a2](https://github.com/vllm-project/vllm-ascend/commit/6b094a2bd49a8a41eb3647568b2d9e5b337db81f) |
| 59 | [@David9857](https://github.com/David9857) | 2025/6/5 | [78431b3](https://github.com/vllm-project/vllm-ascend/commit/78431b34694dfa3c8f54ed7cc626660318557927) |
| 58 | [@momo609](https://github.com/momo609) | 2025/6/5 | [908a851](https://github.com/vllm-project/vllm-ascend/commit/908a851a776cfd9051cc062119e6ec481561c6f7) |
| 57 | [@zhangxinyuehfad](https://github.com/zhangxinyuehfad) | 2025/6/5 | [7737aaa](https://github.com/vllm-project/vllm-ascend/commit/7737aaa40f699b233a35fb61e908b687adc1e2e5) |
| 56 | [@NINGBENZHE](https://github.com/NINGBENZHE) | 2025/6/3 | [6ec64a3](https://github.com/vllm-project/vllm-ascend/commit/6ec64a3f9686df65b5a23a41aa301e669db19099) |
| 55 | [@XWFAlone](https://github.com/XWFAlone) | 2025/5/30 | [3442fbd](https://github.com/vllm-project/vllm-ascend/commit/3442fbdb235b4c6d72c2bc64a49707a7bd89958e) |
| 54 | [@YisongJiang](https://github.com/YisongJiang) | 2025/5/29 | [90afaf6](https://github.com/vllm-project/vllm-ascend/commit/90afaf6306f680307462becf3c78585737579851) |
| 53 | [@ponix-j](https://github.com/ponix-j) | 2025/5/23 | [df58fb8](https://github.com/vllm-project/vllm-ascend/commit/df58fb80eee24139fc61c495be3ce79cf81b3f73) |
| 52 | [@ttanzhiqiang](https://github.com/ttanzhiqiang) | 2025/5/23 | [dc6172e](https://github.com/vllm-project/vllm-ascend/commit/dc6172efd3860ce95b40a7b3e93611f875f06d40) |
| 51 | [@yangpuPKU](https://github.com/yangpuPKU) | 2025/5/23 | [46df67a](https://github.com/vllm-project/vllm-ascend/commit/46df67a5e9ab73fade08cbb2d8c0155cee7316d1) |
| 50 | [@wonderful199082](https://github.com/wonderful199082) | 2025/5/20 | [5cf9ff1](https://github.com/vllm-project/vllm-ascend/commit/5cf9ff18e91b0b7031c258d71a257b8e24689763) |
| 49 | [@22dimensions](https://github.com/22dimensions) | 2025/5/17 | [a8730e7](https://github.com/vllm-project/vllm-ascend/commit/a8730e7a3c4ac6c4b39a5946c943252fdea6cce5) |
| 48 | [@cxcxflying](https://github.com/cxcxflying) | 2025/5/13 | [e564470](https://github.com/vllm-project/vllm-ascend/commit/e56447033889ca95df512208cab22ef832bfdf07) |
| 47 | [@NeverRaR](https://github.com/NeverRaR) | 2025/5/12 | [efabd72](https://github.com/vllm-project/vllm-ascend/commit/efabd722eb757e49aa309c173bbec91ca8c4ced1) |
| 46 | [@chris668899](https://github.com/chris668899) | 2025/5/8 | [6c02088](https://github.com/vllm-project/vllm-ascend/commit/6c020883a8332b5c519f4f6502733edd9b391c2b) |
| 45 | [@sunbaosong](https://github.com/sunbaosong) | 2025/5/6 | [d6bfae8](https://github.com/vllm-project/vllm-ascend/commit/d6bfae8eeebedf677b643b712d367a3a69c9cce4) |
| 44 | [@ApsarasX](https://github.com/ApsarasX) | 2025/4/29 | [87975fa](https://github.com/vllm-project/vllm-ascend/commit/87975fa058fe3f90d204ded42a08989a8dcb413e) |
| 43 | [@zouyida2052](https://github.com/zouyida2052) | 2025/4/28 | [b9528e6](https://github.com/vllm-project/vllm-ascend/commit/b9528e6ecdc417cf444e55a0ce4a2bafdef0ea3b) |
| 42 | [@ZhengJun9](https://github.com/ZhengJun9) | 2025/4/28 | [1791113](https://github.com/vllm-project/vllm-ascend/commit/17911138c90d78a76bd691e9dcb56763db35b19f) |
| 41 | [@linfeng-yuan](https://github.com/linfeng-yuan) | 2025/4/28 | [2204e4d](https://github.com/vllm-project/vllm-ascend/commit/2204e4d08f8e10cf9c30154a14eaa5ca956c2acd) |
| 40 | [@jianzs](https://github.com/jianzs) | 2025/4/27 | [fa4a5d9](https://github.com/vllm-project/vllm-ascend/commit/fa4a5d980e8845a88b9162cf169f0a5ab230f8a5) |
| 39 | [@fakeYan](https://github.com/fakeYan) | 2025/4/23 | [05bdcbe](https://github.com/vllm-project/vllm-ascend/commit/05bdcbeae47c7fcb9b1c30cad059abf1d40b5421) |
| 38 | [@RongRongStudio](https://github.com/RongRongStudio) | 2025/4/22 | [848e041](https://github.com/vllm-project/vllm-ascend/commit/848e041a54732c923660dd02daf8e9bf439736a2) |
| 37 | [@paulyu12](https://github.com/paulyu12) | 2025/4/17 | [697908f](https://github.com/vllm-project/vllm-ascend/commit/697908f5cd7c65a3a917ec1a962b0886efc98c7e) |
| 36 | [@heartStrive1998](https://github.com/heartStrive1998) | 2025/4/16 | [2f15503](https://github.com/vllm-project/vllm-ascend/commit/2f155039dc3997640854daef469bbf0cb77dc6ed) |
| 35 | [@eeethenQ](https://github.com/eeethenQ) | 2025/4/15 | [44a8301](https://github.com/vllm-project/vllm-ascend/commit/44a8301424ded94dae83e13b837f5bfc0a1bfc15) |
| 34 | [@wxsIcey](https://github.com/wxsIcey) | 2025/4/10 | [d05ea17](https://github.com/vllm-project/vllm-ascend/commit/d05ea17427b82a506b97409a7de8359f18f565f7) |
| 33 | [@yx0716](https://github.com/yx0716) | 2025/4/8 | [5d62393](https://github.com/vllm-project/vllm-ascend/commit/5d6239306be9b0f5ac6dbaa137048c372a92ff20) |
| 32 | [@celestialli](https://github.com/celestialli) | 2025/4/7 | [2b765dc](https://github.com/vllm-project/vllm-ascend/commit/2b765dcc4974b1bafc26ff5da817ce7e652f0eb0) |
| 31 | [@hfadzxy](https://github.com/hfadzxy) | 2025/3/30 | [7beb433](https://github.com/vllm-project/vllm-ascend/commit/7beb4339dc8047af9ef64db1d0a8c59ddbb3709f) |
| 30 | [@wuhuikx](https://github.com/wuhuikx) | 2025/3/28 | [57a84bb](https://github.com/vllm-project/vllm-ascend/commit/57a84bb7befeaa0dc62aa35fa406e4d6affbfcca) |
| 29 | [@zzzzwwjj](https://github.com/zzzzwwjj) | 2025/3/28 | [12390af](https://github.com/vllm-project/vllm-ascend/commit/12390af075962456ecc8233d8dcce7064b75f390) |
| 28 | [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/3/28 | [27e86b9](https://github.com/vllm-project/vllm-ascend/commit/27e86b993a6a810d818143ec9dbfc439a419fa77) |
| 27 | [@ZhengZhenyu](https://github.com/ZhengZhenyu) | 2025/3/26 | [0b5a964](https://github.com/vllm-project/vllm-ascend/commit/0b5a9643fd6c3240d7ede669e37209d7ff433841) |
| 26 | [@baifanxxx](https://github.com/baifanxxx) | 2025/3/26 | [1225052](https://github.com/vllm-project/vllm-ascend/commit/122505208ff6284f409846ca7294f4a4b9883285) |
| 25 | [@rjg-lyh](https://github.com/rjg-lyh) | 2025/3/13 | [6512470](https://github.com/vllm-project/vllm-ascend/commit/65124705fb39d4cc2c94c80254421e067a82fe50) |
| 24 | [@xiemingda-1002](https://github.com/xiemingda-1002) | 2025/3/12 | [59ea23d](https://github.com/vllm-project/vllm-ascend/commit/59ea23d0d394879d7f33de6fd22242539b9c3cc5) |
| 23 | [@yiz-liu](https://github.com/yiz-liu) | 2025/3/11 | [0db6670](https://github.com/vllm-project/vllm-ascend/commit/0db6670bfab8cb1d84c9e7270df0a1d42d6ce7ca) |
| 22 | [@new-TonyWang](https://github.com/new-TonyWang) | 2025/3/11 | [dfb4e23](https://github.com/vllm-project/vllm-ascend/commit/dfb4e23e9d820ac992a071c123bbe983c7b01b2e) |
| 21 | [@mengwei805](https://github.com/mengwei805) | 2025/3/6 | [8fcf3d1](https://github.com/vllm-project/vllm-ascend/commit/8fcf3d1704084626db35c5dc82ade446508598d4) |
| 20 | [@baymax591](https://github.com/baymax591) | 2025/2/28 | [e8131b9](https://github.com/vllm-project/vllm-ascend/commit/e8131b99cf199f50a304e6e6fb125a1b95bcc92b) |
| 19 | [@dependabot](https://github.com/dependabot) | 2025/2/27 | [a5564ed](https://github.com/vllm-project/vllm-ascend/commit/a5564ed5d8fd9818936a22d9ea35951a27513b4c) |
| 18 | [@shink](https://github.com/shink) | 2025/2/27 | [6aed833](https://github.com/vllm-project/vllm-ascend/commit/6aed83335cbe92fd0b8ef07c28966a753d012ccb) |
| 17 | [@wwfu109](https://github.com/wwfu109) | 2025/2/27 | [b074047](https://github.com/vllm-project/vllm-ascend/commit/b07404766bdaf6e3cebc5cb0aba89a247501302e) |
| 16 | [@kunpengW-code](https://github.com/kunpengW-code) | 2025/2/26 | [ca807ce](https://github.com/vllm-project/vllm-ascend/commit/ca807ce49ed64aa89242f5ae29b9862a77648b45) |
| 15 | [@Yaphets24](https://github.com/Yaphets24) | 2025/2/22 | [d0b3cb4](https://github.com/vllm-project/vllm-ascend/commit/d0b3cb4fa79d5fc7f8245a3c68885ce1fa030ba4) |
| 14 | [@noemotiovon](https://github.com/noemotiovon) | 2025/2/21 | [202b39a](https://github.com/vllm-project/vllm-ascend/commit/202b39a38c2869b0ecc3df486550fb555a2eb0c0) |
| 13 | [@SidaoY](https://github.com/SidaoY) | 2025/2/18 | [718c763](https://github.com/vllm-project/vllm-ascend/commit/718c7638555d12cd43ea2a9e497e185778b68595) |
| 12 | [@ShiyaNiu](https://github.com/ShiyaNiu) | 2025/2/17 | [36ea38f](https://github.com/vllm-project/vllm-ascend/commit/36ea38fde56437ff1745bd95cd8d9e02a6578d38) |
| 11 | [@ji-huazhong](https://github.com/ji-huazhong) | 2025/2/12 | [c8b57d1](https://github.com/vllm-project/vllm-ascend/commit/c8b57d10b24efcd9b4fadeb66cfbf66aa3dd5f82) |
| 10 | [@Angazenn](https://github.com/Angazenn) | 2025/2/11 | [7637759](https://github.com/vllm-project/vllm-ascend/commit/7637759056028839c74960d9cfd3ce6275ee5d35) |
| 9 | [@whx-sjtu](https://github.com/whx-sjtu) | 2025/2/7 | [8fc5dc9](https://github.com/vllm-project/vllm-ascend/commit/8fc5dc966aaf4e174d1ec0d1902c40289411ec0e) |
| 8 | [@zouyida2002](https://github.com/zouyida2002) | 2025/2/7 | [4495fc6](https://github.com/vllm-project/vllm-ascend/commit/4495fc68389e3fb1ef14534c202948931e38446b) |
| 7 | [@hw_whx](https://github.com/hw_whx) | 2025/2/7 | [7d16772](https://github.com/vllm-project/vllm-ascend/commit/7d1677263bc6628ade33bb780455e0f6e5b9b27a) |
| 6 | [@MengqingCao](https://github.com/MengqingCao) | 2025/2/6 | [7d9ae22](https://github.com/vllm-project/vllm-ascend/commit/7d9ae22ecb6dc3ea4e720e5109cf46e1ae7da730) |
| 5 | [@Potabk](https://github.com/Potabk) | 2025/2/6 | [8cb5615](https://github.com/vllm-project/vllm-ascend/commit/8cb5615fb010b34c2f4f89e03e6257bfee851f86) |
| 4 | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/2/6 | [a48b9ad](https://github.com/vllm-project/vllm-ascend/commit/a48b9addefd292af523644411d4ff4142dd4bc66) |
| 3 | [@shen-shanshan](https://github.com/shen-shanshan) | 2025/2/6 | [bfccf73](https://github.com/vllm-project/vllm-ascend/commit/bfccf739e2fe121b54d9b198c2ec205a9379190e) |
| 2 | [@Yikun](https://github.com/Yikun) | 2025/2/5 | [d5e7756](https://github.com/vllm-project/vllm-ascend/commit/d5e7756028bd5884ade96b654555c375770a2f64) |
| 1 | [@simon-mo](https://github.com/simon-mo) | 2025/1/29 | [eb28342](https://github.com/vllm-project/vllm-ascend/commit/eb283428ddc17207b6866118f9bc15454b5b8801) |

View File

@ -0,0 +1,48 @@
# Governance
## Mission
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, and to actively contribute to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
## Governance - Mechanics
vLLM Ascend is an open-source project under the vLLM community, where the authority to appoint roles is ultimately determined by the vLLM community. It adopts a hierarchical technical governance structure.
- Contributor:
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code
**Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in the project, including but not limited to issues/reviews/commits/community involvement.
Contributors will be granted `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo to help community developers collaborate more efficiently.
- Maintainer:
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).
**Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.
- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.
Maintainers will be granted write permissions (`Can read, clone, and push to this repository. Can also manage issues and pull requests`) on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo.
## Nominating and Removing Maintainers
### The Principles
- Membership in vLLM Ascend is given to individuals on a merit basis after they have demonstrated strong expertise in vLLM / vLLM Ascend through contributions, reviews and discussions.
- For membership in the maintainer group, the individual has to demonstrate strong and continued alignment with the overall vLLM / vLLM Ascend principles.
- Light criteria apply for moving a maintainer to emeritus status if they don't actively participate over long periods of time.
- The membership is for an individual, not a company.
### Nomination and Removal
- Nomination: Anyone can nominate someone to become a maintainer (self-nomination included). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information about the candidate's strengths as a maintainer, including but not limited to review quality, quality of contributions, and community involvement.
- Removal: Anyone can nominate a person to be removed from the maintainer position (self-nomination included). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information about the nominee, including but not limited to lack of activity, conflict with the overall direction, and anything else that makes them unfit to be a maintainer.

View File

@ -0,0 +1,19 @@
# User Stories
Read case studies on how users and developers solve real, everyday problems with vLLM Ascend
- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models; it has supported vLLM Ascend to speed up inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gaining a 2x inference performance improvement.
- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO; it has used vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPU.
- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plug-in library developed by Huawei on Ascend hardware, which includes self-developed large language model optimization algorithms and optimizations related to the inference engine framework. It supports vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2), see more GPUStack performance evaluation info on [link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient and production-ready RL training library for large language models (LLMs), uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0), see more info on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
:::{toctree}
:caption: More details
:maxdepth: 1
llamafactory
:::

View File

@ -0,0 +1,19 @@
# LLaMA-Factory
**About / Introduction**
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate and run inference with the model after fine-tuning it.
**The Business Challenge**
LLaMA-Factory used transformers to perform inference on Ascend NPU, but the speed was slow.
**Solving Challenges and Benefits with vLLM Ascend**
With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the performance of LLaMA-Factory in the model inference stage has been significantly improved. According to the test results, the inference speed of LLaMA-Factory has been increased to 2x that of the Transformers-based version.
**Learn more**
See more about LLaMA-Factory and how it uses vLLM Ascend for inference on the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).

View File

@ -0,0 +1,110 @@
# Versioning policy
Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows [PEP 440](https://peps.python.org/pep-0440/) and publishes releases that match vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
## vLLM Ascend Plugin versions
Each vLLM Ascend release will be versioned: `v[major].[minor].[micro][rcN][.postN]` (such as
`v0.7.3rc1`, `v0.7.3`, `v0.7.3.post1`)
- **Final releases**: typically published every **3 months**, taking both the vLLM upstream release plan and the Ascend software product release plan into consideration.
- **Pre releases**: typically published **on demand**, ending with `rcN` (the Nth release candidate), to support early testing by our users prior to a final release.
- **Post releases**: typically published **on demand** to address minor errors in a final release. Unlike the [PEP 440 post-release](https://peps.python.org/pep-0440/#post-releases) suggestion, a post release may contain actual bug fixes, because the final release version must strictly match the vLLM final release version (`v[major].[minor].[micro]`). A post version has to be published as a patch version of the final release.
For example (the PEP 440 ordering of these versions is sketched below):
- `v0.7.x`: the first final release, matching the vLLM `v0.7.x` version.
- `v0.7.3rc1`: the first pre-release version of vLLM Ascend.
- `v0.7.3.post1`: the post release published if the `v0.7.3` release contains minor errors.
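The short sketch below is an illustration only and is not part of vllm-ascend; it uses the `packaging` library, which implements PEP 440, to confirm that a release candidate sorts before its final release and a post release sorts after it.
```python
# Illustration only: check the PEP 440 ordering assumed by this versioning scheme.
from packaging.version import Version

rc = Version("0.7.3rc1")       # pre release (release candidate)
final = Version("0.7.3")       # final release
post = Version("0.7.3.post1")  # post release

# A release candidate precedes the final release, which precedes its post release.
assert rc < final < post
print(f"{rc} < {final} < {post}")
```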
## Release Compatibility Matrix
Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | MindIE Turbo |
|-------------|--------------|------------------|-------------|--------------------|--------------|
| v0.9.2rc1 | v0.9.2 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250619 | |
| v0.9.1rc1 | v0.9.1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250528 | |
| v0.9.0rc2 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.9.0rc1 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.8.5rc1 | v0.8.5.post1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.8.4rc2 | v0.8.4 | >= 3.9, < 3.12 | 8.0.0 | 2.5.1 / 2.5.1 | |
| v0.7.3.post1| v0.7.3 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | 2.0rc1 |
| v0.7.3 | v0.7.3 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | 2.0rc1 |
## Release cadence
### release window
| Date | Event |
|------------|-------------------------------------------|
| 2025.07.11 | Release candidates, v0.9.2rc1 |
| 2025.06.22 | Release candidates, v0.9.1rc1 |
| 2025.06.10 | Release candidates, v0.9.0rc2 |
| 2025.06.09 | Release candidates, v0.9.0rc1 |
| 2025.05.29 | v0.7.x post release, v0.7.3.post1 |
| 2025.05.08 | v0.7.x Final release, v0.7.3 |
| 2025.05.06 | Release candidates, v0.8.5rc1 |
| 2025.04.28 | Release candidates, v0.8.4rc2 |
| 2025.04.18 | Release candidates, v0.8.4rc1 |
| 2025.03.28 | Release candidates, v0.7.3rc2 |
| 2025.03.14 | Release candidates, v0.7.3rc1 |
| 2025.02.19 | Release candidates, v0.7.1rc1 |
## Branch policy
vLLM Ascend has main branch and dev branch.
- **main**: the main branch, corresponding to the vLLM main branch and the latest 1 or 2 release versions. It is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branches, created for selected new releases of vLLM. For example, `v0.7.3-dev` is the dev branch for the vLLM `v0.7.3` version.
Usually, a commit should first be merged ONLY into the main branch and then backported to the dev branch, to reduce maintenance costs as much as possible.
### Maintenance branch and EOL:
The branch status will be in one of the following states:
| Branch | Time frame | Summary |
|-------------------|----------------------------------|----------------------------------------------------------------------|
| Maintained | Approximately 2-3 minor versions | All bugfixes are appropriate. Releases produced, CI commitment. |
| Unmaintained | Community interest driven | All bugfixes are appropriate. No Releases produced, No CI commitment |
| End of Life (EOL) | N/A | Branch no longer accepting changes |
### Branch state
Note that vLLM Ascend is only released for certain vLLM release versions rather than all of them. Hence, you might see that only some versions have dev branches (such as only `0.7.1-dev` / `0.7.3-dev` but no `0.7.2-dev`); this is expected.
Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend branch that supports its latest patch version (for example, we plan to support version 0.7.3), as shown below:
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.2 branch |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |
### Backward compatibility
For the main branch, vLLM Ascend should work with the vLLM main branch and the latest 1 or 2 release versions. To ensure backward compatibility, we do the following:
- Both the main branch and the target vLLM release are tested by the Ascend E2E CI. For example, the vLLM main branch and vLLM 0.8.4 are tested currently.
- For code changes, we make sure that they are compatible with the latest 1 or 2 vLLM release versions as well. For this purpose, vLLM Ascend has a version check mechanism in the code: it first checks the version of the installed vLLM package to decide which code logic to use. If users hit an `InvalidVersion` error, it sometimes means they have installed a dev/editable version of the vLLM package; in that case, the environment variable `VLLM_VERSION` lets users specify which vLLM version to use (see the sketch after this list).
- For documentation changes, we make sure that they are compatible with the latest 1 or 2 vLLM release versions as well. A note should be added if there are any breaking changes.
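As a hedged illustration of how such a check can be structured (this is a sketch under assumptions, not the actual vllm-ascend implementation; the helper name `resolve_vllm_version` is hypothetical):
```python
# Sketch only: not the real vllm-ascend code; resolve_vllm_version is hypothetical.
import os

import vllm
from packaging.version import InvalidVersion, Version


def resolve_vllm_version() -> Version:
    """Return the vLLM version used to choose compatible code paths."""
    # A dev/editable vLLM install may report a string that PEP 440 cannot parse;
    # exporting VLLM_VERSION (e.g. VLLM_VERSION=0.9.2) lets users override it.
    raw = os.getenv("VLLM_VERSION", vllm.__version__)
    try:
        return Version(raw)
    except InvalidVersion as err:
        raise RuntimeError(
            f"Cannot parse vLLM version {raw!r}; "
            "set VLLM_VERSION to the matching release, e.g. VLLM_VERSION=0.9.2"
        ) from err
```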
## Document Branch Policy
To reduce maintenance costs, **all branch documentation content should remain consistent, and version differences can be controlled via variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/conf.py)**. While this is not a simple task, it is a principle we should strive to follow.
| Version | Purpose | Code Branch |
|-----|-----|---------|
| latest | Doc for the latest dev branch | vX.Y.Z-dev (Will be `main` after the first final release) |
| version | Doc for historical released versions | Git tags, like vX.Y.Z[rcN] |
| stable (not yet released) | Doc for the latest final release branch | Will be `vX.Y.Z-dev` after the first official release |
As shown above:
- `latest` documentation: Matches the current maintenance branch `vX.Y.Z-dev` (Will be `main` after the first final release). Continuously updated to ensure usability for the latest release.
- `version` documentation: Corresponds to specific released versions (e.g., `v0.7.3`, `v0.7.3rc1`). No further updates after release.
- `stable` documentation (**not yet released**): Official release documentation. Updates are allowed in real-time after release, typically based on vX.Y.Z-dev. Once stable documentation is available, non-stable versions should display a header warning: `You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.`.
## Software Dependency Management
- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
The PyPI stable version **CAN** be used in vLLM Ascend final versions, the monthly dev version can **ONLY** be used in
vLLM Ascend RC versions for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch.

View File

@ -1,7 +1,5 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/docs/source/conf.py
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -15,6 +13,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/docs/source/conf.py
#
# -- Path setup --------------------------------------------------------------
@ -23,7 +23,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
import json
import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
@ -63,15 +65,19 @@ myst_substitutions = {
# the branch of vllm, used in vllm clone
# - main branch: 'main'
# - vX.Y.Z branch: 'vX.Y.Z'
'vllm_version': 'main',
'vllm_version': 'v0.9.2',
# the branch of vllm-ascend, used in vllm-ascend clone and image tag
# - main branch: 'main'
# - vX.Y.Z branch: latest vllm-ascend release tag
'vllm_ascend_version': 'main',
'vllm_ascend_version': 'v0.9.2rc1',
# the newest release version of vllm-ascend and matched vLLM, used in pip install.
# This value should be updated when cut down release.
'pip_vllm_ascend_version': "v0.7.1rc1",
'pip_vllm_version': "v0.7.1",
'pip_vllm_ascend_version': "0.9.2rc1",
'pip_vllm_version': "0.9.2",
# CANN image tag
'cann_image_tag': "8.1.rc1-910b-ubuntu22.04-py3.10",
# vllm version in ci
'ci_vllm_version': 'v0.9.2',
}
# Add any paths that contain templates here, relative to this directory.
@ -117,6 +123,20 @@ html_theme_options = {
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']
READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
if READTHEDOCS_VERSION_TYPE == "tag":
# remove the warning banner if the version is a tagged release
header_file = os.path.join(os.path.dirname(__file__),
"_templates/sections/header.html")
# The file might be removed already if the build is triggered multiple times
# (readthedocs build both HTML and PDF versions separately)
if os.path.exists(header_file):
os.remove(header_file)
def setup(app):
pass
if __name__ == "__main__":
print(json.dumps(myst_substitutions))

View File

@ -1,102 +0,0 @@
# Contribution Guide
## Building and testing
We recommend building and testing in your local development environment before submitting a PR.
### Prepare environment and build
Theoretically, the vllm-ascend build is only supported on Linux, because the `vllm-ascend` dependency `torch_npu` only supports Linux.
However, you can still set up a development environment on Linux/Windows/macOS for linting and basic tests with the following commands:
```bash
# Choose a base dir (~/vllm-project/) and create a python virtual environment
cd ~/vllm-project/
python3 -m venv .venv
source ./.venv/bin/activate
# Clone and install vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..
# Clone and install vllm-ascend
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -r requirements-dev.txt
# Run lint and mypy checks with the following script
bash format.sh
# Build:
# - Currently a full build is only supported on Linux (torch_npu limitation)
# pip install -e .
# - To build and install on other OSes, skip the dependencies
# - build without deps for debugging in other OS
# pip install -e . --no-deps
# Commit your change using `-s`
git commit -sm "your commit info"
```
### Testing
Although the vllm-ascend CI provides integration tests on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can also run them locally. The simplest way to run these integration tests locally is through a container:
```bash
# Under an Ascend NPU environment
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
IMAGE=vllm-ascend-dev-image
CONTAINER_NAME=vllm-ascend-dev
DEVICE=/dev/davinci1
# The first build will take about 10 mins (at 10MB/s) to download the base image and packages
docker build -t $IMAGE -f ./Dockerfile .
# You can also specify the mirror repo via VLLM_REPO to speed things up
# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
docker run --name $CONTAINER_NAME --network host --device $DEVICE \
--device /dev/davinci_manager --device /dev/devmm_svm \
--device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-ti --rm $IMAGE bash
cd vllm-ascend
pip install -r requirements-dev.txt
pytest tests/
```
## Developer Certificate of Origin (DCO)
When submitting contributions to this project, you must agree to the DCO. Commits must include a "Signed-off-by:" header which certifies agreement with the terms of the DCO.
Using `-s` with `git commit` will add this header automatically.
## PR title and classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Attention]` for new features or optimizations in `attention`
- `[Communicator]` for new features or optimizations in `communicators`
- `[ModelRunner]` for new features or optimizations in `model runner`
- `[Platform]` for new features or optimizations in `platform`
- `[Worker]` for new features or optimizations in `worker`
- `[Core]` for new features or optimizations in the core logic of `vllm-ascend` (such as `platform, attention, communicators, model runner`)
- `[Kernel]` for changes affecting compute kernels and ops
- `[Bugfix]` for bug fixes
- `[Doc]` for documentation fixes and improvements
- `[Test]` for tests (such as unit tests)
- `[CI]` for build or continuous integration improvements
- `[Misc]` for PRs that do not fit any of the above categories; please use this prefix sparingly
> [!NOTE]
> If the PR spans more than one category, please include all relevant prefixes.
## Others
You can find more information about contributing to the vLLM Ascend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem while contributing, feel free to submit a PR to improve the doc and help other developers.

View File

@ -4,7 +4,7 @@
It's recommended to set up a local development environment to build and test
before you submit a PR.
### Prepare environment and build
### Setup development environment
Theoretically, the vllm-ascend build is only supported on Linux because
`vllm-ascend` dependency `torch_npu` only supports Linux.
@ -12,68 +12,64 @@ Theoretically, the vllm-ascend build is only supported on Linux because
But you can still set up a dev env on Linux/Windows/macOS for linting and basic
tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
python3 -m venv .venv
source ./.venv/bin/activate
# Clone vllm code and install
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..
# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -r requirements-dev.txt
# Then you can run lint and mypy test
# Install lint requirement and enable pre-commit hook
pip install -r requirements-lint.txt
# Run lint (you need to install the pre-commit deps via a proxy network the first time)
bash format.sh
```
# Build:
# - only supported on Linux (torch_npu available)
# pip install -e .
# - build without deps for debugging in other OS
# pip install -e . --no-deps
#### Run CI locally
After complete "Run lint" setup, you can run CI locally:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# Running the CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..
# Install requirements
cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
# Run ci:
bash format.sh ci
```
#### Submit the commit
```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```
### Testing
🎉 Congratulations! You have completed the development environment setup.
Although the vllm-ascend CI provides integration tests on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can also run them
locally. The simplest way to run these integration tests locally is through a container:
### Test locally
```bash
# Under Ascend NPU environment
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
IMAGE=vllm-ascend-dev-image
CONTAINER_NAME=vllm-ascend-dev
DEVICE=/dev/davinci1
# The first build will take about 10 mins (10MB/s) to download the base image and packages
docker build -t $IMAGE -f ./Dockerfile .
# You can also specify the mirror repo via setting VLLM_REPO to speedup
# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
docker run --name $CONTAINER_NAME --network host --device $DEVICE \
--device /dev/davinci_manager --device /dev/devmm_svm \
--device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-ti --rm $IMAGE bash
cd vllm-ascend
pip install -r requirements-dev.txt
pytest tests/
```
You can refer to the [Testing](./testing.md) doc for help setting up the testing environment and running tests locally.
## DCO and Signed-off-by
@ -106,3 +102,10 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, you can feel free to submit a PR to improve the doc to help other developers.
:::{toctree}
:caption: Index
:maxdepth: 1
testing
:::

View File

@ -0,0 +1,280 @@
# Testing
This section explains how to write e2e tests and unit tests to verify the implementation of your feature.
## Setup test environment
The fastest way to set up the test environment is to use the main branch container image:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
You can run the unit tests on CPU with the following steps:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# ls
# vllm vllm-ascend
# Use mirror to speedup download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
-v $(pwd):/vllm-project \
-v ~/.cache:/root/.cache \
-ti $IMAGE bash
# (Optional) Configure mirror to speedup download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
# For torch-npu dev version or x86 machine
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
# Install vllm
cd /vllm-project/vllm
VLLM_TARGET_DEVICE=empty python3 -m pip -v install .
# Install vllm-ascend
cd /vllm-project/vllm-ascend
# [IMPORTANT] Export LD_LIBRARY_PATH so the CANN libraries can be found when running on CPU
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -v .
```
::::
::::{tab-item} Single card
:sync: single
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
::::{tab-item} Multi cards
:sync: multi
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
cd /vllm-workspace/vllm-ascend/
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
:::::
## Running tests
### Unit test
There are several principles to follow when writing unit tests:
- The test file path should mirror the source file path and use the `test_` prefix, for example: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework; see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to learn how to write unit tests.
- All unit tests must be runnable on CPU, so you have to mock device-related functions so that they run on the host (a minimal sketch is shown after the tabs below).
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
```bash
# Run unit tests
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
```
::::
::::{tab-item} Single card
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
:::::
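To illustrate the mocking principle mentioned above, here is a minimal, self-contained sketch in the unittest style. The `get_soc_version` helper is a stand-in defined inside the test itself, not a real vllm-ascend function; in a real test you would patch the actual device-related helper your code calls:
```python
import unittest
from unittest import mock


def get_soc_version() -> str:
    """Stand-in for a device-related helper that would normally query the NPU."""
    raise RuntimeError("This call requires an Ascend NPU.")


class TestDeviceMocking(unittest.TestCase):

    def test_device_call_is_mocked_on_cpu(self):
        # Patch the device-related helper so the test can run on a CPU-only host.
        with mock.patch(f"{__name__}.get_soc_version", return_value="Ascend910B2"):
            self.assertEqual(get_soc_version(), "Ascend910B2")


if __name__ == "__main__":
    unittest.main()
```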
### E2E test
Although the vllm-ascend CI provides [e2e tests](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend hardware, you can also run them
locally.
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:sync: cpu
You can't run e2e test on CPU.
::::
::::{tab-item} Single card
:selected:
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all multi card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_batchsize.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```
::::
:::::
This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
#### E2E test example:
- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
CI resources are limited, so you might need to reduce the number of layers in the model. Below is an example of how to generate a reduced-layer model:
1. Fork the original model repo on ModelScope; we need all the files in the repo except the weights.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`.
3. Copy the following Python script as `generate_random_weight.py`. Set the parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
```python
import torch
from transformers import AutoConfig

# modeling_deepseek.py comes from the forked model repo (trust_remote_code files)
from modeling_deepseek import DeepseekV3ForCausalLM

MODEL_LOCAL_PATH = "~/.cache/modelscope/models/vllm-ascend/DeepSeek-V3-Pruning"
DIST_DTYPE = torch.bfloat16
DIST_MODEL_PATH = "./random_deepseek_v3_with_2_hidden_layer"

config = AutoConfig.from_pretrained(MODEL_LOCAL_PATH, trust_remote_code=True)
# Building the model from the reduced config initializes random weights
model = DeepseekV3ForCausalLM(config)
model = model.to(DIST_DTYPE)
model.save_pretrained(DIST_MODEL_PATH)
```
### Run doctest
vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` script to run all doctests in the doc files.
Doctests are a good way to make sure the docs stay up to date and the examples stay executable. You can run them locally as follows:
```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```
This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).

View File

@ -0,0 +1,6 @@
# Accuracy Report
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
:::

View File

@ -0,0 +1,10 @@
# Accuracy
:::{toctree}
:caption: Accuracy
:maxdepth: 1
using_evalscope
using_lm_eval
using_opencompass
accuracy_report/index
:::

View File

@ -0,0 +1,173 @@
# Using EvalScope
This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online serving
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service starts successfully, you will see the info shown below:
```
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
## 2. Install EvalScope using pip
You can install EvalScope by using:
```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
## 3. Run gsm8k accuracy test using EvalScope
You can run the gsm8k accuracy test using `evalscope eval`:
```
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
--api-url http://localhost:8000/v1 \
--api-key EMPTY \
--eval-type service \
--datasets gsm8k \
--limit 10
```
After 1-2 mins, the output is as shown below:
```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k | AverageAccuracy | main | 10 | 0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more detail in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope
### Install EvalScope[perf] using pip
```shell
pip install evalscope[perf] -U
```
### Basic usage
You can run a perf test using `evalscope perf`:
```
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
--parallel 5 \
--model Qwen/Qwen2.5-7B-Instruct \
--number 20 \
--api openai \
--dataset openqa \
--stream
```
### Output results
After 1-2 mins, the output is as shown below:
```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key | Value |
+===================================+===============================================================+
| Time taken for tests (s) | 38.3744 |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency | 5 |
+-----------------------------------+---------------------------------------------------------------+
| Total requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests | 0 |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s) | 132.6926 |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s) | 158.8819 |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s) | 0.5212 |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s) | 8.3612 |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s) | 0.1035 |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329 |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request | 50.25 |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s) | 0.0324 |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+
Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10% | 0.0962 | 0.031 | 4.4571 | 42 | 135 | 29.9767 |
| 25% | 0.0971 | 0.0318 | 6.3509 | 47 | 193 | 30.2157 |
| 50% | 0.0987 | 0.0321 | 9.3387 | 49 | 285 | 30.3969 |
| 66% | 0.1017 | 0.0324 | 9.8519 | 52 | 302 | 30.5182 |
| 75% | 0.107 | 0.0328 | 10.2391 | 55 | 313 | 30.6124 |
| 80% | 0.1221 | 0.0329 | 10.8257 | 58 | 330 | 30.6759 |
| 90% | 0.1245 | 0.0333 | 13.0472 | 62 | 404 | 30.9644 |
| 95% | 0.1247 | 0.0336 | 14.2936 | 66 | 432 | 31.6691 |
| 98% | 0.1247 | 0.0353 | 14.2936 | 66 | 432 | 31.6691 |
| 99% | 0.1247 | 0.0627 | 14.2936 | 66 | 432 | 31.6691 |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
See more detail in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).

View File

@ -0,0 +1,62 @@
# Using lm-eval
This document guides you through accuracy testing using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness).
## 1. Run docker container
You can run a docker container on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
## 2. Run ceval accuracy test using lm-eval
Install lm-eval in the container.
```bash
pip install lm-eval
```
Run the following command:
```
# Only test ceval-valid-computer_network dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4,tensor_parallel_size=1 \
--tasks ceval-valid_computer_network \
--batch_size 8
```
After 1-2 mins, the output is as shown below:
```
The markdown format results is as below:
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network| 2|none | 0|acc |↑ |0.6842|± |0.1096|
| | |none | 0|acc_norm|↑ |0.6842|± |0.1096|
```
You can see more usage on [Lm-eval Docs](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md).

View File

@ -0,0 +1,120 @@
# Using OpenCompass
This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Serving
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service starts successfully, you will see the info shown below:
```
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
## 2. Run ceval accuracy test using OpenCompass
Install OpenCompass and configure the environment variables in the container.
```bash
# Pin Python 3.10 due to:
# https://github.com/open-compass/opencompass/issues/1976
conda create -n opencompass python=3.10
conda activate opencompass
pip install opencompass modelscope[framework]
export DATASET_SOURCE=ModelScope
git clone https://github.com/open-compass/opencompass.git
```
Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
```python
from mmengine.config import read_base
from opencompass.models import OpenAISDK
with read_base():
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
# Only test ceval-computer_network dataset in this demo
datasets = ceval_datasets[:1]
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = [
dict(
abbr='Qwen2.5-7B-Instruct-vLLM-API',
type=OpenAISDK,
key='EMPTY', # API key
openai_api_base='http://127.0.0.1:8000/v1',
path='Qwen/Qwen2.5-7B-Instruct',
tokenizer_path='Qwen/Qwen2.5-7B-Instruct',
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=1,
max_out_len=1024,
max_seq_len=4096,
temperature=0.01,
batch_size=8,
retry=3,
)
]
```
Run the following command:
```
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```
After 1-2 mins, the output is as shown below:
```
The markdown format results is as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|
| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
```
You can see more usage on [OpenCompass Docs](https://opencompass.readthedocs.io/en/latest/index.html).

View File

@ -0,0 +1,9 @@
# Feature Guide
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
patch
:::

View File

@ -0,0 +1,82 @@
# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Because the release cycles of vLLM and vLLM Ascend are different, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In the vLLM Ascend code, we provide a patch module, `vllm_ascend/patch`, to carry these changes to vLLM.
## Principle
Keep in mind that patching is not the best way to make vLLM Ascend compatible; it is only a temporary solution. The best way is to contribute the change to vLLM itself so that it is compatible with vLLM Ascend out of the box. In vLLM Ascend, we follow these basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's the only way currently.
2. Once a patch is added, it's required to describe the future plan for removing the patch.
3. Cleaning up patch code is always welcome.
## How it works
In `vllm_ascend/patch`, you can see the code structure as follows:
```
vllm_ascend
├── patch
│ ├── platform
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
│ ├── worker
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
└───────────
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
In both the **platform** and **worker** folders, there are several patch modules. They are used for patching different versions of vLLM (a simplified selection sketch follows this list):
- `patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version always tracks the most recent released version of vLLM. Once a new vLLM version is released, we will drop this patch module and bump to the new version.
- `patch_main`: This module is used for patching the code in vLLM main branch.
- `patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM main branch.
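The mapping from vLLM version to patch package can be pictured with the simplified sketch below; this is only an illustration of the intent, not the actual vllm-ascend selection code:
```python
# Simplified illustration only -- not the actual vllm-ascend dispatch logic.
from vllm import __version__ as vllm_version

if vllm_version.startswith("0.9.2"):
    # Version-specific patches for the nearest released vLLM
    import vllm_ascend.patch.platform.patch_0_9_2  # noqa: F401
else:
    # Patches tracking the vLLM main branch
    import vllm_ascend.patch.platform.patch_main  # noqa: F401

# Patches that apply to both vLLM 0.9.2 and main
import vllm_ascend.patch.platform.patch_common  # noqa: F401
```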
## How to write a patch
Before writing a patch, follow the principles above and patch as little code as possible. If necessary, the code can be patched in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.2 and main of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm
def patch_destroy_model_parallel():
# your patch code
...
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
# How
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
# Future Plan:
# <Describe the future plan to remove the patch>
```
7. Add unit tests and e2e tests. Any newly added code in vLLM Ascend should be covered by unit tests and e2e tests as well. You can find more details in the [test guide](../contribution/testing.md).
## Limitation
1. In the V1 engine, vLLM starts three kinds of processes: the main process, the EngineCore process and the worker process. Currently, vLLM Ascend only supports patching code in the main process and the worker process by default. If you want to patch code that runs in the EngineCore process, you have to patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the vLLM version may change automatically. For example, if you run an edited vLLM based on v0.9.n, its version may become v0.9.nxxx. In that case the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend cannot determine which version of vLLM you are using. You can set the environment variable `VLLM_VERSION` to specify the vLLM version explicitly, and then the corresponding version patch (for example, for v0.9.2) will work, as shown below.
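A minimal sketch of that workaround, assuming your edited tree is based on vLLM v0.9.2 (the served model is just an example):
```bash
# Tell vLLM Ascend to treat the locally edited vLLM as v0.9.2,
# so that the patch_0_9_2 module is applied.
export VLLM_VERSION=0.9.2
vllm serve Qwen/Qwen2.5-7B-Instruct
```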

View File

@ -0,0 +1,258 @@
# Adding a New Model
This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to
[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementation as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please check out vllm's implementations for other kinds of models):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
- `*DecoderLayer` (transformer block)
- `*Attention` and `*MLP` (specific computation unit)
:::{note}
`*` denotes your model's unique identifier.
:::
3. Critical Implementation Details:
All modules must include a `prefix` argument in `__init__()`.
**Required interfaces:**
| Module Type | Required Methods |
| :------------------- | :---------------------------------------- |
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention Backend Integration:
Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor Parallelism:
Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
```python
from collections.abc import Iterable
from typing import Optional, Union
import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata
class CustomAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement attention logic
...
class CustomDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement decoder layer
...
class CustomModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList([
CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
])
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
class CustomModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def compute_logits(self,
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata) -> torch.Tensor:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
```
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. Below, we demonstrate how to inherit from the base class and implement a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional, Union
import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors
class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
# Define merged weights for quantization/efficiency
packed_modules_mapping = {
"gate_up_proj": ["gate_proj", "up_proj"],
"experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: Optional[List[torch.Tensor]] = None,
attn_metadata: Optional[AttentionMetadata] = None,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
# Custom forward logic
hidden_states = self.model(
input_ids,
positions,
kv_caches,
attn_metadata,
intermediate_tensors,
inputs_embeds
)
return hidden_states
```
:::{note}
For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
To integrate your implemented model from `vllm_ascend/models/` directory:
1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
2. Register the model wrapper class via `vllm.ModelRegistry.register_model()` function.
**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):
```python
from vllm import ModelRegistry
def register_model():
from .custom_model import CustomModelForCausalLM # New custom model
    from .deepseek_v2 import CustomDeepseekV2ForCausalLM  # Customized DeepSeek
# For NEW architectures: Register with unique name
ModelRegistry.register_model(
"CustomModelForCausalLM", # Must match config.json's 'architectures'
"vllm_ascend.models.custom_model:CustomModelForCausalLM"
)
# For MODIFIED architectures: Use original name
ModelRegistry.register_model(
"DeepseekV2ForCausalLM", # Original architecture identifier in vLLM
"vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM "
)
```
:::{note}
The first argument of `vllm.ModelRegistry.register_model()` indicates the unique architecture identifier which must match `architectures` in `config.json` of the model.
```json
{
"architectures": [
"CustomModelForCausalLM"
],
}
```
:::
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architecture
If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), when executing vllm offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/model_executor/models/registry.py`.
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architecture
If you're registering a novel model architecture not present in vllm (creating a completely new class), current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vllm.
## Step 4: Testing
After adding a new model, we should run basic functional tests (offline/online inference), accuracy tests and performance benchmarks for the model.
Find more details at:
- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)
## Step 5: Updating Supported Models Doc
At last, if all the steps above are completed, you should add the new model into our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.

View File

@ -0,0 +1,3 @@
# Adding a New Multi-Modal Model
**_Coming soon ..._**

View File

@ -0,0 +1,10 @@
# Modeling
This section provides tutorials on how to implement and register a new model in vllm-ascend.
:::{toctree}
:caption: Modeling
:maxdepth: 1
adding_a_new_model
adding_a_new_multimodal_model
:::

View File

@ -0,0 +1,8 @@
# Performance
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::

View File

@ -0,0 +1,187 @@
# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
**Benchmark Coverage**: We measure offline e2e latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
## 2. Install dependencies
```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```
## 3. (Optional) Prepare model weights
For faster execution, we recommend downloading the model in advance:
```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
```json
[
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "your local model path",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
```
## 4. Run benchmark script
Run benchmark script:
```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 mins, the output is as shown below:
```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
==================================================
qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================
qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================
qps inf:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
==================================================
offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds
throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result JSON files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```

View File

@ -0,0 +1,39 @@
# Profile Execute Duration
The execution duration of each stage (including pre/post-processing, model forward, etc.) usually needs to be captured during a complete inference process. Typically, this is done by using `torch.npu.synchronize()` and obtaining CPU timestamps, which increases the performance overhead of host/device synchronization.
**To reduce the performance overhead, we add this feature, using the NPU event timestamp mechanism to observe the device execution time asynchronously.**
## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages (a minimal usage sketch follows this list).
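If you want to add your own observation points, the sketch below shows how the two APIs fit together. It is only a sketch: the import path and the context-manager usage of `capture_async` are assumptions, and `prepare_inputs` is a hypothetical helper used for illustration.
```python
# Sketch only: import path and context-manager usage are assumptions.
from vllm_ascend.utils import ProfileExecuteDuration


def prepare_inputs(raw_inputs):
    # Hypothetical pre-processing; replace with your own logic.
    return raw_inputs


def execute_step(model, raw_inputs):
    with ProfileExecuteDuration().capture_async("prepare input"):
        inputs = prepare_inputs(raw_inputs)
    with ProfileExecuteDuration().capture_async("forward"):
        output = model(inputs)
    # At a safe synchronization point, collect and print all observed stages.
    durations = ProfileExecuteDuration().pop_captured_sync()
    print(durations)
    return output
```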
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
## Example Output
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```

View File

@ -1,65 +0,0 @@
# Versioning policy
Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows [PEP 440](https://peps.python.org/pep-0440/) and publishes versions matching vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
## vLLM Ascend Plugin versions
Each vllm-ascend release will be versioned: `v[major].[minor].[micro][rcN][.postN]` (such as
`v0.7.1rc1`, `v0.7.1`, `v0.7.1.post1`)
- **Final releases**: will typically be released every **3 months**, taking the vLLM upstream release plan and the Ascend software product release plan into comprehensive consideration.
- **Pre releases**: will typically be released **on demand**, ending with rcN to denote the Nth release candidate version, to support early testing by our users prior to a final release.
- **Post releases**: will typically be released **on demand** to address minor errors in a final release. Differing from the [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) suggestion, it will contain actual bug fixes, considering that the final release version should strictly match the vLLM final release version (`v[major].[minor].[micro]`). The post version has to be published as a patch version of the final release.
For example:
- `v0.7.x`: it's the first final release to match the vLLM `v0.7.x` version.
- `v0.7.1rc1`: will be the first pre version of vllm-ascend.
- `v0.7.1.post1`: will be the post release if the `v0.7.1` release has some minor errors.
## Branch policy
vllm-ascend has main branch and dev branch.
- **main**: the main branch, corresponding to the vLLM main branch; it is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branch, created with part of new releases of vLLM. For example, `v0.7.1-dev` is the dev branch for vLLM `v0.7.1` version.
Usually, a commit should first be merged ONLY into the main branch and then backported to the dev branch, to reduce maintenance costs as much as possible.
### Maintenance branch and EOL:
The branch status will be in one of the following states:
| Branch | Time frame | Summary |
|-------------------|----------------------------------|----------------------------------------------------------------------|
| Maintained | Approximately 2-3 minor versions | All bugfixes are appropriate. Releases produced, CI commitment. |
| Unmaintained | Community interest driven | All bugfixes are appropriate. No Releases produced, No CI commitment |
| End of Life (EOL) | N/A | Branch no longer accepting changes |
### Branch state
Note that vllm-ascend is only released for certain vLLM release versions rather than all of them. Hence, you might see dev branches for only some versions (such as `0.7.1-dev` / `0.7.3-dev` but no `0.7.2-dev`); this is as expected.
Usually, each minor version of vLLM (such as 0.7) corresponds to a vllm-ascend version branch that supports its latest version (for example, we plan to support version 0.7.3), as shown below:
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |
## Release Compatibility Matrix
Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| vllm-ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|--------------|--------------| --- | --- | --- |
| v0.7.1rc1 | v0.7.1 | 3.9 - 3.12 | 8.0.0 | 2.5.1 / 2.5.1.dev20250218 |
## Release cadence
### Next final release (`v0.7.x`) window
| Date | Event |
|------------|------------------------------------------------------------------|
| March 2025 | Release candidates, v0.7.3rc1 |
| March 2025 | Final release, matching the latest vLLM v0.7.x release: v0.7.1 or v0.7.3  |

View File

@ -1,64 +0,0 @@
# Versioning policy
Starting from vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) follows the [PEP 440](https://peps.python.org/pep-0440/) versioning policy and is released in lockstep with vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
## vLLM Ascend Plugin versions
The vllm-ascend version number is `v[major].[minor].[micro][rcN][.postN]` (e.g. `v0.7.1rc1`, `v0.7.1`, `v0.7.1.post1`):
- **Final releases**: usually released every 3 months; they take the upstream vLLM release plan and the Ascend software product release plan into comprehensive consideration.
- **Pre releases**: usually released on demand, ending with rcN to denote the Nth release candidate, giving users early access before the final release.
- **Post releases**: usually released on demand after a final release, mainly to fix errors in the final release. This differs from the policy suggested in [PEP-440 post releases](https://peps.python.org/pep-0440/#post-releases): it will contain actual bug fixes, considering that the final release must match the vLLM version `v[major].[minor].[micro]`. Post releases are therefore patch versions of the final release.
For example:
- `v0.7.x`: the final release matching the vLLM `v0.7.x` version.
- `v0.7.1rc1`: the first pre release (early-access version) of vllm-ascend.
- `v0.7.1.post1`: the post release of the `v0.7.1` version.
## Branch policy
vllm-ascend has a main branch and dev branches.
- **main**: the main branch, corresponding to the vLLM main branch, with continuous quality assurance through Ascend CI.
- **vX.Y.Z-dev**: development branches, created along with some new vLLM releases. For example, `v0.7.1-dev` is the vllm-ascend development branch for vLLM `v0.7.1`.
Usually, a commit should first be merged into the main branch and then backported to the dev branch, to minimize version maintenance costs.
### Branch maintenance and EOL
A branch will be in one of the following three states:
| Branch | Time frame | Note |
|-------------------|----------------------------|----------------------------------------------------------------------|
| Maintained | Approximately 2-3 minor versions | All resolved issues are merged; releases produced; CI commitment |
| Unmaintained | Community demand/interest driven | All resolved issues are merged; no releases produced; no CI commitment |
| End of Life (EOL) | N/A | Branch no longer accepts any code |
### Branch state
Note: for `*-dev` branches, vllm-ascend only creates development branches for certain vLLM versions rather than all of them. Therefore, you may find that some vLLM versions have no corresponding development branch (e.g. only `0.7.1-dev` / `0.7.3-dev` branches, and no `0.7.2-dev` branch); this is as expected.
Usually, each minor version of vLLM (e.g. 0.7) corresponds to a vllm-ascend version branch that supports its latest version (for example, we plan to support version 0.7.3), as shown below:
| Branch | Status | Note |
|------------|--------------|---------------------|
| main | Maintained | CI commitment based on the vLLM main branch |
| v0.7.3-dev | Maintained | CI commitment based on vLLM v0.7.3 |
| v0.7.1-dev | Unmaintained | Replaced by the v0.7.3-dev branch |
## Release Compatibility Matrix
The key compatibility matrix of the vLLM Ascend Plugin (`vllm-ascend`) is as follows:
| vllm-ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|--------------|---------| --- | --- | --- |
| v0.7.1rc1 | v0.7.1 | 3.9 - 3.12 | 8.0.0 | 2.5.1 / 2.5.1.dev20250218 |
## Release cadence
### Next final release (`v0.7.x`) window
| Date | Event |
|----------|-------------------------------|
| March 2025 | RC release, v0.7.3rc1 |
| March 2025 | Final release, matching the latest 0.7.3 vLLM version: v0.7.3 |

docs/source/faqs.md Normal file
View File

@ -0,0 +1,169 @@
# FAQs
## Version Specific FAQs
- [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- [[v0.9.2rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1742)
## General FAQs
### 1. What devices are currently supported?
Currently, **ONLY** the Atlas A2 series (Ascend-cann-kernels-910b) and the Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
- Atlas 300I Inference series (Atlas 300I Duo)
Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
From a technical point of view, vllm-ascend support is possible as long as torch-npu supports the device. Otherwise, the functionality has to be implemented with custom ops. You are also welcome to join us and improve the support together.
### 2. How to get our docker containers?
You can get our containers at `Quay.io`, e.g., [<u>vllm-ascend</u>](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [<u>cann</u>](https://quay.io/repository/ascend/cann?tab=tags).
If you are in China, you can use `daocloud` to accelerate your downloading:
```bash
# Replace with tag you want to pull
TAG=v0.7.3rc2
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
```
### 3. What models does vllm-ascend support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
### 4. How to get in touch with our community?
There are many channels that you can communicate with our community developers / users:
- Submit a GitHub [<u>issue</u>](https://github.com/vllm-project/vllm-ascend/issues?page=1).
- Join our [<u>weekly meeting</u>](https://docs.google.com/document/d/1hCSzRTMZhIB8vRq1_qOOjx4c9uYUxvdQvDsMV2JcSrw/edit?tab=t.0#heading=h.911qu8j8h35z) and share your ideas.
- Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your questions.
- Join our ascend channel in [<u>vLLM forums</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and publish your topics.
### 5. What features does vllm-ascend V1 support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
Basically, the reason is that the NPU environment is not configured correctly. You can:
1. try `source /usr/local/Ascend/nnal/atb/set_env.sh` to enable NNAL package.
2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable CANN package.
3. try `npu-smi info` to check whether the NPU is working.
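For convenience, the three checks above can be combined into one shell snippet (the paths assume a default CANN/NNAL installation location):
```bash
# Enable the NNAL and CANN environments, then verify that the NPU is visible
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
npu-smi info
```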
If all the above steps do not work, you can run the following Python code to check whether any error occurs:
```python
import torch
import torch_npu
import vllm
```
If the problem still persists, feel free to submit a GitHub issue.
### 7. How does vllm-ascend perform?
Currently, only some models are well optimized, such as `Qwen2.5 VL`, `Qwen3` and `DeepSeek V3`; others may not perform as well. Since 0.9.0rc2, Qwen and DeepSeek models work with graph mode to deliver good performance. In addition, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up inference as well.
### 8. How does vllm-ascend work with vLLM?
vllm-ascend is a plugin for vLLM. Basically, the version of vllm-ascend matches the version of vLLM. For example, if you use vLLM 0.7.3, you should use vllm-ascend 0.7.3 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible on each commit.
### 9. Does vllm-ascend support Prefill Disaggregation feature?
Currently, only 1P1D is supported on the V0 Engine. V1 Engine and NPND support will be stabilized and supported by vllm-ascend in the future.
### 10. Does vllm-ascend support quantization methods?
Currently, w8a8 quantization is natively supported by vllm-ascend on v0.8.4rc2 or higher. If you're using vllm 0.7.3, w8a8 quantization is supported with the integration of vllm-ascend and mindie-turbo; please use `pip install vllm-ascend[mindie-turbo]`.
### 11. How to run w8a8 DeepSeek model?
Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 12. There is no output in the log when loading models using vllm-ascend. How to solve it?
If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally. Otherwise, please file an issue.
### 13. How vllm-ascend is tested
vllm-ascend is tested by functional test, performance test and accuracy test.
- **Functional test**: we added CI that includes a portion of vLLM's native unit tests and vllm-ascend's own unit tests; on vllm-ascend's side, we test basic functionality, popular model availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e tests
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarking, which can easily be run locally; we'll publish a perf website to show the performance test results for each pull request
- **Accuracy test**: we're working on adding accuracy test to CI as well.
Finally, for each release, we'll publish performance and accuracy test reports in the future.
### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
It's usually because you have installed a dev/editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
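For example, if your editable vLLM checkout corresponds to the 0.9.2 code base, you could set (the version number here is only an illustration):
```bash
# Tell vllm-ascend which vLLM version your editable install corresponds to
export VLLM_VERSION=0.9.2
```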
### 15. How to handle Out Of Memory?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- **Adjust `--gpu-memory-utilization`**: If unspecified, it defaults to `0.9`. You can decrease this parameter to reserve more memory and reduce fragmentation risks. See more notes in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable the virtual memory feature and mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime; see more notes in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html). A combined example is shown after this list.
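Putting both knobs together, a serving command might look like the following minimal sketch (the model name and the values are illustrative, not recommendations):
```bash
# Enable expandable segments to mitigate fragmentation, then reserve more headroom
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192
```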
### 16. Failed to enable NPU graph mode when running DeepSeek?
You may encounter the following error when running DeepSeek with NPU graph mode enabled. The allowed number of queries per KV head when enabling both MLA and graph mode is limited to {32, 64, 128}; **thus DeepSeek-V2-Lite is not supported**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
If you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, num_heads / num_kv_heads falls in {32, 64, 128}.
```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
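A minimal sketch of that pre-flight check, assuming you look up the head counts from your model's `config.json` yourself (both numbers below are hypothetical placeholders):
```python
# Hypothetical head counts after the tensor parallel split -- replace with your own.
num_heads_after_tp_split = 32
num_kv_heads_after_tp_split = 1

if num_heads_after_tp_split // num_kv_heads_after_tp_split not in (32, 64, 128):
    print("NPU graph mode with MLA is not supported for this configuration")
```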
### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
You may encounter a C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to install with `python setup.py install`, or run `python setup.py clean` to clear the build cache first.
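A possible recovery sequence, run from the vllm-ascend source directory:
```bash
# Clear stale build artifacts left by the previous installation, then reinstall
python setup.py clean
python setup.py install
```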
### 18. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output certainty:
1. Sampling method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following environment variables:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=1
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package (`pip install qwen-omni-utils`) to ensure all dependencies are met;
it pulls in `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.

View File

@ -35,22 +35,37 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
:maxdepth: 1
quick_start
installation
tutorials
tutorials/index.md
faqs
:::
% What does vLLM Ascend Plugin support?
:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/suppoted_features
user_guide/supported_models
user_guide/support_matrix/index
user_guide/configuration/index
user_guide/feature_guide/index
user_guide/release_notes
:::
% How to contribute to the vLLM project
% How to contribute to the vLLM Ascend project
:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/contributing
developer_guide/versioning_policy
:::
developer_guide/contribution/index
developer_guide/feature_guide/index
developer_guide/evaluation/index
developer_guide/performance/index
developer_guide/modeling/index
:::
% How to involve vLLM Ascend
:::{toctree}
:caption: Community
:maxdepth: 1
community/governance
community/contributors
community/versioning_policy
community/user_stories/index
:::

View File

@ -5,15 +5,15 @@ This document describes how to install vllm-ascend manually.
## Requirements
- OS: Linux
- Python: 3.9 or higher
- Python: >= 3.9, < 3.12
- A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
- Software:
| Software | Supported version | Note |
| ------------ | ----------------- | ---- |
| CANN | >= 8.0.0 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1rc1 | Required for vllm-ascend |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1.post1.dev20250619 | Required for vllm-ascend, No need to install manually, it will be auto installed in below steps |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
You have 2 ways to install:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
@ -44,10 +44,12 @@ Refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/
The easiest way to prepare your software environment is using CANN image directly:
```bash
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
@ -59,40 +61,42 @@ docker run --rm \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-it quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10 bash
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
:::{dropdown} Click here to see "Install CANN manually"
:animate: fade-in-slide-down
You can also install CANN manually:
:::{note}
This guide takes aarch64 as an example. If you run on x86, you need to replace `aarch64` with `x86_64` for the package name shown below.
:::
```bash
# Create a virtual environment
python -m venv vllm-ascend-env
source vllm-ascend-env/bin/activate
# Install required python packages.
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs numpy<2.0.0 decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
# Download and install the CANN package.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-toolkit_8.0.0_linux-aarch64.run
chmod +x ./Ascend-cann-toolkit_8.0.0_linux-aarch64.run
./Ascend-cann-toolkit_8.0.0_linux-aarch64.run --full
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run
chmod +x ./Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run
./Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run --install
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-nnal_8.0.0_linux-aarch64.run
chmod +x. /Ascend-cann-nnal_8.0.0_linux-aarch64.run
./Ascend-cann-nnal_8.0.0_linux-aarch64.run --install
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run --full
source /usr/local/Ascend/ascend-toolkit/set_env.sh
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run --install
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run --install
source /usr/local/Ascend/nnal/atb/set_env.sh
```
:::
::::
::::{tab-item} Before using docker
@ -112,50 +116,63 @@ Once it's done, you can start to set up `vllm` and `vllm-ascend`.
:selected:
:sync: pip
You can install `vllm` and `vllm-ascend` from **pre-built wheel**:
First install system dependencies and configure the pip mirror:
```bash
# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
**[Optional]** Then configure the extra index of `pip` if you are working on an x86 machine or using a torch-npu dev version:
```bash
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
```
Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:
```{code-block} bash
:substitutions:
# Install vllm from source, since `pip install vllm` doesn't work on CPU currently.
# It'll be fixed in the next vllm release, e.g. v0.7.3.
git clone --branch |pip_vllm_version| https://github.com/vllm-project/vllm
# Install vllm-project/vllm from pypi
pip install vllm==|pip_vllm_version|
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index https://download.pytorch.org/whl/cpu/
# Install vllm-ascend from pypi.
pip install vllm-ascend==|pip_vllm_ascend_version| --extra-index https://download.pytorch.org/whl/cpu/
# Once the packages are installed, you need to install `torch-npu` manually,
# because that vllm-ascend relies on an unreleased version of torch-npu.
# This step will be removed in the next vllm-ascend release.
#
# Here we take python 3.10 on aarch64 as an example. Feel free to install the correct version for your environment. See:
#
# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py39.tar.gz
# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py311.tar.gz
#
mkdir pta
cd pta
wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
tar -xvf pytorch_v2.5.1_py310.tar.gz
pip install ./torch_npu-2.5.1.dev20250218-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
# Install vllm-project/vllm-ascend from pypi.
pip install vllm-ascend==|pip_vllm_ascend_version|
```
:::{dropdown} Click here to see "Build from source code"
or build from **source code**:
```{code-block} bash
:substitutions:
# Install vLLM
git clone --depth 1 --branch |vllm_version| https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index https://download.pytorch.org/whl/cpu/
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd ..
# Install vLLM Ascend
git clone --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
pip install -v -e .
cd ..
```
vllm-ascend builds custom ops by default. If you don't want to build them, set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable it.
:::
```{note}
If you are building from v0.7.3-dev and intend to use sleep mode feature, you should set `COMPILE_CUSTOM_KERNELS=1` manually.
To build custom ops, gcc/g++ newer than 8 and C++ 17 or higher are required. If you're using `pip install -e .` and encounter a torch-npu version conflict, please install with `pip install --no-build-isolation -e .` to build against the system environment.
If you encounter other problems during compilation, an unexpected compiler is probably being used; you can export `CXX_COMPILER` and `C_COMPILER` to specify your g++ and gcc locations before compiling.
```
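As a sketch of the two options mentioned in the note (the compiler paths are illustrative and depend on your system):
```bash
# Option 1: skip building custom ops entirely
COMPILE_CUSTOM_KERNELS=0 pip install -v -e .

# Option 2: build custom ops with an explicitly chosen toolchain
export C_COMPILER=/usr/bin/gcc
export CXX_COMPILER=/usr/bin/g++
pip install --no-build-isolation -v -e .
```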
::::
@ -165,14 +182,23 @@ pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
You can just pull the **prebuilt image** and run it with bash.
:::{dropdown} Click here to see "Build from Dockerfile"
or build IMAGE from **source code**:
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
```
:::
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
DEVICE=/dev/davinci7
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker pull $IMAGE
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
@ -184,17 +210,11 @@ docker run --rm \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
or build IMAGE from **source code**:
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
```
The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`), so developers can pick up code changes immediately without requiring a new installation.
::::
:::::
@ -231,7 +251,8 @@ for output in outputs:
Then run:
```bash
# export VLLM_USE_MODELSCOPE=true to speed up download if huggingface is not reachable.
# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
# to speed up download if huggingface is not reachable.
python example.py
```

View File

@ -8,15 +8,19 @@
## Setup environment using container
:::::{tab-set}
::::{tab-item} Ubuntu
```{code-block} bash
:substitutions:
# You can change the version to a suitable one based on your requirements, e.g. main
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run \
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
@ -28,23 +32,61 @@ docker run \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
apt-get update -y && apt-get install -y curl
```
::::
::::{tab-item} openEuler
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-openeuler
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
yum update -y && yum install -y curl
```
::::
:::::
The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`), so developers can pick up code changes immediately without requiring a new installation.
## Usage
There are two ways to start vLLM on Ascend NPU:
### Offline Batched Inference with vLLM
With vLLM installed, you can start generating texts for a list of input prompts (i.e. offline batch inferencing).
You can use Modelscope mirror to speed up download:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
# Use Modelscope mirror to speed up download
export VLLM_USE_MODELSCOPE=true
```
There are two ways to start vLLM on Ascend NPU:
:::::{tab-set}
::::{tab-item} Offline Batched Inference
With vLLM installed, you can start generating texts for a list of input prompts (i.e. offline batch inferencing).
Try to run below Python script directly or use `python3` shell to generate texts:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```python
from vllm import LLM, SamplingParams
@ -64,15 +106,16 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### OpenAI Completions API with vLLM
::::
::::{tab-item} OpenAI Completions API
vLLM can also be deployed as a server that implements the OpenAI API protocol. Run
the following command to start the vLLM server with the
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
# Use Modelscope mirror to speed up download
export VLLM_USE_MODELSCOPE=true
# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
```
@ -89,12 +132,14 @@ Congratulations, you have successfully started the vLLM server!
You can query the list the models:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
curl http://localhost:8000/v1/models | python3 -m json.tool
```
You can also query the model with input prompts:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@ -109,10 +154,10 @@ curl http://localhost:8000/v1/completions \
vLLM is running as a background process; you can use `kill -2 $VLLM_PID` to stop it gracefully,
which is equivalent to pressing `Ctrl-C` on a foreground vLLM process:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
ps -ef | grep "/.venv/bin/vllm serve" | grep -v grep
VLLM_PID=`ps -ef | grep "/.venv/bin/vllm serve" | grep -v grep | awk '{print $2}'`
kill -2 $VLLM_PID
VLLM_PID=$(pgrep -f "vllm serve")
kill -2 "$VLLM_PID"
```
You will see output as below:
@ -124,3 +169,5 @@ INFO: Application shutdown complete.
```
Finally, you can exit container by using `ctrl-D`.
::::
:::::

View File

@ -1,311 +0,0 @@
# Tutorials
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| bash
```
Setup environment variables:
```bash
# Use Modelscope mirror to speed up model download
export VLLM_USE_MODELSCOPE=True
# To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Run the following script to execute offline inference on a single NPU:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=26240)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
### Online Serving on Single NPU
Run docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
:::{note}
Add the `--max_model_len` option to avoid the ValueError caused by the Qwen2.5-7B model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This limit differs between NPU series based on the HBM size; please adjust the value to one suitable for your NPU series.
:::
If your service start successfully, you can see the info shown below:
```bash
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"text":" here. Its not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
```
## Run vllm-ascend on Multi-NPU
### Distributed Inference on Multi-NPU
Run docker container:
```{code-block} bash
:substitutions:
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| bash
```
Setup environment variables:
```bash
# Use Modelscope mirror to speed up model download
export VLLM_USE_MODELSCOPE=True
# To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=2,
distributed_executor_backend="mp",
max_model_len=26240)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
## Online Serving on Multi Machine
Run docker container on each machine:
```shell
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2\
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:v0.7.1rc1 bash
```
Choose one machine as head node, the other are worker nodes, then start ray on each machine:
:::{note}
Check out your `nic_name` with the command `ip addr`.
:::
```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head --num-gpus=8
# Worker node
export HCCL_IF_IP={local_ip}
export ASCEND_PROCESS_LOG_PATH={plog_save_path}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
Start the vLLM server on head node:
```shell
export VLLM_HOST_IP={head_node_ip}
export HCCL_CONNECT_TIMEOUT=120
export ASCEND_PROCESS_LOG_PATH={plog_save_path}
export HCCL_IF_IP={head_node_ip}
if [ -d "{plog_save_path}" ]; then
rm -rf {plog_save_path}
echo ">>> remove {plog_save_path}"
fi
LOG_FILE="multinode_$(date +%Y%m%d_%H%M).log"
VLLM_TORCH_PROFILER_DIR=./vllm_profile
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--enforce-eager \
--max-model-len {max_model_len} \
--distributed_executor_backend "ray" \
--tensor-parallel-size 16 \
--disable-log-requests \
--disable-log-stats \
--disable-frontend-multiprocessing \
--port {port_num} \
```
Once your server is started, you can query the model with input prompts:
```shell
curl -X POST http://127.0.0.1:{port_num}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
"prompt": "The future of AI is",
"max_tokens": 24
}'
```
If you query the server successfully, you can see the info shown below (client):
```
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-19 17:37:35 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```

View File

@ -0,0 +1,16 @@
# Tutorials
:::{toctree}
:caption: Deployment
:maxdepth: 1
single_npu
single_npu_multimodal
single_npu_audio
single_npu_qwen3_embedding
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
multi_node
:::

View File

@ -0,0 +1,198 @@
# Multi-Node-DP (DeepSeek)
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
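As a quick sanity check for the deployment used later in this guide (two nodes with 8 NPUs each), the parallel sizes must multiply out to the total device count; a minimal sketch of that arithmetic:
```python
# Illustrative arithmetic for the DP x TP layout used later in this guide
nodes, npus_per_node = 2, 8
data_parallel_size, tensor_parallel_size = 4, 4

total_npus = nodes * npus_per_node
assert data_parallel_size * tensor_parallel_size == total_npus  # 4 * 4 == 16
# Each node hosts data_parallel_size // nodes = 2 local DP ranks,
# which matches --data-parallel-size-local 2 in the scripts below.
print(total_npus, data_parallel_size // nodes)
```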
## Verify Multi-Node Communication Environment
### Physical Layer Requirements:
- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.
### Verification Process:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
:::{note}
Before launching the inference server, ensure that the environment variables required for multi-node communication are set.
:::
Run the following scripts on two nodes respectively
**node0**
```shell
#!/bin/sh
# These values can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
# If you want to run the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--quantization ascend \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
```shell
#!/bin/sh
nic_name="xxx"
local_ip="xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address { node0 ip } \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The deployment view looks like this:
![alt text](../assets/multi_node_dp.png)
Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip:8004 }/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/ds_v3",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```

View File

@ -0,0 +1,107 @@
# Multi-NPU (QwQ 32B)
## Run vllm-ascend on Multi-NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/QwQ-32B",
"prompt": "QwQ-32B是什么",
"max_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.6"
}'
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/QwQ-32B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

Some files were not shown because too many files have changed in this diff.