[Doc] Add a doc for qwen omni (#1867)

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>

### What this PR does / why we need it?
Add FAQ note for qwen omni
Fixes https://github.com/vllm-project/vllm-ascend/issues/1760 issue1

- vLLM version: v0.9.2
- vLLM main: b9a21e9173

[CI] Fix broken CI (#1889)
2025-07-20 09:05:41 +08:00 · 2025-07-20 02:11:57 +08:00 · 2025-07-19 11:39:48 +08:00 · 2025-07-19 11:37:03 +08:00 · 2025-07-19 09:42:32 +08:00 · 2025-07-18 23:09:54 +08:00
326 changed files with 46100 additions and 3774 deletions
--- a/.github/Dockerfile.buildwheel
+++ b/.github/Dockerfile.buildwheel
@@ -0,0 +1,45 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+ARG PY_VERSION=3.10
+FROM quay.io/ascend/manylinux:8.0.0-910b-manylinux_2_28-py${PY_VERSION}
+
+ARG COMPILE_CUSTOM_KERNELS=1
+
+# Define environments
+ENV DEBIAN_FRONTEND=noninteractive
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+RUN yum update -y && \
+    yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
+    rm -rf /var/cache/yum
+
+WORKDIR /workspace
+
+COPY . /workspace/vllm-ascend/
+
+# Install req
+RUN python3 -m pip install -r vllm-ascend/requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip install twine
+
+# Install vllm-ascend
+RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    cd vllm-ascend && \
+    python3 setup.py bdist_wheel && \
+    ls -l dist 
+
+CMD ["/bin/bash"]
--- a/.github/ISSUE_TEMPLATE/110-user-story.yml
+++ b/.github/ISSUE_TEMPLATE/110-user-story.yml
@@ -0,0 +1,37 @@
+name: 📚 User Story
+description: Apply for a user story to be displayed on https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html
+title: "[User Story]: "
+labels: ["user-story"]
+
+body:
+- type: textarea
+  attributes:
+    label: 📚 Title
+    description: >
+      A clear title about what your user story is about.
+  validations:
+    required: true
+- type: textarea
+  attributes:
+    label: About / Introduction
+    description: >
+      A brief introduction about the background of your use case, like your scenario, hardware size etc.
+- type: textarea
+  attributes:
+    label: Business Challenges
+    description: >
+      Tell us what kind of challenges you faced in this user story.
+- type: textarea
+  attributes:
+    label: Solving challenges with vLLM Ascend and benefits
+    description: >
+      Tell us how vLLM Ascend helped you overcome the challenges, including details such as how you use it, which version you used, and hardware info. Also tell us what benefits you get from using vLLM Ascend.
+- type: textarea
+  attributes:
+    label: Extra Info
+    description: >
+      Any extra information you want to include in this story.
+- type: markdown
+  attributes:
+    value: >
+      Thanks for contributing 🎉!
--- a/.github/ISSUE_TEMPLATE/400-bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/400-bug-report.yml
@@ -14,9 +14,7 @@ body:
    description: |
      Please run the following and paste the output below.
      ```sh
-      npu-smi info
-      cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
-      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
+      wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
--- a/.github/ISSUE_TEMPLATE/900-release-checklist.yml
+++ b/.github/ISSUE_TEMPLATE/900-release-checklist.yml
@@ -0,0 +1,100 @@
+name: Release Checklist
+description: Generate a release checklist issue when preparing a new release. (Used by the release team)
+title: "[Release]: Release checklist for v"
+
+body:
+- type: textarea
+  attributes:
+    description: >
+      Brief info for the new release.
+    label: Release Checklist
+    value: >
+      **Release Version**: 
+
+      **Release Branch**: 
+
+      **Release Date**: 
+
+      **Release Manager**: 
+- type: textarea
+  attributes:
+    description: >
+      Release notes.
+    label: Prepare Release Note
+    value: >
+      - [ ] Create a new issue for release feedback
+
+      - [ ] Write the release note PR.
+
+        - [ ] Update the feedback issue link in docs/source/faqs.md
+
+        - [ ] Add release note to docs/source/user_guide/release_notes.md
+
+        - [ ] Update version info in docs/source/community/versioning_policy.md
+
+        - [ ] Update contributor info in docs/source/community/contributors.md
+
+        - [ ] Update package version in docs/conf.py
+- type: textarea
+  attributes:
+    description: >
+      Make sure the code is merged.
+    label: PRs to Merge
+    value: >
+      - [ ] PR link1
+
+      - [ ] PR link2
+
+      - [ ] ...
+- type: textarea
+  attributes:
+    description: >
+      Make sure the new Feature/Function is tested
+    label: Functional Test
+    value: >
+      - [ ] Feature1
+
+      - [ ] Bug1
+
+      - [ ] ...
+- type: textarea
+  attributes:
+    description: >
+      Make sure the doc is updated.
+    label: Doc Test
+    value: >
+      - [ ] Tutorial is updated.
+
+      - [ ] User Guide is updated.
+
+      - [ ] Developer Guide is updated.
+- type: textarea
+  attributes:
+    description: >
+      Make sure the artifacts are ready.
+    label: Prepare Artifacts
+    value: >
+      - [ ] Docker image is ready.
+
+      - [ ] Wheel package is ready.
+- type: textarea
+  attributes:
+    description: >
+      Start to release.
+    label: Release Step
+    value: >
+      - [ ] Release note PR is merged.
+
+      - [ ] Post the release on GitHub release page.
+
+      - [ ] Generate official doc page on https://app.readthedocs.org/dashboard/
+
+      - [ ] Wait for the wheel package to be available on https://pypi.org/project/vllm-ascend
+
+      - [ ] Wait for the docker image to be available on https://quay.io/ascend/vllm-ascend
+
+      - [ ] Upload 310p wheel to GitHub release page
+
+      - [ ] Broadcast the release news (by message, blog, etc.)
+
+      - [ ] Close this issue
--- a/.github/actionlint.yaml
+++ b/.github/actionlint.yaml
@@ -0,0 +1,8 @@
+self-hosted-runner:
+  # Labels of self-hosted runner in array of strings.
+  labels:
+    - linux-arm64-npu-1
+    - linux-arm64-npu-2
+    - linux-arm64-npu-4
+    - linux-arm64-npu-static-8
+    - ubuntu-24.04-arm
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -0,0 +1,10 @@
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      # Check for updates to GitHub Actions every week
+      interval: "weekly"
+    open-pull-requests-limit: 2
+    reviewers:
+      - "Yikun"
--- a/.github/format_pr_body.sh
+++ b/.github/format_pr_body.sh
@@ -0,0 +1,59 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+# Adapted from vllm/.github/scripts/cleanup_pr_body.sh
+
+#!/bin/bash
+
+set -eux
+
+# ensure 3 arguments are passed
+if [ "$#" -ne 3 ]; then
+    echo "Usage: $0 <pr_number> <vllm_version> <vllm_commit>"
+    exit 1
+fi
+
+PR_NUMBER=$1
+VLLM_VERSION=$2
+VLLM_COMMIT=$3
+OLD=/tmp/orig_pr_body.txt
+NEW=/tmp/new_pr_body.txt
+FINAL=/tmp/final_pr_body.txt
+
+gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
+cp "${OLD}" "${NEW}"
+
+# Remove notes in pr description and add vLLM version and commit
+sed -i '/<!--/,/-->/d' "${NEW}"
+sed -i '/- vLLM .*$/d' "${NEW}"
+{
+    echo ""
+    echo "- vLLM version: $VLLM_VERSION"
+    echo "- vLLM main: $VLLM_COMMIT"
+} >> "${NEW}"
+
+# Remove redundant empty lines
+uniq "${NEW}" > "${FINAL}"
+
+# Run this only if ${NEW} is different than ${OLD}
+if ! cmp -s "${OLD}" "${FINAL}"; then
+    echo
+    echo "Updating PR body:"
+    echo
+    cat "${FINAL}"
+    gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
+else
+    echo "No changes needed"
+fi
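
The text transformation in `format_pr_body.sh` can be sketched in Python to make its behavior easier to reason about. This is only an illustration under stated assumptions: the function name `format_pr_body` is hypothetical, the `gh` calls are left out, and the loop imitates the `sed` range rule (the closing `-->` is only looked for on lines after the one that opened the range) and the adjacent-duplicate behavior of `uniq`.

```python
import re


def format_pr_body(body: str, vllm_version: str, vllm_commit: str) -> str:
    """Hypothetical Python equivalent of the sed/uniq pipeline in the script."""
    kept = []
    in_comment = False
    for line in body.splitlines():
        if in_comment:
            # Like a sed range, the end pattern is only tested on lines
            # that follow the line which opened the range.
            if "-->" in line:
                in_comment = False
            continue
        if "<!--" in line:
            in_comment = True
            continue
        # Drop stale "- vLLM version: ..." / "- vLLM main: ..." lines.
        if re.search(r"- vLLM .*$", line):
            continue
        kept.append(line)

    # Append the fresh version footer, as the { echo ...; } block does.
    kept += ["", f"- vLLM version: {vllm_version}", f"- vLLM main: {vllm_commit}"]

    # uniq: collapse runs of identical adjacent lines (not only empty ones).
    deduped = []
    for line in kept:
        if not deduped or deduped[-1] != line:
            deduped.append(line)
    return "\n".join(deduped) + "\n"
```

One consequence visible in the sketch: a single-line comment such as `<!-- note -->` opens a range that stays open until a later line contains `-->`, so the lines in between are dropped as well.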
--- a/.github/labeler.yml
+++ b/.github/labeler.yml
@@ -0,0 +1,38 @@
+---
+documentation:
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'docs/**'
+          - '**/*.md'
+
+ci/build:
+  - changed-files:
+      - any-glob-to-any-file:
+          - '.github/actions/*.yml'
+          - '.github/workflows/*.yml'
+
+'module:tests':
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'tests/**'
+
+'module:tools':
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'tools/**'
+
+'module:ops':
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'vllm_ascend/ops/**'
+
+'module:quantization':
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'vllm_ascend/quantization/**'
+
+'module:core':
+  - changed-files:
+      - any-glob-to-any-file:
+          - 'vllm_ascend/*.py'
+
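
To check which labels a changed path would receive under the rules above, the glob matching can be approximated in Python. This is a rough sketch, not the labeler's own matcher: `glob_to_regex` and `labels_for` are hypothetical helpers, and the translation only covers the `*`, `**`, and `**/` forms that appear in this file.

```python
import re

# Patterns copied from the labeler.yml rules above.
LABEL_GLOBS = {
    "documentation": ["docs/**", "**/*.md"],
    "ci/build": [".github/actions/*.yml", ".github/workflows/*.yml"],
    "module:tests": ["tests/**"],
    "module:tools": ["tools/**"],
    "module:ops": ["vllm_ascend/ops/**"],
    "module:quantization": ["vllm_ascend/quantization/**"],
    "module:core": ["vllm_ascend/*.py"],
}


def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Approximate translation: '*' stays within one path segment, '**' crosses segments."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def labels_for(changed_files):
    """Return labels (in config order) where any glob matches any changed file."""
    result = []
    for label, globs in LABEL_GLOBS.items():
        regexes = [glob_to_regex(g) for g in globs]
        if any(r.fullmatch(path) for r in regexes for path in changed_files):
            result.append(label)
    return result
```

Under this approximation, `vllm_ascend/*.py` only covers top-level modules, and a workflow file with a `.yaml` extension does not match the `ci/build` patterns, which list `*.yml` only.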
--- a/.github/workflows/accuracy_test.yaml
+++ b/.github/workflows/accuracy_test.yaml
@@ -0,0 +1,405 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+# This test will be triggered:
+# 1. PR labeled with: '*accuracy-test' (ONLY 1 label valid) & 'ready-for-test'
+# 2. workflow_dispatch with models input
+# See the detailed rules in the strategy.matrix note
+name: Benchmarks / accuracy
+
+on:
+  schedule:
+    # Runs every 6 hours
+    - cron:  '0 */6 * * *'
+  pull_request:
+    types: [ labeled ]
+  workflow_dispatch:
+    inputs:
+      vllm-version:
+        description: 'vllm version:'
+        required: true
+        type: choice
+        # Please also update this when bumping the matched vLLM version
+        # Current supported vLLM versions
+        options:
+          - main
+          - v0.9.2
+          - v0.9.1
+          - v0.7.3
+      vllm-ascend-version:
+        description: 'vllm-ascend version:'
+        required: true
+        type: choice
+        options:
+          - main
+          - v0.9.1-dev
+          - v0.7.3-dev
+      models:
+        description: 'model:'
+        required: true
+        type: choice
+        options:
+          - all
+          - Qwen/Qwen2.5-VL-7B-Instruct
+          - Qwen/Qwen3-8B-Base
+          - Qwen/Qwen3-30B-A3B
+        default: 'all'
+
+# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
+# declared as "shell: bash -el {0}" on steps that need to be properly activated.
+# It's used to activate ascend-toolkit environment variables.
+defaults:
+  run:
+    shell: bash -el {0}
+
+# only cancel in-progress runs of the same workflow
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  accuracy_tests:
+    # The test is triggered by a PR labeled '*accuracy-test' and 'ready-for-test', by workflow_dispatch, or by schedule
+    if:  >-
+      ${{
+      (contains(github.event.pull_request.labels.*.name, 'accuracy-test') ||
+      contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') ||
+      contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') ||
+      contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test')) &&
+      contains(github.event.pull_request.labels.*.name, 'ready-for-test') ||
+      github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
+      }}
+    runs-on: >-
+      ${{
+          (matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-arm64-npu-4') ||
+          'linux-arm64-npu-2'
+      }}
+    strategy:
+      matrix:
+        # the accuracy test will run:
+        # 1. workflow_dispatch with models input
+        #   - all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
+        #   - specified but not all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
+        # 2. PR labeled with "*-accuracy-test"
+        #   - accuracy-test: Qwen/Qwen3-8B-Base, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-30B-A3B
+        #   - dense-accuracy-test: Qwen/Qwen3-8B-Base
+        #   - vl-accuracy-test: Qwen/Qwen2.5-VL-7B-Instruct
+        #   - moe-accuracy-test: Qwen/Qwen3-30B-A3B
+        model_name: ${{ fromJSON(
+          (github.event_name == 'schedule' &&
+            '["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
+          (github.event.inputs.models == 'all' &&
+            '["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
+          (github.event.inputs.models == 'Qwen/Qwen3-30B-A3B' &&
+            '["Qwen/Qwen3-30B-A3B"]') ||
+          (github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
+            '["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
+          (github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
+            '["Qwen/Qwen3-8B-Base"]') ||
+          contains(github.event.pull_request.labels.*.name, 'accuracy-test') &&
+            '["Qwen/Qwen3-8B-Base","Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen3-30B-A3B"]' ||
+          contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test') &&
+            '["Qwen/Qwen3-8B-Base"]' ||
+          contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') &&
+            '["Qwen/Qwen2.5-VL-7B-Instruct"]' ||
+          contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') &&
+            '["Qwen/Qwen3-30B-A3B"]'
+         ) }}
+
+      fail-fast: false
+    name: ${{ matrix.model_name }} accuracy
+    container:
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      env:
+        DATASET_SOURCE: ModelScope
+        VLLM_USE_MODELSCOPE: True
+        USE_MODELSCOPE_HUB: 1
+        # 1. If version specified (workflow_dispatch), do specified branch accuracy test
+        # 2. If no version (labeled PR), do accuracy test by default ref:
+        # The branch, tag or SHA to checkout. When checking out the repository that
+        # triggered a workflow, this defaults to the reference or SHA for that event.
+        # Otherwise, uses the default branch.
+        GHA_VLLM_ASCEND_VERSION: ${{ github.event.inputs.vllm-ascend-version }}
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Check npu and CANN info
+        run: |
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+      - name: Config mirrors
+        run: |
+          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
+          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
+          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
+          apt-get update -y
+          apt install git -y
+
+      - name: Install system dependencies
+        run: |
+          apt-get -y install `cat packages.txt`
+          apt-get -y install gcc g++ cmake libnuma-dev
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          path: ./vllm-empty
+          # Please also update this when bumping the matched vLLM version
+          ref: ${{ github.event.inputs.vllm-version || 'v0.9.2' }}
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: VLLM_TARGET_DEVICE=empty pip install -e .
+
+      - name: Resolve vllm-ascend version
+        run: |
+          VERSION_INPUT="${{ github.event.inputs.vllm-ascend-version }}"
+          
+          if [[ "$VERSION_INPUT" == "main" ]]; then
+            TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
+            LATEST_TAG=$(echo "$TAGS" | head -n1)
+            if [[ -z "$LATEST_TAG" ]]; then
+              RESOLVED_VERSION="main"
+            else
+              RESOLVED_VERSION="$LATEST_TAG"
+            fi
+          else
+            RESOLVED_VERSION="$VERSION_INPUT"
+          fi
+          echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm-ascend
+          path: ./vllm-ascend
+          ref: ${{ env.GHA_VLLM_ASCEND_VERSION }}
+
+      - name: Install vllm-project/vllm-ascend
+        working-directory: ./vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        run: |
+          pip install -r requirements-dev.txt
+          pip install -v -e . 
+            
+      - name: Get vLLM commit hash and URL
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
+          echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
+
+      - name: Get vLLM-Ascend commit hash and URL
+        working-directory: ./vllm-ascend
+        run: |
+          VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
+          echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
+
+      - name: Print resolved hashes
+        run: |
+          echo "vLLM       : ${{ env.VLLM_COMMIT }}"
+          echo "vLLM-Ascend: ${{ env.VLLM_ASCEND_COMMIT }}"
+
+      - name: Install lm-eval
+        run: |
+            pip install lm-eval==0.4.8
+
+      - name: Collect version info
+        run: |
+          for dir in /usr/local/Ascend/ascend-toolkit/*; do
+            dname=$(basename "$dir")
+            if [ "$dname" != "latest" ]; then
+              TOOLKIT_DIR="$dname"
+              break
+            fi
+          done
+          INFO_FILE="/usr/local/Ascend/ascend-toolkit/${TOOLKIT_DIR}/$(uname -i)-linux/ascend_toolkit_install.info"
+          GHA_CANN_VERSION=$(grep "version=" "$INFO_FILE" \
+                           | head -n1 \
+                           | cut -d'=' -f2 \
+                           | tr -d '"')
+          {
+            echo "GHA_CANN_VERSION=$GHA_CANN_VERSION"
+            pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
+            pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
+            pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
+          } >> "$GITHUB_ENV"
+      
+      - name: Print versions
+        run: |
+          echo "CANN: ${{ env.GHA_CANN_VERSION }}"
+          echo "Torch NPU: ${{ env.GHA_TORCH_NPU_VERSION }}"
+          echo "Torch: ${{ env.GHA_TORCH_VERSION }}"
+          echo "vLLM: ${{ env.GHA_VLLM_VERSION }}"
+          echo "vLLM Ascend: ${{ env.GHA_VLLM_ASCEND_VERSION }}"
+
+      - name: Run Accuracy Test
+        id: report
+        working-directory: ./benchmarks
+        env:
+          PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
+        run: |
+          model_base_name=$(basename ${{ matrix.model_name }})
+          markdown_name="${model_base_name}"
+          echo "markdown_name=$markdown_name"
+          echo "markdown_name=$markdown_name" >> $GITHUB_OUTPUT
+          mkdir -p ./accuracy
+
+          python ./scripts/run_accuracy.py \
+            --model "${{ matrix.model_name }}" \
+            --output "./accuracy/${markdown_name}.md" \
+            --vllm_ascend_version "${{ env.GHA_VLLM_ASCEND_VERSION || github.ref }}" \
+            --cann_version "${{ env.GHA_CANN_VERSION }}" \
+            --torch_npu_version "${{ env.GHA_TORCH_NPU_VERSION }}" \
+            --torch_version "${{ env.GHA_TORCH_VERSION }}" \
+            --vllm_version "${{ env.GHA_VLLM_VERSION }}" \
+            --vllm_commit "${{ env.VLLM_COMMIT }}" \
+            --vllm_ascend_commit "${{ env.VLLM_ASCEND_COMMIT }}" \
+
+      - name: Generate step summary
+        if: ${{ always() }}
+        run: |
+          cat ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md >> $GITHUB_STEP_SUMMARY
+
+      - name: Sanitize version string for artifact naming
+        run: |
+          SAFE_VLLM_ASCEND_VERSION="${GHA_VLLM_ASCEND_VERSION//\//-}"
+          echo "SAFE_VLLM_ASCEND_VERSION=$SAFE_VLLM_ASCEND_VERSION" >> "$GITHUB_ENV"
+
+      - name: Check report for failure
+        id: check_report
+        run: |
+          REPORT_PATH="./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md"
+          echo "Scanning $REPORT_PATH for ❌ …"
+          if grep -q '❌' "$REPORT_PATH"; then
+            echo "contains_fail=true" >> $GITHUB_OUTPUT
+          else
+            echo "contains_fail=false" >> $GITHUB_OUTPUT
+          fi
+
+      - name: Upload Report 
+        if: ${{ github.event_name == 'workflow_dispatch' && steps.check_report.outputs.contains_fail == 'false' }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: "report-${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}"
+          path: ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md
+          if-no-files-found: warn
+          retention-days: 90
+          overwrite: true
+
+  create_pr:
+    runs-on: ubuntu-latest
+    needs: accuracy_tests
+    if: ${{ github.event_name == 'workflow_dispatch' }}
+    env:
+      UPSTREAM_REPO: vllm-project/vllm-ascend
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-ascend-ci/vllm-ascend
+          token: ${{ secrets.PAT_TOKEN }}
+          ref: main
+
+      - name: Add upstream remote
+        run: |
+          git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
+          git fetch upstream
+          git remote -v
+
+      - name: Set Git user info dynamically
+        run: |
+          git config user.name "${{ github.actor }}"
+          git config user.email "${{ github.actor }}@users.noreply.github.com"
+
+      - name: Create or switch to branch
+        run: |
+          TIMESTAMP=$(date +%Y%m%d%H%M%S)
+          BRANCH_NAME="auto-pr/accuracy-report-${TIMESTAMP}"
+          echo "BRANCH_NAME=${BRANCH_NAME}" >> $GITHUB_ENV
+          git checkout -B "${BRANCH_NAME}" upstream/${{ github.event.inputs.vllm-ascend-version }}
+
+      - name: Download only current run reports
+        uses: actions/download-artifact@v4
+        with:
+          path: ./docs/source/developer_guide/evaluation/accuracy_report
+          pattern: report-*
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          run-id: ${{ github.run_id }}
+
+      - name: Delete old report
+        run: |
+          find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
+          find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
+          find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
+
+      - name: Update accuracy_report/index.md
+        run: |
+          REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
+          INDEX_MD="$REPORT_DIR/index.md"
+          {
+            echo "# Accuracy Report"
+            echo ""
+            echo ":::{toctree}"
+            echo ":caption: Accuracy Report"
+            echo ":maxdepth: 1"
+            
+            for report in "$REPORT_DIR"/*.md; do
+              filename="$(basename "$report" .md)"
+              if [ "$filename" != "index" ]; then
+                echo "$filename"
+              fi
+            done
+            echo ":::"
+          } > "$INDEX_MD"
+
+      - name: push accuracy report
+        env:
+          GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
+        run: |
+          git add ./docs/source/developer_guide/evaluation/accuracy_report/*.md
+          git commit -s -m "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}"
+          git push -f origin "${{ env.BRANCH_NAME }}"
+
+      - name: Create PR in upstream via API
+        uses: actions/github-script@v7
+        with:
+          github-token: ${{ secrets.PAT_TOKEN }}
+          script: |
+            const pr = await github.rest.pulls.create({
+              owner: 'vllm-project',
+              repo: 'vllm-ascend',
+              head: `vllm-ascend-ci:${{ env.BRANCH_NAME }}`,
+              base: '${{ github.event.inputs.vllm-ascend-version }}',
+              title: `[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}`,
+              body: `The accuracy results running on NPU Atlas A2 have changed, updating reports for:
+            ${{ 
+              github.event.inputs.models == 'all' 
+                && 'All models (Qwen/Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)' 
+                || github.event.inputs.models 
+            }}
+            
+            - [Workflow run][1]
+            
+            [1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`
+            });
+            core.info(`Created PR #${pr.data.number}`);
+ 
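
The `strategy.matrix` expression and the `runs-on` expression in `accuracy_test.yaml` are easier to follow when written out as ordinary branching. The sketch below is an illustration only: `select_models` and `runner_for` are hypothetical names, and returning an empty list when nothing matches is an assumption, since the workflow expression has no explicit fallback.

```python
ALL_MODELS = [
    "Qwen/Qwen3-30B-A3B",
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Qwen/Qwen3-8B-Base",
]

# PR label -> models, checked in the same order as the workflow expression.
LABEL_MODELS = [
    ("accuracy-test", ["Qwen/Qwen3-8B-Base", "Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen3-30B-A3B"]),
    ("dense-accuracy-test", ["Qwen/Qwen3-8B-Base"]),
    ("vl-accuracy-test", ["Qwen/Qwen2.5-VL-7B-Instruct"]),
    ("moe-accuracy-test", ["Qwen/Qwen3-30B-A3B"]),
]


def select_models(event_name, models_input=None, labels=()):
    """Mirror the model_name matrix: schedule, then dispatch input, then PR labels."""
    if event_name == "schedule":
        return list(ALL_MODELS)
    if models_input == "all":
        return list(ALL_MODELS)
    if models_input in ALL_MODELS:
        return [models_input]
    for label, models in LABEL_MODELS:
        if label in labels:
            return list(models)
    return []  # assumption: nothing selected when no rule applies


def runner_for(model_name):
    """Mirror runs-on: the MoE model gets the 4-NPU runner, the others the 2-NPU runner."""
    if model_name == "Qwen/Qwen3-30B-A3B":
        return "linux-arm64-npu-4"
    return "linux-arm64-npu-2"
```

Because the label rules are checked in order, a PR carrying both `accuracy-test` and a narrower label such as `moe-accuracy-test` would run all three models, which is presumably why the header comment says only one such label is valid.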
--- a/.github/workflows/actionlint.yml
+++ b/.github/workflows/actionlint.yml
@@ -1,59 +0,0 @@
-#
-# Adapted from vllm-project/vllm/blob/main/.github
-# Copyright 2023 The vLLM team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-name: Lint GitHub Actions workflows
-on:
-  push:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - '.github/workflows/*.ya?ml'
-      - '.github/workflows/actionlint.*'
-      - '.github/workflows/matchers/actionlint.json'
-  pull_request:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - '.github/workflows/*.ya?ml'
-      - '.github/workflows/actionlint.*'
-      - '.github/workflows/matchers/actionlint.json'
-
-env:
-  LC_ALL: en_US.UTF-8
-
-defaults:
-  run:
-    shell: bash
-
-permissions:
-  contents: read
-
-jobs:
-  actionlint:
-    runs-on: ubuntu-latest
-    steps:
-      - name: "Checkout"
-        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-        with:
-          fetch-depth: 0
-
-      - name: "Run actionlint"
-        run: |
-          echo "::add-matcher::.github/workflows/matchers/actionlint.json"
-          tools/actionlint.sh -color
--- a/.github/workflows/format_pr_body.yaml
+++ b/.github/workflows/format_pr_body.yaml
@@ -0,0 +1,63 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+name: format / pr body
+
+on:
+  # Update the PR body when a PR is opened or new commits are pushed
+  pull_request_target:
+    types: [opened, synchronize]
+    branches:
+      - 'main'
+
+permissions:
+  pull-requests: write
+
+jobs:
+  update-description:
+    name: update vLLM version
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          path: ./vllm-empty
+
+      - name: Get vLLM version
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_COMMIT=$(git rev-parse HEAD)
+          echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV
+
+      - name: Checkout repository
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+
+      - name: Set up Python
+        uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
+
+      - name: Get vLLM release version
+        run: |
+          VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
+          echo "VLLM_VERSION=$VLLM_VERSION" >> $GITHUB_ENV
+
+      - name: Update PR description
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          bash .github/format_pr_body.sh "${{ github.event.number }}" "${{ env.VLLM_VERSION }}" "${{ env.VLLM_COMMIT }}"
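
The `Get vLLM release version` step pipes the output of `docs/source/conf.py` through `jq .ci_vllm_version`, which implies that the script prints a JSON object when run directly. A minimal Python sketch of the same extraction follows; the function name `extract_ci_vllm_version` is hypothetical, and the exact shape of the JSON beyond the `ci_vllm_version` key is an assumption.

```python
import json


def extract_ci_vllm_version(conf_stdout: str) -> str:
    """Rough equivalent of `jq .ci_vllm_version | tr -d '"'` on conf.py output."""
    data = json.loads(conf_stdout)
    # Unlike jq, which prints null for a missing key, this raises KeyError,
    # so a misconfigured conf.py fails loudly instead of yielding "null".
    return str(data["ci_vllm_version"])
```

Failing on a missing key is a deliberate difference from the shell pipeline, where a missing key would write the literal string `null` into the PR body.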
--- a/.github/workflows/image_310p_openeuler.yml
+++ b/.github/workflows/image_310p_openeuler.yml
@@ -0,0 +1,117 @@
+name: 'image / openEuler / 310p'
+# This is a docker build check and publish job:
+# 1. PR Triggered docker image build check
+#   - is for image build check
+#   - Enable on main/*-dev branch
+#   - push: ${{ github.event_name != 'pull_request' }} ==> false
+# 2. branches push trigger image publish
+#   - is for branch/dev/nightly image
+#   - commits are merged into main/*-dev  ==> vllm-ascend:main-310p-openeuler / vllm-ascend:*-dev-310p-openeuler
+# 3. tags push trigger image publish
+#   - is for final release image
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3-310p-openeuler / vllm-ascend:v1.2.3rc1-310p-openeuler
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/image_310p_openeuler.yml'
+      - 'Dockerfile.310p.openEuler'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    # Publish image when tagging; the Dockerfile in the tag will be built as the tag image
+    branches:
+      - 'main'
+      - '*-dev'
+    tags:
+      - 'v*'
+    paths:
+      - '.github/workflows/image_310p_openeuler.yml'
+      - 'Dockerfile.310p.openEuler'
+      - 'vllm_ascend/**'
+
+jobs:
+  build:
+    name: vllm-ascend image build
+    runs-on: >-
+      ${{
+          github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+          'ubuntu-latest' ||
+          'ubuntu-24.04-arm'
+      }}
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Print
+      run: |
+        lscpu
+
+    - name: Docker meta
+      id: meta
+      uses: docker/metadata-action@v5
+      with:
+        # TODO(yikun): add more hub image and a note on release policy for container image
+        images: |
+          quay.io/ascend/vllm-ascend
+        # Note for test case
+        # https://github.com/marketplace/actions/docker-metadata-action#typeref
+        # 1. the branch job publishes on every main/*-dev branch commit
+        # 2. main and dev pull_request runs are build-only, so the tag pr-N-310p-openeuler is fine
+        # 3. only tags that match pep440 will be published:
+        #    - v0.7.1 --> v0.7.1-310p-openeuler
+        #    - pre/post/dev: v0.7.1rc1-310p-openeuler/v0.7.1rc1.dev1-310p-openeuler/v0.7.1.post1-310p-openeuler, no latest
+        #      which follows the vLLM rule of prefixing versions with v
+        # TODO(yikun): the post release might be considered as latest release
+        tags: |
+          type=ref,event=branch,suffix=-310p-openeuler
+          type=ref,event=pr,suffix=-310p-openeuler
+          type=pep440,pattern={{raw}},suffix=-310p-openeuler
+        flavor:
+          latest=false
+
+    - name: Free up disk space
+      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
+      with:
+        tool-cache: true
+        docker-images: false
+
+    - name: Build - Set up QEMU
+      uses: docker/setup-qemu-action@v3
+
+    - name: Build - Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Publish - Login to Quay Container Registry
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+      uses: docker/login-action@v3
+      with:
+        registry: quay.io
+        username: ${{ vars.QUAY_USERNAME }}
+        password: ${{ secrets.QUAY_PASSWORD }}
+
+    - name: Build and push 310p
+      uses: docker/build-push-action@v6
+      with:
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/arm64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        # only trigger when tag, branch/main push
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+        labels: ${{ steps.meta.outputs.labels }}
+        tags: ${{ steps.meta.outputs.tags }}
+        file: Dockerfile.310p.openEuler
+        build-args: |
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
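Reviewer note: the three `tags:` rules in the "Docker meta" step above map a branch, a pull request, or a version tag to an image tag with a fixed suffix. The sketch below models that mapping as the workflow comments describe it. It is not the `docker/metadata-action` implementation: the function name is hypothetical, and the regular expression is a simplified stand-in for the full PEP 440 grammar.

```python
import re

# Simplified PEP 440 check (assumption: covers only the forms named in the
# workflow comments: final, rc/a/b pre-release, .post and .dev releases).
_PEP440 = re.compile(r"^v?\d+(\.\d+)*((a|b|rc)\d+)?(\.post\d+)?(\.dev\d+)?$")


def image_tags(event, ref_name, suffix, pr_number=None):
    """Return the image tags expected for one trigger, or [] if none apply."""
    if event == "pr":
        # type=ref,event=pr  -> pr-<number><suffix>
        return [f"pr-{pr_number}{suffix}"]
    if event == "branch":
        # type=ref,event=branch -> <branch><suffix>
        return [f"{ref_name}{suffix}"]
    if event == "tag" and _PEP440.match(ref_name):
        # type=pep440,pattern={{raw}} -> <raw tag><suffix>
        return [f"{ref_name}{suffix}"]
    return []


suffix = "-310p-openeuler"
print(image_tags("branch", "main", suffix))
print(image_tags("pr", "", suffix, pr_number=42))
print(image_tags("tag", "v0.7.1rc1", suffix))
print(image_tags("tag", "not-a-version", suffix))
```

With `flavor: latest=false`, no `latest` tag is added on top of these, which matches the "no latest" remark in the comments.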
--- a/.github/workflows/image_310p_ubuntu.yml
+++ b/.github/workflows/image_310p_ubuntu.yml
@ -0,0 +1,113 @@
+name: 'image / Ubuntu / 310p'
+# This is a docker build check and publish job:
+# 1. PR Triggered docker image build check
+#   - is for image build check
+#   - Enable on main/*-dev branch
+#   - push: ${{ github.event_name != 'pull_request' }} ==> false
+# 2. branches push trigger image publish
+#   - is for branch/dev/nightly image
+#   - commits are merged into main/*-dev  ==> vllm-ascend:main-310p / vllm-ascend:*-dev-310p
+# 3. tags push trigger image publish
+#   - is for final release image
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3-310p / vllm-ascend:v1.2.3rc1-310p
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/image_310p_ubuntu.yml'
+      - 'Dockerfile.310p'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    # Publish image when tagging; the Dockerfile in the tag will be built as the tag image
+    branches:
+      - 'main'
+      - '*-dev'
+    tags:
+      - 'v*'
+    paths:
+      - '.github/workflows/image_310p_ubuntu.yml'
+      - 'Dockerfile.310p'
+      - 'vllm_ascend/**'
+jobs:
+
+  build:
+    name: vllm-ascend image build
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Print
+      run: |
+        lscpu
+
+    - name: Docker meta
+      id: meta
+      uses: docker/metadata-action@v5
+      with:
+        # TODO(yikun): add more hub image and a note on release policy for container image
+        images: |
+          quay.io/ascend/vllm-ascend
+        # Note for test case
+        # https://github.com/marketplace/actions/docker-metadata-action#typeref
+        # 1. the branch job publishes on every main/*-dev branch commit
+        # 2. main and dev pull_request runs are build-only, so the tag pr-N-310p is fine
+        # 3. only tags that match pep440 will be published:
+        #    - v0.7.1 --> v0.7.1-310p
+        #    - pre/post/dev: v0.7.1rc1-310p/v0.7.1rc1.dev1-310p/v0.7.1.post1-310p, no latest
+        #      which follows the vLLM rule of prefixing versions with v
+        # TODO(yikun): the post release might be considered as latest release
+        tags: |
+          type=ref,event=branch,suffix=-310p
+          type=ref,event=pr,suffix=-310p
+          type=pep440,pattern={{raw}},suffix=-310p
+        flavor:
+          latest=false
+
+    - name: Free up disk space
+      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
+      with:
+        tool-cache: true
+        docker-images: false
+
+    - name: Build - Set up QEMU
+      uses: docker/setup-qemu-action@v3
+
+    - name: Build - Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Publish - Login to Quay Container Registry
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+      uses: docker/login-action@v3
+      with:
+        registry: quay.io
+        username: ${{ vars.QUAY_USERNAME }}
+        password: ${{ secrets.QUAY_PASSWORD }}
+
+    - name: Build and push 310p
+      uses: docker/build-push-action@v6
+      with:
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/amd64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        file: Dockerfile.310p
+        # only trigger when tag, branch/main push
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+        labels: ${{ steps.meta.outputs.labels }}
+        tags: ${{ steps.meta.outputs.tags }}
+        build-args: |
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
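Reviewer note: the `platforms:`, `push:` and `runs-on:` expressions in these image workflows use the `condition && 'a' || 'b'` idiom to publish multi-arch images only for pushes to the `vllm-project` organization and to build a single architecture otherwise. The sketch below reproduces that selection logic in Python, whose `and`/`or` short-circuit in the same way. The function name is hypothetical.

```python
def select_for_publish(event_name, repository_owner, publish_value, fallback_value):
    """Return publish_value for upstream pushes, fallback_value otherwise."""
    is_publish = event_name == "push" and repository_owner == "vllm-project"
    # Same shape as `cond && 'a' || 'b'`: if publish_value were falsy
    # (for example an empty string), fallback_value would be returned instead.
    return (is_publish and publish_value) or fallback_value


multi_arch = "linux/amd64,linux/arm64"
print(select_for_publish("push", "vllm-project", multi_arch, "linux/amd64"))
print(select_for_publish("pull_request", "vllm-project", multi_arch, "linux/amd64"))
print(select_for_publish("push", "some-fork", multi_arch, "linux/amd64"))
```

The idiom is safe here because both branches are non-empty strings; it would silently pick the fallback if the first value could ever be empty.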
--- a/.github/workflows/image_a3_openeuler.yml
+++ b/.github/workflows/image_a3_openeuler.yml
@ -0,0 +1,117 @@
+name: 'image / openEuler / a3'
+# This is a docker build check and publish job:
+# 1. PR Triggered docker image build check
+#   - is for image build check
+#   - Enable on main/*-dev branch
+#   - push: ${{ github.event_name != 'pull_request' }} ==> false
+# 2. branches push trigger image publish
+#   - is for branch/dev/nightly image
+#   - commits are merged into main/*-dev  ==> vllm-ascend:main-a3-openeuler / vllm-ascend:*-dev-a3-openeuler
+# 3. tags push trigger image publish
+#   - is for final release image
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3-a3-openeuler / vllm-ascend:v1.2.3rc1-a3-openeuler
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/image_a3_openeuler.yml'
+      - 'Dockerfile.a3.openEuler'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    # Publish image when tagging; the Dockerfile in the tag will be built as the tag image
+    branches:
+      - 'main'
+      - '*-dev'
+    tags:
+      - 'v*'
+    paths:
+      - '.github/workflows/image_a3_openeuler.yml'
+      - 'Dockerfile.a3.openEuler'
+      - 'vllm_ascend/**'
+
+jobs:
+  build:
+    name: vllm-ascend image build
+    runs-on: >-
+      ${{
+          github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+          'ubuntu-latest' ||
+          'ubuntu-24.04-arm'
+      }}
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Print
+      run: |
+        lscpu
+    - name: Docker meta
+      id: meta
+      uses: docker/metadata-action@v5
+      with:
+        # TODO(yikun): add more hub image and a note on release policy for container image
+        images: |
+          quay.io/ascend/vllm-ascend
+        # Note for test case
+        # https://github.com/marketplace/actions/docker-metadata-action#typeref
+        # 1. the branch job publishes on every main/*-dev branch commit
+        # 2. main and dev pull_request runs are build-only, so the tag pr-N-a3-openeuler is fine
+        # 3. only tags that match pep440 will be published:
+        #    - v0.7.1 --> v0.7.1-a3-openeuler
+        #    - pre/post/dev: v0.7.1rc1-a3-openeuler/v0.7.1rc1.dev1-a3-openeuler/v0.7.1.post1-a3-openeuler, no latest
+        #      which follows the vLLM rule of prefixing versions with v
+        # TODO(yikun): the post release might be considered as latest release
+        tags: |
+          type=ref,event=branch,suffix=-a3-openeuler
+          type=ref,event=pr,suffix=-a3-openeuler
+          type=pep440,pattern={{raw}},suffix=-a3-openeuler
+        flavor:
+          latest=false
+
+    - name: Free up disk space
+      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
+      with:
+        tool-cache: true
+        docker-images: false
+
+    - name: Build - Set up QEMU
+      uses: docker/setup-qemu-action@v3
+
+    - name: Build - Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Publish - Login to Quay Container Registry
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+      uses: docker/login-action@v3
+      with:
+        registry: quay.io
+        username: ${{ vars.QUAY_USERNAME }}
+        password: ${{ secrets.QUAY_PASSWORD }}
+
+    - name: Build and push a3
+      uses: docker/build-push-action@v6
+      with:
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/arm64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        # only trigger when tag, branch/main push
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+        labels: ${{ steps.meta.outputs.labels }}
+        tags: ${{ steps.meta.outputs.tags }}
+        file: Dockerfile.a3.openEuler
+        build-args: |
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
+
--- a/.github/workflows/image_a3_ubuntu.yml
+++ b/.github/workflows/image_a3_ubuntu.yml
@ -0,0 +1,113 @@
+name: 'image / Ubuntu / a3'
+# This is a docker build check and publish job:
+# 1. PR Triggered docker image build check
+#   - is for image build check
+#   - Enable on main/*-dev branch
+#   - push: ${{ github.event_name != 'pull_request' }} ==> false
+# 2. branches push trigger image publish
+#   - is for branch/dev/nightly image
+#   - commits are merged into main/*-dev  ==> vllm-ascend:main-a3 / vllm-ascend:*-dev-a3
+# 3. tags push trigger image publish
+#   - is for final release image
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3-a3 / vllm-ascend:v1.2.3rc1-a3
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/image_a3_ubuntu.yml'
+      - 'Dockerfile.a3'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    # Publish image when tagging; the Dockerfile in the tag will be built as the tag image
+    branches:
+      - 'main'
+      - '*-dev'
+    tags:
+      - 'v*'
+    paths:
+      - '.github/workflows/image_a3_ubuntu.yml'
+      - 'Dockerfile.a3'
+      - 'vllm_ascend/**'
+jobs:
+
+  build:
+    name: vllm-ascend image build
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Print
+      run: |
+        lscpu
+    - name: Docker meta
+      id: meta
+      uses: docker/metadata-action@v5
+      with:
+        # TODO(yikun): add more hub image and a note on release policy for container image
+        images: |
+          quay.io/ascend/vllm-ascend
+        # Note for test case
+        # https://github.com/marketplace/actions/docker-metadata-action#typeref
+        # 1. the branch job publishes on every main/*-dev branch commit
+        # 2. main and dev pull_request runs are build-only, so the tag pr-N-a3 is fine
+        # 3. only tags that match pep440 will be published:
+        #    - v0.7.1 --> v0.7.1-a3
+        #    - pre/post/dev: v0.7.1rc1-a3/v0.7.1rc1.dev1-a3/v0.7.1.post1-a3, no latest
+        #      which follows the vLLM rule of prefixing versions with v
+        # TODO(yikun): the post release might be considered as latest release
+        tags: |
+          type=ref,event=branch,suffix=-a3
+          type=ref,event=pr,suffix=-a3
+          type=pep440,pattern={{raw}},suffix=-a3
+        flavor:
+          latest=false
+
+    - name: Free up disk space
+      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
+      with:
+        tool-cache: true
+        docker-images: false
+
+    - name: Build - Set up QEMU
+      uses: docker/setup-qemu-action@v3
+
+    - name: Build - Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Publish - Login to Quay Container Registry
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+      uses: docker/login-action@v3
+      with:
+        registry: quay.io
+        username: ${{ vars.QUAY_USERNAME }}
+        password: ${{ secrets.QUAY_PASSWORD }}
+
+    - name: Build and push a3
+      uses: docker/build-push-action@v6
+      with:
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/amd64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        file: Dockerfile.a3
+        # only trigger when tag, branch/main push
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+        labels: ${{ steps.meta.outputs.labels }}
+        tags: ${{ steps.meta.outputs.tags }}
+        build-args: |
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
+
--- a/.github/workflows/image_openeuler.yml
+++ b/.github/workflows/image_openeuler.yml
@ -0,0 +1,116 @@
+name: 'image / openEuler'
+# This is a docker build check and publish job:
+# 1. PR Triggered docker image build check
+#   - is for image build check
+#   - Enable on main/*-dev branch
+#   - push: ${{ github.event_name != 'pull_request' }} ==> false
+# 2. branches push trigger image publish
+#   - is for branch/dev/nightly image
+#   - commits are merged into main/*-dev  ==> vllm-ascend:main-openeuler / vllm-ascend:*-dev-openeuler
+#   - is for final release image
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3-openeuler / vllm-ascend:v1.2.3rc1-openeuler
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/image_openeuler.yml'
+      - 'Dockerfile.openEuler'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    # Publish image when tagging; the Dockerfile in the tag will be built as the tag image
+    branches:
+      - 'main'
+      - '*-dev'
+    tags:
+      - 'v*'
+    paths:
+      - '.github/workflows/image_openeuler.yml'
+      - 'Dockerfile.openEuler'
+      - 'vllm_ascend/**'
+
+jobs:
+  build:
+    name: vllm-ascend image build
+    runs-on: >-
+      ${{
+          github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+          'ubuntu-latest' ||
+          'ubuntu-24.04-arm'
+      }}
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Print
+      run: |
+        lscpu
+
+    - name: Docker meta
+      id: meta
+      uses: docker/metadata-action@v5
+      with:
+        # TODO(yikun): add more hub image and a note on release policy for container image
+        images: |
+          quay.io/ascend/vllm-ascend
+        # Note for test case
+        # https://github.com/marketplace/actions/docker-metadata-action#typeref
+        # 1. the branch job publishes on every main/*-dev branch commit
+        # 2. main and dev pull_request runs are build-only, so the tag pr-N-openeuler is fine
+        # 3. only tags that match pep440 will be published:
+        #    - v0.7.1 --> v0.7.1-openeuler
+        #    - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
+        #      which follows the vLLM rule of prefixing versions with v
+        # TODO(yikun): the post release might be considered as latest release
+        tags: |
+          type=ref,event=branch,suffix=-openeuler
+          type=ref,event=pr,suffix=-openeuler
+          type=pep440,pattern={{raw}},suffix=-openeuler
+        flavor:
+          latest=true
+
+    - name: Free up disk space
+      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
+      with:
+        tool-cache: true
+        docker-images: false
+
+    - name: Build - Set up QEMU
+      uses: docker/setup-qemu-action@v3
+
+    - name: Build - Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Publish - Login to Quay Container Registry
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+      uses: docker/login-action@v3
+      with:
+        registry: quay.io
+        username: ${{ vars.QUAY_USERNAME }}
+        password: ${{ secrets.QUAY_PASSWORD }}
+
+    - name: Build and push 910b
+      uses: docker/build-push-action@v6
+      with:
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/arm64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        # only trigger when tag, branch/main push
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
+        labels: ${{ steps.meta.outputs.labels }}
+        tags: ${{ steps.meta.outputs.tags }}
+        file: Dockerfile.openEuler
+        build-args: |
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
--- a/.github/workflows/image_ubuntu.yml
+++ b/.github/workflows/image_ubuntu.yml
@ -1,4 +1,4 @@
-name: 'image'
+name: 'image / Ubuntu'
 # This is a docker build check and publish job:
 # 1. PR Triggered docker image build check
 #   - is for image build check
@ -9,16 +9,22 @@ name: 'image'
 #   - commits are merge into main/*-dev  ==> vllm-ascend:main / vllm-ascend:*-dev
 # 3. tags push trigger image publish
 #   - is for final release image
-#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3|latest / vllm-ascend:v1.2.3rc1
+#   - Publish when tag with v* (pep440 version)  ===>  vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
 on:
  pull_request:
    branches:
      - 'main'
      - '*-dev'
    paths:
-      - '.github/workflows/image.yml'
+      - '.github/workflows/image_ubuntu.yml'
      - 'Dockerfile'
      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
  push:
    # Publish image when tagging, the Dockerfile in tag will be build as tag image
    branches:
@ -27,13 +33,13 @@ on:
    tags:
      - 'v*'
    paths:
-      - '.github/workflows/image.yml'
+      - '.github/workflows/image_ubuntu.yml'
      - 'Dockerfile'
      - 'vllm_ascend/**'
 jobs:

  build:
-    name: vllm-ascend image
+    name: vllm-ascend image build
    runs-on: ubuntu-latest

    steps:
@ -63,6 +69,8 @@ jobs:
            type=ref,event=branch
            type=ref,event=pr
            type=pep440,pattern={{raw}}
+        flavor:
+          latest=true

    - name: Free up disk space
      uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -71,31 +79,35 @@ jobs:
        docker-images: false

    - name: Build - Set up QEMU
-      uses: docker/setup-qemu-action@v2
-      # TODO(yikun): remove this after https://github.com/docker/setup-qemu-action/issues/198 resolved
-      with:
-        image: tonistiigi/binfmt:qemu-v7.0.0-28
+      uses: docker/setup-qemu-action@v3

    - name: Build - Set up Docker Buildx
-      uses: docker/setup-buildx-action@v2
+      uses: docker/setup-buildx-action@v3

    - name: Publish - Login to Quay Container Registry
-      if: ${{ github.event_name == 'push' }}
+      if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
      uses: docker/login-action@v3
      with:
        registry: quay.io
        username: ${{ vars.QUAY_USERNAME }}
        password: ${{ secrets.QUAY_PASSWORD }}

-    - name: Build and push
+    - name: Build and push 910b
      uses: docker/build-push-action@v6
      with:
-        platforms: linux/amd64,linux/arm64
-        cache-from: type=gha
-        cache-to: type=gha,mode=max
+        platforms: >-
+          ${{
+              github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
+              'linux/amd64,linux/arm64' ||
+              'linux/amd64'
+          }}
+        # use the current repo path as the build context, ensure .git is contained
+        context: .
+        file: Dockerfile
        # only trigger when tag, branch/main push
-        push: ${{ github.event_name != 'pull_request' }}
+        push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
        labels: ${{ steps.meta.outputs.labels }}
        tags: ${{ steps.meta.outputs.tags }}
        build-args: |
-            PIP_INDEX_URL=https://pypi.org/simple
+          PIP_INDEX_URL=https://pypi.org/simple
+        provenance: false
--- a/.github/workflows/label_merge_conflict.yml
+++ b/.github/workflows/label_merge_conflict.yml
@ -0,0 +1,21 @@
+name: "Merge Conflict Labeler"
+on:
+  # So that PRs touching the same files as the push are updated
+  push:
+  # So that the `dirtyLabel` is removed if conflicts are resolved
+  # We recommend `pull_request_target` so that github secrets are available.
+  # In `pull_request` we wouldn't be able to change labels of fork PRs
+  pull_request_target:
+    types: [synchronize]
+
+jobs:
+  main:
+    runs-on: ubuntu-latest
+    steps:
+      - name: check if prs are dirty
+        uses: eps1lon/actions-label-merge-conflict@v3
+        with:
+          dirtyLabel: "merge-conflicts"
+          removeOnDirtyLabel: "ready"
+          repoToken: "${{ secrets.GITHUB_TOKEN }}"
+          commentOnDirty: "This pull request has conflicts, please resolve those before we can evaluate the pull request."
--- a/.github/workflows/labeler.yml
+++ b/.github/workflows/labeler.yml
@ -0,0 +1,18 @@
+name: Pull Request Labeler
+
+on: pull_request_target
+
+jobs:
+  label:
+    name: Label
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      pull-requests: write
+    steps:
+      - name: Label the PR
+        uses: actions/labeler@v5
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          configuration-path: .github/labeler.yml
+          sync-labels: true
--- a/.github/workflows/mypy.yaml
+++ b/.github/workflows/mypy.yaml
@ -1,78 +0,0 @@
-#
-# Adapted from vllm-project/vllm/blob/main/.github
-# Copyright 2023 The vLLM team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-name: mypy
-
-on:
-  # Trigger the workflow on push or pull request,
-  # but only for the main branch
-  push:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - '**/*.py'
-      - '.github/workflows/mypy.yaml'
-      - 'tools/mypy.sh'
-      - 'mypy.ini'
-  pull_request:
-    branches:
-      - 'main'
-      - '*-dev'
-    # This workflow is only relevant when one of the following files changes.
-    # However, we have github configured to expect and require this workflow
-    # to run and pass before github with auto-merge a pull request. Until github
-    # allows more flexible auto-merge policy, we can just run this on every PR.
-    # It doesn't take that long to run, anyway.
-    paths:
-     - '**/*.py'
-     - '.github/workflows/mypy.yaml'
-     - 'tools/mypy.sh'
-     - 'mypy.ini'
-
-jobs:
-  mypy:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
-    steps:
-    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
-      with:
-        python-version: ${{ matrix.python-version }}
-    - name: Install dependencies
-      run: |
-        pip install -r requirements-dev.txt 
-
-    - name: Checkout vllm-project/vllm repo
-      uses: actions/checkout@v4
-      with:
-        repository: vllm-project/vllm
-        path: vllm-empty
-
-    - name: Install vllm-project/vllm from source
-      working-directory: vllm-empty
-      run: |
-        pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
-        VLLM_TARGET_DEVICE=empty pip install .
-
-    - name: Mypy
-      run: |
-        echo "::add-matcher::.github/workflows/matchers/mypy.json"
-        tools/mypy.sh 1 ${{ matrix.python-version }}
--- a/.github/workflows/nightly_benchmarks.yaml
+++ b/.github/workflows/nightly_benchmarks.yaml
@ -0,0 +1,207 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+name: 'Benchmarks / Performance'
+# This workflow runs nightly benchmarks for vllm-ascend.
+
+on:
+  schedule:
+    # Run benchmarks at 20:00 and 03:00 Beijing time (UTC+8)
+    - cron: "0 12 * * *"
+    - cron: "0 19 * * *"
+
+  workflow_dispatch:
+    # Allow manual triggering of the workflow
+
+  pull_request:
+    types: [ labeled ]
+
+# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
+# declared as "shell: bash -el {0}" on steps that need to be properly activated.
+# It's used to activate ascend-toolkit environment variables.
+defaults:
+  run:
+    shell: bash -el {0}
+
+# Only one job can run on static-8-01-cards at a time
+concurrency:
+  group: static-8-01-cards
+  cancel-in-progress: false
+
+jobs:
+  test:
+    if: ${{ contains(github.event.pull_request.labels.*.name, 'performance-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
+
+    name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}, use_v1=${{ matrix.vllm_use_v1 }}
+    runs-on: 'linux-arm64-npu-static-8'
+    strategy:
+      matrix:
+        include:
+          - vllm_branch: v0.9.2
+            vllm_ascend_branch: main
+            vllm_use_v1: 1
+      max-parallel: 1
+    container:
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      volumes:
+        - /usr/local/dcmi:/usr/local/dcmi
+        - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
+        - /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
+        # Use the self-hosted cache to speed up pip and model downloads
+        - /home/action/.cache:/github/home/.cache/
+      options: >-
+        --device /dev/davinci0
+        --device /dev/davinci1
+        --device /dev/davinci_manager
+        --device /dev/devmm_svm
+        --device /dev/hisi_hdc
+      env:
+        VLLM_USE_MODELSCOPE: True
+        ES_OM_DOMAIN: ${{ secrets.ES_OM_DOMAIN }}
+        ES_OM_AUTHORIZATION: ${{ secrets.ES_OM_AUTHORIZATION }}
+        VLLM_USE_V1: ${{ matrix.vllm_use_v1 }}
+    steps:
+      - name: Check npu and CANN info
+        run: |
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+      - name: Config mirrors
+        run: |
+          # keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
+          sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
+          pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+      - name: Install system dependencies
+        run: |
+          apt-get update -y
+          apt-get -y install git jq wget curl lsof gcc g++ cmake libnuma-dev
+
+      - name: Config git
+        run: |
+          git config --global --add safe.directory "$GITHUB_WORKSPACE"
+          git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          path: ./vllm-empty
+          ref: ${{  matrix.vllm_branch }}
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_TARGET_DEVICE=empty pip install -e .
+
+      - name: Install vllm-project/vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        run: |
+          pip install "transformers<=4.52.4"
+          pip install -e .
+          pip install -r benchmarks/requirements-bench.txt
+
+      - name: Run current commit benchmarks
+        if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
+        run: |
+          # Sometimes we only want to run benchmarks on the current commit
+          # This is useful for debugging or a release benchmark
+          bash benchmarks/scripts/run-performance-benchmarks.sh
+          # Convert the benchmark results to markdown format
+          python3 benchmarks/scripts/convert_json_to_markdown.py
+
+      - name: Generate step summary
+        if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
+        run: |
+          cat ./benchmarks/results/benchmark_results.md >> $GITHUB_STEP_SUMMARY
+
+      - name: Upload benchmark artifacts
+        if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
+        uses: actions/upload-artifact@v4
+        with:
+          name: "benchmark-performance-${{ matrix.vllm_branch }}-${{ matrix.vllm_ascend_branch }}-report"
+          path: ./benchmarks/results/benchmark_results.md
+          if-no-files-found: warn
+          retention-days: 90
+          overwrite: true
+
+      - name: Install elastic_tool
+        if: github.event_name != 'pull_request'
+        run: |
+          pip install escli-tool==0.2.3
+
+      - name: Collect pr info from vllm-project/vllm-ascend
+        if: github.event_name != 'pull_request'
+        run: |
+          # Only collect commits that may affect performance (Python changes outside docs, tests, examples, and benchmarks)
+          git log --pretty=format:"%H %s" -- '**/*.py' ':!docs/*' ':!tests/*' ':!examples/*' ':!benchmarks/*' > commit_log.txt
+          escli check commit_log.txt
+      
+      - name: Prepare benchmark script in advance
+        if: github.event_name != 'pull_request'
+        # The benchmark iteration checks out each commit in turn, which changes the benchmark scripts in the working tree.
+        # Copy them aside first so that the benchmark scripts are always available.
+        run: |
+          # Prepare the benchmark script in advance
+          mkdir -p /github/home/benchmarks
+          cp -r benchmarks/* /github/home/benchmarks/
+
+      - name: Run benchmark iteration
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        if: github.event_name != 'pull_request'
+        run: |
+          while IFS= read -r line || [[ -n "$line" ]]; do
+            commit_id=${line%% *}
+            commit_title=${line#* }
+
+            git checkout $commit_id
+            commit_time=$(git show -s --format=%cd --date=iso-strict "$commit_id")
+            commit_time_no_tz=${commit_time::19}
+            pip install -e .
+
+            echo "------------------------"
+            echo "commit_id: $commit_id"
+            echo "commit_title: $commit_title"
+            echo "commit_time: $commit_time_no_tz"
+            echo "vllm branch: ${{ matrix.vllm_branch }}"
+            echo "vllm-ascend branch: ${{ matrix.vllm_ascend_branch }}"
+            echo "------------------------"
+
+            cd /github/home
+            ERROR_MSG=""
+            if ! bash benchmarks/scripts/run-performance-benchmarks.sh; then
+              ERROR_MSG="Benchmark failed to run"
+            fi
+            # send the result to es
+            escli add --vllm_branch ${{ matrix.vllm_branch }} \
+            --vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
+            --commit_id $commit_id \
+            --commit_title "$commit_title" \
+            --created_at "$commit_time_no_tz" \
+            --res_dir ./benchmarks/results \
+            --error "$ERROR_MSG" \
+            --extra_feat '{"VLLM_USE_V1": "${{ matrix.vllm_use_v1 }}"}'
+            rm -rf ./benchmarks/results
+            cd -
+          done < commit_log.txt
--- a/.github/workflows/pre-commit.yml
+++ b/.github/workflows/pre-commit.yml
@ -0,0 +1,37 @@
+name: pre-commit
+
+on:
+    workflow_call:
+
+permissions:
+  contents: read
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+    - name: Checkout vllm-project/vllm-ascend repo
+      uses: actions/checkout@v4
+    - uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
+      with:
+        python-version: "3.10"
+    - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
+    - run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
+    - name: Checkout vllm-project/vllm repo
+      uses: actions/checkout@v4
+      with:
+        repository: vllm-project/vllm
+        path: ./vllm-empty
+    - name: Install vllm
+      working-directory: vllm-empty
+      run: |
+        pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
+        VLLM_TARGET_DEVICE=empty pip install .
+    - name: Install vllm-ascend dev
+      run: |
+        pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
+    - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
+      env:
+        SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
+      with:
+        extra_args: --all-files --hook-stage manual
--- a/.github/workflows/release_code.yml
+++ b/.github/workflows/release_code.yml
@ -0,0 +1,75 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+name: build / sdist
+
+on:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/release_code.yml'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    tags:
+      - 'v*'
+
+jobs:
+  build:
+    name: release code
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10"]
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+
+      - name: Print
+        run: |
+          lscpu
+      
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python3 -m pip install twine setuptools_scm
+
+      - name: Generate tar.gz
+        run: |
+          python3 setup.py sdist
+          ls dist
+
+      - name: Archive tar.gz
+        uses: actions/upload-artifact@v4
+        with:
+          name: vllm-ascend-src
+          path: dist/*
+
+      - name: Release
+        if: startsWith(github.ref, 'refs/tags/')
+        run: |
+          python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
--- a/.github/workflows/release_whl.yml
+++ b/.github/workflows/release_whl.yml
@ -0,0 +1,118 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+name: build / wheel
+
+on:
+  schedule:
+    # Runs at 23:00 UTC (7:00 AM Beijing) every day
+    - cron: '0 23 * * *'
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      - '.github/workflows/release_whl.yml'
+      - '.github/Dockerfile.buildwheel'
+      - 'vllm_ascend/**'
+      - 'setup.py'
+      - 'pyproject.toml'
+      - 'requirements.txt'
+      - 'cmake/**'
+      - 'CMakeLists.txt'
+      - 'csrc/**'
+  push:
+    tags:
+      - 'v*'
+
+jobs:
+  build:
+    name: build and release wheel
+    strategy:
+      matrix:
+        os: [ubuntu-24.04, ubuntu-24.04-arm]
+        # Pull requests build only the latest Python version; other events build all of them
+        python-version: ${{ fromJSON(
+          (github.event_name == 'pull_request' && '["3.11"]') ||
+          '["3.9", "3.10", "3.11"]'
+         ) }}
+    runs-on: ${{ matrix.os }}
+    steps:
+    - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+
+    - name: Print
+      run: |
+        lscpu
+        
+    - name: Build wheel
+      run: |
+        ls
+        docker build -f ./.github/Dockerfile.buildwheel \
+        --build-arg PY_VERSION=${{ matrix.python-version }} \
+        -t wheel:v1 .
+        docker run --rm \
+        -u $(id -u):$(id -g) \
+        -v $(pwd):/outpwd \
+        wheel:v1 \
+        bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
+        ls dist
+
+    - name: Set up Python ${{ matrix.python-version }}
+      if: startsWith(github.ref, 'refs/tags/')
+      uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
+      with:
+        python-version: ${{ matrix.python-version }}
+      
+    - name: Repair wheels with auditwheel
+      run: |
+        python3 -m pip install auditwheel
+        python3 -m pip install patchelf
+        mkdir -p dist/repaired
+        for whl in dist/*.whl; do
+          auditwheel repair "$whl" -w dist/repaired/ \
+          --exclude libplatform.so \
+          --exclude libregister.so \
+          --exclude libge_common_base.so \
+          --exclude libc10.so \
+          --exclude libc_sec.so \
+          --exclude "libascend*.so" \
+          --exclude "libtorch*.so"
+        done
+        rm -f dist/*.whl
+        mv dist/repaired/*.whl dist/
+        rmdir dist/repaired
+        ls dist
+
+    - name: Verify automatic platform tags
+      run: |
+        cd dist
+        for wheel in *.whl; do
+          echo "verification file: $wheel"
+          auditwheel show "$wheel"
+        done
+
+    - name: Archive wheel
+      uses: actions/upload-artifact@v4
+      with:
+        name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
+        path: dist/*
+
+    - name: Release
+      if: startsWith(github.ref, 'refs/tags/')
+      run: |
+        python3 -m pip install twine
+        python3 -m twine upload --verbose dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
--- a/.github/workflows/ruff.yml
+++ b/.github/workflows/ruff.yml
@ -1,59 +0,0 @@
-#
-# Adapted from vllm-project/vllm/blob/main/.github
-# Copyright 2023 The vLLM team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-name: ruff
-
-on:
-  # Trigger the workflow on push or pull request,
-  # but only for the main branch
-  push:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - "**/*.py"
-      - requirements-lint.txt
-      - .github/workflows/matchers/ruff.json
-      - .github/workflows/ruff.yml
-  pull_request:
-    branches:
-      - 'main'
-      - '*-dev'
-
-jobs:
-  ruff:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.12"]
-    steps:
-      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r requirements-lint.txt
-      - name: Analysing the code with ruff
-        run: |
-          echo "::add-matcher::.github/workflows/matchers/ruff.json"
-          ruff check --output-format github .
-      - name: Run isort
-        run: |
-          isort . --check-only
--- a/.github/workflows/shellcheck.yml
+++ b/.github/workflows/shellcheck.yml
@ -1,56 +0,0 @@
-#
-# Adapted from vllm-project/vllm/blob/main/.github
-# Copyright 2023 The vLLM team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-name: Lint shell scripts
-on:
-  push:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - '**/*.sh'
-      - '.github/workflows/shellcheck.yml'
-  pull_request:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - '**/*.sh'
-      - '.github/workflows/shellcheck.yml'
-
-env:
-  LC_ALL: en_US.UTF-8
-
-defaults:
-  run:
-    shell: bash
-
-permissions:
-  contents: read
-
-jobs:
-  shellcheck:
-    runs-on: ubuntu-latest
-    steps:
-      - name: "Checkout"
-        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-        with:
-          fetch-depth: 0
-
-      - name: "Check shell scripts"
-        run: |
-          tools/shellcheck.sh
--- a/.github/workflows/vllm_ascend_doctest.yaml
+++ b/.github/workflows/vllm_ascend_doctest.yaml
@ -0,0 +1,87 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+name: 'e2e test / doctest'
+
+on:
+  workflow_dispatch:
+  pull_request:
+    branches:
+      - 'main'
+      - '*-dev'
+    paths:
+      # If we are changing the doctest we should do a PR test
+      - '.github/workflows/vllm_ascend_doctest.yaml'
+      - 'tests/e2e/doctests/**'
+      - 'tests/e2e/common.sh'
+      - 'tests/e2e/run_doctests.sh'
+  schedule:
+    # Runs every 12 hours
+    - cron:  '0 */12 * * *'
+
+# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
+# declared as "shell: bash -el {0}" on steps that need to be properly activated.
+# It's used to activate ascend-toolkit environment variables.
+defaults:
+  run:
+    shell: bash -el {0}
+
+jobs:
+  test:
+    strategy:
+      # Each version should be tested
+      fail-fast: false
+      matrix:
+        vllm_version: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
+    name: vLLM Ascend test
+    runs-on: linux-arm64-npu-1
+    container:
+      image: m.daocloud.io/quay.io/ascend/vllm-ascend:${{ matrix.vllm_version }}
+    steps:
+      - name: Check NPU/CANN and git info
+        run: |
+          echo "====> Print NPU/CANN info"
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+          echo "====> Print vllm-ascend git info"
+          cd /vllm-workspace/vllm-ascend
+          git --no-pager log -1 || true
+          echo "====> Print vllm git info"
+          cd /vllm-workspace/vllm
+          git --no-pager log -1 || true
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+
+      - name: Run vllm-ascend/tests/e2e/run_doctests.sh
+        run: |
+          # PWD: /__w/vllm-ascend/vllm-ascend
+          # Make sure the e2e tests are up to date
+          echo "Replacing /vllm-workspace/vllm-ascend/tests/e2e ..."
+          rm -rf /vllm-workspace/vllm-ascend/tests/e2e
+          mkdir -p /vllm-workspace/vllm-ascend/tests
+          # Overwrite e2e and examples
+          cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
+          cp -r examples /vllm-workspace/vllm-ascend/
+
+          # Simulate entering the container's working directory
+          cd /workspace
+
+          # Run real test
+          echo "Test:"
+          /vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
--- a/.github/workflows/vllm_ascend_test.yaml
+++ b/.github/workflows/vllm_ascend_test.yaml
@ -1,6 +1,5 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
-# This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@ -13,29 +12,19 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+# This file is a part of the vllm-ascend project.
 #

-name: 'e2e test'
+name: 'test'

 on:
  push:
    branches:
      - 'main'
-      - '*-dev'
-    paths:
-      - '*.txt'
-      - '**/*.py'
-      - '.github/workflows/vllm_ascend_test.yaml'
-      - '!docs/**'
  pull_request:
    branches:
      - 'main'
      - '*-dev'
-    paths:
-      - '*.txt'
-      - '**/*.py'
-      - '.github/workflows/vllm_ascend_test.yaml'
-      - '!docs/**'

 # Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
 # declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -44,26 +33,119 @@ defaults:
  run:
    shell: bash -el {0}

-jobs:
-  test:
-    name: vLLM Ascend test (self-host)
-    runs-on: ascend-arm64  # actionlint-ignore: runner-label
+# Only cancel in-progress runs of the same workflow and ref,
+# regardless of test type (lint / 1 card / 4 cards)
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true

+jobs:
+  lint:
+    uses: ./.github/workflows/pre-commit.yml
+
+  changes:
+    runs-on: ubuntu-latest
+    outputs:
+      e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
+      ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dorny/paths-filter@v3
+        id: filter
+        with:
+          filters: |
+            e2e_tracker:
+              - '.github/workflows/vllm_ascend_test.yaml'
+              - 'vllm_ascend/**'
+              - 'csrc/**'
+              - 'cmake/**'
+              - 'tests/e2e/**'
+              - 'CMakeLists.txt'
+              - 'setup.py'
+              - 'requirements.txt'
+              - 'requirements-dev.txt'
+              - 'requirements-lint.txt'
+              - 'packages.txt'
+            ut_tracker:
+              - 'tests/ut/**'
+  ut:
+    needs: [lint, changes]
+    name: unit test
+    # Only run the unit tests after lint has passed and the change is e2e- or ut-related.
+    if: ${{ needs.lint.result == 'success' && (needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.ut_tracker == 'true') }}
+    runs-on: ubuntu-latest
    container:
-      image: quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
-      volumes:
-        - /usr/local/dcmi:/usr/local/dcmi
-        - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
-        - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
-        # Use self-host cache speed up pip and model download
-        - /home/action/actions-runner/_work/cache:/github/home/.cache/
-      options: >-
-        --device /dev/davinci6
-        --device /dev/davinci_manager
-        --device /dev/devmm_svm
-        --device /dev/hisi_hdc
+      image: quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
      env:
-        HF_ENDPOINT: https://hf-mirror.com
+        VLLM_LOGGING_LEVEL: ERROR
+        VLLM_USE_MODELSCOPE: True
+    strategy:
+      matrix:
+        vllm_version: [main, v0.9.2]
+    steps:
+      - name: Install packages
+        run: |
+          apt-get update -y
+          apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          ref: ${{ matrix.vllm_version }}
+          path: ./vllm-empty
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_TARGET_DEVICE=empty python3 -m pip install . --extra-index https://download.pytorch.org/whl/cpu/
+          python3 -m pip uninstall -y triton
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+
+      - name: Install vllm-project/vllm-ascend
+        run: |
+          export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
+          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
+          python3 -m pip install -r requirements-dev.txt --extra-index https://download.pytorch.org/whl/cpu/
+          python3 -m pip install -v . --extra-index https://download.pytorch.org/whl/cpu/
+
+      - name: Run unit test
+        env:
+          VLLM_WORKER_MULTIPROC_METHOD: spawn
+          TORCH_DEVICE_BACKEND_AUTOLOAD: 0
+        run: |
+          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
+          pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut
+
+      - name: Upload coverage to Codecov
+        if: ${{ matrix.vllm_version == 'main' }}
+        uses: codecov/codecov-action@v5
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+        with:
+          flags: unittests
+          name: vllm-ascend
+          verbose: true
+
+  e2e:
+    needs: [lint, changes]
+    # Only run the e2e tests on pull requests, after lint has passed and the change is e2e-related.
+    if: ${{ github.event_name == 'pull_request' && needs.lint.result == 'success' && needs.changes.outputs.e2e_tracker == 'true' }}
+    strategy:
+      max-parallel: 2
+      matrix:
+        os: [linux-arm64-npu-1]
+        vllm_version: [main, v0.9.2]
+    name: singlecard e2e test
+    runs-on: ${{ matrix.os }}
+    container:
+      # TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      env:
+        VLLM_LOGGING_LEVEL: ERROR
+        VLLM_USE_MODELSCOPE: True
    steps:
      - name: Check npu and CANN info
        run: |
@ -72,25 +154,25 @@ jobs:

      - name: Config mirrors
        run: |
-          sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
-          pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
+          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
+          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
+          apt-get update -y
+          apt install git -y

      - name: Checkout vllm-project/vllm-ascend repo
        uses: actions/checkout@v4

      - name: Install system dependencies
        run: |
-          apt-get update -y
          apt-get -y install `cat packages.txt`
-
-      - name: Install dependencies
-        run: |
-          pip install -r requirements-dev.txt
+          apt-get -y install gcc g++ cmake libnuma-dev

      - name: Checkout vllm-project/vllm repo
        uses: actions/checkout@v4
        with:
          repository: vllm-project/vllm
+          ref: ${{ matrix.vllm_version }}
          path: ./vllm-empty

      - name: Install vllm-project/vllm from source
@ -99,23 +181,106 @@ jobs:
          VLLM_TARGET_DEVICE=empty pip install -e .

      - name: Install vllm-project/vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
        run: |
-          pip install -e .
+          pip install -r requirements-dev.txt
+          pip install -v -e .

-      - name: Install pta
+      - name: Run e2e test
+        env:
+          VLLM_WORKER_MULTIPROC_METHOD: spawn
+          VLLM_USE_MODELSCOPE: True
        run: |
-          mkdir pta
-          cd pta
-          wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
-          tar -xvf pytorch_v2.5.1_py310.tar.gz
-          pip install ./torch_npu-2.5.1.dev20250218-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
-          cd ..
-          rm -rf pta
+          pytest -sv tests/e2e/singlecard/test_offline_inference.py
+          pytest -sv tests/e2e/singlecard/test_ilama_lora.py
+          pytest -sv tests/e2e/singlecard/test_guided_decoding.py
+          pytest -sv tests/e2e/singlecard/test_camem.py
+          pytest -sv tests/e2e/singlecard/test_embedding.py
+          pytest -sv tests/e2e/singlecard/ \
+          --ignore=tests/e2e/singlecard/test_offline_inference.py \
+          --ignore=tests/e2e/singlecard/test_ilama_lora.py \
+          --ignore=tests/e2e/singlecard/test_guided_decoding.py \
+          --ignore=tests/e2e/singlecard/test_camem.py \
+          --ignore=tests/e2e/singlecard/test_embedding.py \
+          --ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py \
+          --ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
+          # ------------------------------------ v1 spec decode test ------------------------------------ #
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
+          # TODO: revert me when test_v1_spec_decode.py::test_ngram_correctness is fixed
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
+
+  e2e-4-cards:
+    needs: [e2e]
+    if: ${{ needs.e2e.result == 'success' }}
+    strategy:
+      max-parallel: 1
+      matrix:
+        os: [linux-arm64-npu-4]
+        vllm_version: [main, v0.9.2]
+    name: multicard e2e test
+    runs-on: ${{ matrix.os }}
+    container:
+      # TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      env:
+        VLLM_LOGGING_LEVEL: ERROR
+        VLLM_USE_MODELSCOPE: True
+    steps:
+      - name: Check npu and CANN info
+        run: |
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+      - name: Config mirrors
+        run: |
+          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
+          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
+          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
+          apt-get update -y
+          apt install git -y
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+
+      - name: Install system dependencies
+        run: |
+          apt-get -y install `cat packages.txt`
+          apt-get -y install gcc g++ cmake libnuma-dev
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          ref: ${{ matrix.vllm_version }}
+          path: ./vllm-empty
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_TARGET_DEVICE=empty pip install -e .
+
+      - name: Install vllm-project/vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        run: |
+          pip install -r requirements-dev.txt
+          pip install -v -e .

      - name: Run vllm-project/vllm-ascend test
+        env:
+          VLLM_WORKER_MULTIPROC_METHOD: spawn
+          VLLM_USE_MODELSCOPE: True
        run: |
-          pytest -sv tests
-
-      - name: Run vllm-project/vllm test
-        run: |
-          pytest -sv
+          pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
+          # FIXME: running the whole file with VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py raises an error.
+          # To avoid OOM, run each test case in its own pytest process.
+          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
+          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
+          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W8A8
+          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_dbo
+          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeekV3_dbo
+          pytest -sv tests/e2e/multicard/test_data_parallel.py
+          pytest -sv tests/e2e/multicard/ --ignore=tests/e2e/multicard/test_ilama_lora_tp2.py \
+            --ignore=tests/e2e/multicard/test_offline_inference_distributed.py \
+            --ignore=tests/e2e/multicard/test_data_parallel.py
--- a/.github/workflows/vllm_ascend_test_long_term.yaml
+++ b/.github/workflows/vllm_ascend_test_long_term.yaml
@ -0,0 +1,103 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+name: 'e2e test / long-term-test'
+
+on:
+  schedule:
+    # Runs at 23:00 UTC (7:00 AM Beijing) every day
+    - cron: '0 23 * * *'
+  pull_request:
+    types: [ labeled ]
+
+# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
+# declared as "shell: bash -el {0}" on steps that need to be properly activated.
+# It's used to activate ascend-toolkit environment variables.
+defaults:
+  run:
+    shell: bash -el {0}
+
+# only cancel in-progress runs of the same workflow
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  long-term-test:
+    # long-term-test runs when a pull request carries both the 'long-term-test' and 'ready-for-test' labels, or on the scheduled job
+    if: ${{ contains(github.event.pull_request.labels.*.name, 'long-term-test')  && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
+    strategy:
+      max-parallel: 2
+      matrix:
+        os: [linux-arm64-npu-1, linux-arm64-npu-4]
+        vllm_version: [main, v0.9.2]
+    name: vLLM Ascend long term test
+    runs-on: ${{ matrix.os }}
+    container:
+      # TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      env:
+        VLLM_LOGGING_LEVEL: ERROR
+        VLLM_USE_MODELSCOPE: True
+    steps:
+      - name: Check npu and CANN info
+        run: |
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+      - name: Config mirrors
+        run: |
+          sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
+          pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
+          pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
+          apt-get update -y
+          apt install git -y
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+
+      - name: Install system dependencies
+        run: |
+          apt-get -y install `cat packages.txt`
+          apt-get -y install gcc g++ cmake libnuma-dev
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          ref: ${{ matrix.vllm_version }}
+          path: ./vllm-empty
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_TARGET_DEVICE=empty pip install -e .
+
+      - name: Install vllm-project/vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        run: |
+          pip install -r requirements-dev.txt
+          pip install -v -e .
+
+      - name: Run vllm-project/vllm-ascend long term test
+        run: |
+          if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
+            pytest -sv tests/e2e/long_term/accuracy/accuracy_singlecard.py
+          else
+            # accuracy test multi card
+            pytest -sv tests/e2e/long_term/accuracy/accuracy_multicard.py
+          fi
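The `if:` gate of the long-term-test job can be read as a simple predicate. The following is a minimal Python sketch for illustration only: `should_run` and its parameters are hypothetical names, not part of the workflow or of GitHub Actions. It assumes `&&` binds tighter than `||`, as in the GitHub Actions expression syntax, and that a scheduled run carries no pull-request labels.

```python
def should_run(event_name, pr_labels,
               required=("long-term-test", "ready-for-test")):
    """Mirror the workflow gate (hypothetical helper, not a GitHub API).

    Scheduled runs always execute; pull-request runs execute only when
    every required label is present on the PR.
    """
    if event_name == "schedule":
        return True
    return all(label in pr_labels for label in required)
```

A PR that carries only one of the two labels is skipped; adding the second label fires the `labeled` event again and the gate then passes.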
--- a/.github/workflows/vllm_ascend_test_pd.yaml
+++ b/.github/workflows/vllm_ascend_test_pd.yaml
@ -0,0 +1,112 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# This file is a part of the vllm-ascend project.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+name: 'e2e test / pd-disaggregation'
+
+on:
+  schedule:
+    # Runs at 23:00 UTC (7:00 AM Beijing) every day
+    - cron: '0 23 * * *'
+  pull_request:
+    types: [ labeled ]
+
+# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
+# declared as "shell: bash -el {0}" on steps that need to be properly activated.
+# It's used to activate ascend-toolkit environment variables.
+defaults:
+  run:
+    shell: bash -el {0}
+
+# only 1 job can run on static-8-01-cards at a time
+concurrency:
+  group: static-8-01-cards
+  cancel-in-progress: false
+
+jobs:
+  prefilling-decoding-disaggregation:
+    # pd-test runs on the daily schedule, or on a PR labeled with both 'pd-test' and 'ready-for-test'
+    if: ${{ (contains(github.event.pull_request.labels.*.name, 'pd-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) || github.event_name == 'schedule' }}
+    strategy:
+      matrix:
+        vllm_verison: [
+            # revert me when V1 disaggregation prefill is merged in main
+            # main, 
+            v0.9.1
+          ]
+    name: vLLM Ascend prefilling decoding disaggregation test
+    runs-on: linux-arm64-npu-static-8
+
+    container:
+      image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
+      volumes:
+        - /usr/local/dcmi:/usr/local/dcmi
+        - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
+        - /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
+        # Use the self-hosted cache to speed up pip and model downloads
+        - /home/action/.cache:/github/home/.cache/
+      options: >-
+        --device /dev/davinci0
+        --device /dev/davinci1
+        --device /dev/davinci_manager
+        --device /dev/devmm_svm
+        --device /dev/hisi_hdc
+      env:
+        VLLM_USE_MODELSCOPE: True
+    steps:
+      - name: Check npu and CANN info
+        run: |
+          npu-smi info
+          cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
+
+      - name: Config mirrors
+        run: |
+          # keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
+          sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
+          pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+          apt-get update -y
+          apt install git -y
+          git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
+
+      - name: Checkout vllm-project/vllm-ascend repo
+        uses: actions/checkout@v4
+
+      - name: Install system dependencies
+        run: |
+          apt-get -y install `cat packages.txt`
+          apt-get -y install gcc g++ cmake libnuma-dev
+
+      - name: Checkout vllm-project/vllm repo
+        uses: actions/checkout@v4
+        with:
+          repository: vllm-project/vllm
+          ref: ${{ matrix.vllm_verison }}
+          path: ./vllm-empty
+
+      - name: Install vllm-project/vllm from source
+        working-directory: ./vllm-empty
+        run: |
+          VLLM_TARGET_DEVICE=empty pip install -e .
+
+      - name: Install vllm-project/vllm-ascend
+        env:
+          PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
+        run: |
+          pip install -r requirements-dev.txt
+          pip install -v -e .
+
+      - name: Run vllm-project/vllm-ascend PD Disaggregation test
+        run: |
+          pytest -sv tests/e2e/pd_disaggreate/test_pd_e2e.py
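Both scheduled workflows above use `cron: '0 23 * * *'`, which GitHub Actions evaluates in UTC. The sketch below checks the "7:00 AM Beijing" comment by converting the UTC hour to a fixed UTC+8 offset. `cron_utc_to_beijing` is a hypothetical helper written for this illustration, and the calendar date used inside it is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Beijing time is a fixed UTC+8 offset with no daylight saving.
BEIJING = timezone(timedelta(hours=8))


def cron_utc_to_beijing(hour_utc, minute_utc=0):
    """Return (HH:MM in Beijing, day offset relative to the UTC date)."""
    utc_time = datetime(2025, 1, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    local = utc_time.astimezone(BEIJING)
    day_offset = (local.date() - utc_time.date()).days
    return local.strftime("%H:%M"), day_offset
```

For the 23:00 UTC schedule this yields 07:00 on the following calendar day in Beijing, which matches the comment in the workflow files.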
--- a/.github/workflows/yapf.yml
+++ b/.github/workflows/yapf.yml
@ -1,57 +0,0 @@
-#
-# Adapted from vllm-project/vllm/blob/main/.github
-# Copyright 2023 The vLLM team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-name: yapf
-
-on:
-  # Trigger the workflow on push or pull request,
-  # but only for the main branch
-  push:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - "**/*.py"
-      - .github/workflows/yapf.yml
-  pull_request:
-    branches:
-      - 'main'
-      - '*-dev'
-    paths:
-      - "**/*.py"
-      - .github/workflows/yapf.yml
-
-jobs:
-  yapf:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.12"]
-    steps:
-      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install toml
-          pip install yapf==0.32.0
-      - name: Running yapf
-        run: |
-          yapf --diff --recursive .
--- a/.gitignore
+++ b/.gitignore
@ -196,3 +196,9 @@ kernel_meta/

 # version file generated by setuptools-scm
 /vllm_ascend/_version.py
+# build info file generated by setup.py
+/vllm_ascend/_build_info.py
+/vllm_ascend/include/
+
+# generated by CANN
+fusion_result.json
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -0,0 +1,141 @@
+default_install_hook_types:
+  - pre-commit
+  - commit-msg
+default_stages:
+  - pre-commit # Run locally
+  - manual # Run in CI
+exclude: 'examples/.*' # Exclude examples from all hooks by default
+repos:
+- repo: https://github.com/codespell-project/codespell
+  rev: v2.4.1
+  hooks:
+    - id: codespell
+      args: [
+        --toml, pyproject.toml,
+        '--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
+        '-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn'
+      ]
+      additional_dependencies:
+        - tomli
+- repo: https://github.com/google/yapf
+  rev: v0.43.0
+  hooks:
+  - id: yapf
+    args: [--in-place, --verbose]
+    # Keep the same list from yapfignore here to avoid yapf failing without any inputs
+    exclude: '(.github|benchmarks|examples|docs)/.*'
+- repo: https://github.com/astral-sh/ruff-pre-commit
+  rev: v0.11.7
+  hooks:
+  - id: ruff
+    args: [--output-format, github, --fix]
+  - id: ruff-format
+    files: ^(benchmarks|examples)/.*
+- repo: https://github.com/crate-ci/typos
+  rev: v1.32.0
+  hooks:
+  - id: typos
+- repo: https://github.com/PyCQA/isort
+  rev: 6.0.1
+  hooks:
+  - id: isort
+# - repo: https://github.com/pre-commit/mirrors-clang-format
+#   rev: v20.1.3
+#   hooks:
+#   - id: clang-format
+#     files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
+#     types_or: [c++]
+#     args: [--style=google, --verbose]
+# - repo: https://github.com/jackdewinter/pymarkdown
+#   rev: v0.9.29
+#   hooks:
+#   - id: pymarkdown
+#     args: [fix]
+- repo: https://github.com/rhysd/actionlint
+  rev: v1.7.7
+  hooks:
+  - id: actionlint
+- repo: local
+  hooks:
+  # For local development, you can run mypy using tools/mypy.sh script if needed.
+  # - id: mypy-local
+  #   name: Run mypy for local Python installation
+  #   entry: tools/mypy.sh 0 "local"
+  #   language: system
+  #   types: [python]
+  #   stages: [pre-commit] # Don't run in CI
+  - id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
+    name: Run mypy for Python 3.9
+    entry: tools/mypy.sh 1 "3.9"
+    # Use system python because vllm installation is required
+    language: system
+    types: [python]
+    stages: [manual] # Only run in CI
+  - id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
+    name: Run mypy for Python 3.10
+    entry: tools/mypy.sh 1 "3.10"
+    # Use system python because vllm installation is required
+    language: system
+    types: [python]
+    stages: [manual] # Only run in CI
+  - id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
+    name: Run mypy for Python 3.11
+    entry: tools/mypy.sh 1 "3.11"
+    # Use system python because vllm installation is required
+    language: system
+    types: [python]
+    stages: [manual] # Only run in CI
+  - id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
+    name: Run mypy for Python 3.12
+    entry: tools/mypy.sh 1 "3.12"
+    # Use system python because vllm installation is required
+    language: system
+    types: [python]
+    stages: [manual] # Only run in CI
+  # FIXME: enable shellcheck
+  # - id: shellcheck
+  #   name: Lint shell scripts
+  #   entry: tools/shellcheck.sh
+  #   language: script
+  #   types: [shell]
+  - id: png-lint
+    name: Lint PNG exports from excalidraw
+    entry: tools/png-lint.sh
+    language: script
+    types: [png]
+  - id: signoff-commit
+    name: Sign-off Commit
+    entry: bash
+    args:
+      - -c
+      - |
+        if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" "$(git rev-parse --git-path COMMIT_EDITMSG)"; then
+          printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> "$(git rev-parse --git-path COMMIT_EDITMSG)"
+        fi
+    language: system
+    verbose: true
+    stages: [commit-msg]
+  - id: check-filenames
+    name: Check for spaces in all filenames
+    entry: bash
+    args:
+      - -c
+      - 'git ls-files | grep " " && echo "Filenames should not contain spaces!" && exit 1 || exit 0'
+    language: system
+    always_run: true
+    pass_filenames: false
+  - id: enforce-import-regex-instead-of-re
+    name: Enforce import regex as re
+    entry: python tools/enforce_regex_import.py
+    language: python
+    types: [python]
+    pass_filenames: false
+    additional_dependencies: [regex]
+  # Keep `suggestion` last
+  - id: suggestion
+    name: Suggestion
+    entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
+    language: system
+    verbose: true
+    pass_filenames: false
+  # Insert new entries above the `suggestion` entry
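The `signoff-commit` hook in the configuration above appends a `Signed-off-by:` trailer to the commit message only when one for the current user is not already present. A minimal Python sketch of that idempotent behavior follows. `ensure_signoff` is a hypothetical function for illustration; the real hook is the inline bash shown in the config, which reads the name and email from `git config` and uses `grep`, so it matches the name as a regular expression rather than as a literal prefix.

```python
def ensure_signoff(message, name, email):
    """Append a Signed-off-by trailer unless a line already starts with it."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if any(line.startswith(trailer) for line in message.splitlines()):
        return message  # already signed off; leave the message untouched
    # Same shape as the hook's printf: newline, trailer, newline.
    return f"{message}\n{trailer}\n"
```

Running the function twice on the same message produces the same result as running it once, which is why the hook can safely run on every commit, including amended ones.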
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -0,0 +1,98 @@
+cmake_minimum_required(VERSION 3.16)
+project(vllm_ascend_C)
+
+# include(CheckCXXCompilerFlag)
+# check_cxx_compiler_flag("-std=c++17", COMPILER_SUPPORTS_CXX17)
+set(CMAKE_CXX_STANDARD 17)
+
+include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
+
+# Suppress potential warnings about unused manually-specified variables
+set(ignoreMe "${VLLM_PYTHON_PATH}")
+
+# TODO: Add 3.12 back when torch-npu supports 3.12
+set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11")
+
+find_package(pybind11 REQUIRED)
+
+append_cmake_prefix_path("torch" "torch.utils.cmake_prefix_path")
+set(VLLM_ASCEND_INSTALL_PATH "${CMAKE_INSTALL_PREFIX}")
+
+find_package(Torch REQUIRED)
+
+set(RUN_MODE "npu" CACHE STRING "cpu/sim/npu")
+set(SOC_VERSION ${SOC_VERSION})
+message(STATUS "Detected SOC version: ${SOC_VERSION}")
+
+if (NOT CMAKE_BUILD_TYPE)
+  set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Build type Release/Debug (default Release)" FORCE)
+endif()
+
+if (CMAKE_INSTALL_PREFIX STREQUAL /usr/local)
+  set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_LIST_DIR}/out" CACHE STRING "path to install()")
+endif()
+
+set(ASCEND_CANN_PACKAGE_PATH ${ASCEND_HOME_PATH})
+if(EXISTS ${ASCEND_HOME_PATH}/tools/tikcpp/ascendc_kernel_cmake)
+    set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/tools/tikcpp/ascendc_kernel_cmake)
+elseif(EXISTS ${ASCEND_HOME_PATH}/compiler/tikcpp/ascendc_kernel_cmake)
+    set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/compiler/tikcpp/ascendc_kernel_cmake)
+elseif(EXISTS ${ASCEND_HOME_PATH}/ascendc_devkit/tikcpp/samples/cmake)
+    set(ASCENDC_CMAKE_DIR ${ASCEND_HOME_PATH}/ascendc_devkit/tikcpp/samples/cmake)
+else()
+    message(FATAL_ERROR "ascendc_kernel_cmake does not exist, please check whether the cann package is installed.")
+endif()
+
+include(${ASCENDC_CMAKE_DIR}/ascendc.cmake)
+file(GLOB KERNEL_FILES
+${CMAKE_CURRENT_SOURCE_DIR}/csrc/kernels/*.cpp)
+
+ascendc_library(vllm_ascend_kernels SHARED
+    ${KERNEL_FILES}
+)
+
+message("TORCH_NPU_PATH is ${TORCH_NPU_PATH}")
+
+file(GLOB VLLM_ASCEND_SRC
+${CMAKE_CURRENT_SOURCE_DIR}/csrc/*.cpp)
+
+include_directories(
+  ${pybind11_INCLUDE_DIRS}
+  ${PYTHON_INCLUDE_PATH}
+  ${TORCH_INCLUDE_DIRS}
+  ${TORCH_NPU_PATH}/include
+  ${ASCEND_HOME_PATH}/include
+  ${ASCEND_HOME_PATH}/aarch64-linux/include/experiment/platform
+  ${ASCEND_HOME_PATH}/x86_64-linux/include/experiment/platform
+)
+
+set(
+  INCLUDES
+  ${TORCH_INCLUDE_DIRS}
+  ${TORCH_NPU_INCLUDE_DIRS}
+  ${ASCEND_HOME_PATH}/include
+  ${ASCEND_HOME_PATH}/aarch64-linux/include/experiment/platform
+)
+
+pybind11_add_module(vllm_ascend_C ${VLLM_ASCEND_SRC})
+
+target_link_directories(
+  vllm_ascend_C
+  PRIVATE
+  ${TORCH_NPU_PATH}/lib/
+  ${ASCEND_HOME_PATH}/lib64
+)
+
+target_link_libraries(
+  vllm_ascend_C
+  PUBLIC
+  ${TORCH_LIBRARIES}
+  libtorch_npu.so
+  vllm_ascend_kernels
+  ascendcl
+  platform
+)
+
+target_link_options(vllm_ascend_C PRIVATE "-Wl,-rpath,$ORIGIN:$ORIGIN/lib")
+
+install(TARGETS vllm_ascend_C vllm_ascend_kernels DESTINATION ${VLLM_ASCEND_INSTALL_PATH})
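The `if(EXISTS ...) / elseif / else` chain in the CMakeLists.txt above probes three locations under `ASCEND_HOME_PATH` in order and fails if none exists. The Python sketch below restates that lookup for illustration. `find_ascendc_cmake_dir` is a hypothetical helper; only the three relative paths are taken from the CMake file.

```python
import os

# Relative locations probed by CMakeLists.txt, in the same order.
CANDIDATES = (
    "tools/tikcpp/ascendc_kernel_cmake",
    "compiler/tikcpp/ascendc_kernel_cmake",
    "ascendc_devkit/tikcpp/samples/cmake",
)


def find_ascendc_cmake_dir(ascend_home):
    """Return the first existing candidate directory under ascend_home."""
    for rel in CANDIDATES:
        path = os.path.join(ascend_home, rel)
        if os.path.exists(path):
            return path
    raise FileNotFoundError(
        "ascendc_kernel_cmake does not exist; check that CANN is installed"
    )
```

A check like this can help when debugging a failed build: if it raises, `ASCEND_HOME_PATH` is probably unset or points at an incomplete CANN installation.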
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -0,0 +1,3 @@
+# Contributing to vLLM Ascend
+
+You can find information about contributing to vLLM Ascend in [Developer Guide - Contributing](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html), including a step-by-step guide to help you set up a development environment, contribute your first PR, and test locally.
--- a/Dockerfile
+++ b/Dockerfile
@ -1,6 +1,5 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
-# This file is a part of the vllm-ascend project.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@ -13,35 +12,49 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+# This file is a part of the vllm-ascend project.
 #

-FROM quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
+FROM quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10

 ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1

 # Define environments
 ENV DEBIAN_FRONTEND=noninteractive
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}

 RUN apt-get update -y && \
-    apt-get install -y python3-pip git vim && \
+    apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
    rm -rf /var/cache/apt/* && \
    rm -rf /var/lib/apt/lists/*

 WORKDIR /workspace

-COPY . /workspace/vllm-ascend/
+COPY . /vllm-workspace/vllm-ascend/

 RUN pip config set global.index-url ${PIP_INDEX_URL}

-# Install vLLM main
+# Install vLLM
 ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
-RUN git clone --depth 1 $VLLM_REPO /workspace/vllm
-RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install /workspace/vllm/
+ARG VLLM_TAG=v0.9.2
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge

-# Install vllm-ascend main
-RUN python3 -m pip install /workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/
+# Install vllm-ascend
+# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge

-# Install modelscope
-RUN python3 -m pip install modelscope
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge

 CMD ["/bin/bash"]
--- a/Dockerfile.310p
+++ b/Dockerfile.310p
@ -0,0 +1,61 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+FROM quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
+
+ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1
+
+# Define environments
+ENV DEBIAN_FRONTEND=noninteractive
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+
+RUN apt-get update -y && \
+    apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
+    rm -rf /var/cache/apt/* && \
+    rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+COPY . /vllm-workspace/vllm-ascend/
+
+RUN pip config set global.index-url ${PIP_INDEX_URL}
+
+# Install vLLM
+ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
+ARG VLLM_TAG=v0.9.2
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge
+
+# Install vllm-ascend
+# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    export SOC_VERSION=ASCEND310P3 && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge
+
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge
+
+CMD ["/bin/bash"]
--- a/Dockerfile.310p.openEuler
+++ b/Dockerfile.310p.openEuler
@ -0,0 +1,58 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+FROM quay.io/ascend/cann:8.1.rc1-310p-openeuler22.03-py3.10
+
+ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1
+
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+
+RUN yum update -y && \
+    yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
+    rm -rf /var/cache/yum
+
+RUN pip config set global.index-url ${PIP_INDEX_URL}
+
+WORKDIR /workspace
+
+COPY . /vllm-workspace/vllm-ascend/
+
+# Install vLLM
+ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
+ARG VLLM_TAG=v0.9.2
+
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge
+
+# Install vllm-ascend
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    export SOC_VERSION=ASCEND310P3 && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge
+
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge
+
+CMD ["/bin/bash"]
--- a/Dockerfile.a3
+++ b/Dockerfile.a3
@ -0,0 +1,60 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+FROM quay.io/ascend/cann:8.1.rc1-a3-ubuntu22.04-py3.10
+
+ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1
+
+# Define environments
+ENV DEBIAN_FRONTEND=noninteractive
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+
+RUN apt-get update -y && \
+    apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
+    rm -rf /var/cache/apt/* && \
+    rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+COPY . /vllm-workspace/vllm-ascend/
+
+RUN pip config set global.index-url ${PIP_INDEX_URL}
+
+# Install vLLM
+ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
+ARG VLLM_TAG=v0.9.2
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge
+
+# Install vllm-ascend
+# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge
+
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge
+
+CMD ["/bin/bash"]
--- a/Dockerfile.a3.openEuler
+++ b/Dockerfile.a3.openEuler
@ -0,0 +1,57 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+FROM quay.io/ascend/cann:8.1.rc1-a3-openeuler22.03-py3.10
+
+ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1
+
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+
+RUN yum update -y && \
+    yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
+    rm -rf /var/cache/yum
+
+RUN pip config set global.index-url ${PIP_INDEX_URL}
+
+WORKDIR /workspace
+
+COPY . /vllm-workspace/vllm-ascend/
+
+# Install vLLM
+ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
+ARG VLLM_TAG=v0.9.2
+
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge
+
+# Install vllm-ascend
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge
+
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge
+
+CMD ["/bin/bash"]
--- a/Dockerfile.openEuler
+++ b/Dockerfile.openEuler
@ -0,0 +1,57 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+FROM quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.10
+
+ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
+ARG COMPILE_CUSTOM_KERNELS=1
+
+ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
+
+RUN yum update -y && \
+    yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
+    rm -rf /var/cache/yum
+
+RUN pip config set global.index-url ${PIP_INDEX_URL}
+
+WORKDIR /workspace
+
+COPY . /vllm-workspace/vllm-ascend/
+
+# Install vLLM
+ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
+ARG VLLM_TAG=v0.9.2
+
+RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
+# On x86, vLLM installs triton as a dependency, but triton doesn't work correctly on Ascend, so we uninstall it.
+RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip uninstall -y triton && \
+    python3 -m pip cache purge
+
+# Install vllm-ascend
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
+    python3 -m pip cache purge
+
+# Install modelscope (for fast download) and ray (for multinode)
+RUN python3 -m pip install modelscope ray && \
+    python3 -m pip cache purge
+
+CMD ["/bin/bash"]
--- a/README.md
+++ b/README.md
@ -10,7 +10,7 @@ vLLM Ascend Plugin
 </h3>

 <p align="center">
-| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack (#sig-ascend)</b></a> |
+| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>#sig-ascend</b></a> | <a href="https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support"><b>Users Forum</b></a> | <a href="https://tinyurl.com/vllm-ascend-meeting"><b>Weekly Meeting</b></a> |
 </p>

 <p align="center">
@ -20,79 +20,69 @@ vLLM Ascend Plugin
 ---
 *Latest News* 🔥

+- [2025/06] [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend helps Ascend users improve their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
+- [2025/06] [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded; thanks to all contributors.
+- [2025/05] We've released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
+- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with the vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
+- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
 - [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
 ---
 ## Overview

-vLLM Ascend plugin (`vllm-ascend`) is a backend plugin for running vLLM on the Ascend NPU.
+vLLM Ascend (`vllm-ascend`) is a community-maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.

-This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
+It is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.

 By using the vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts, Embedding, and Multi-modal LLMs, can run seamlessly on the Ascend NPU.

 ## Prerequisites

 - Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series
+- OS: Linux
 - Software:
-  * Python >= 3.9
-  * CANN >= 8.0.RC2
-  * PyTorch >= 2.4.0, torch-npu >= 2.4.0
+  * Python >= 3.9, < 3.12
+  * CANN >= 8.1.RC1
+  * PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
  * vLLM (the same version as vllm-ascend)

-Find more about how to setup your environment step by step in [here](docs/source/installation.md).
-
 ## Getting Started

-> [!NOTE]
-> Currently, we are actively collaborating with the vLLM community to support the Ascend backend plugin, once supported you can use one line command `pip install vllm vllm-ascend` to compelete installation.
+Please use the following recommended versions to get started quickly:

-Installation from source code:
-```bash
-# Install vllm main branch according:
-# https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#build-wheel-from-source
-git clone --depth 1 https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -r requirements-build.txt
-VLLM_TARGET_DEVICE=empty pip install .
-
-# Install vllm-ascend main branch
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-pip install -e .
-```
-
-Run the following command to start the vLLM server with the [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
-
-```bash
-# export VLLM_USE_MODELSCOPE=true to speed up download
-vllm serve Qwen/Qwen2.5-0.5B-Instruct
-curl http://localhost:8000/v1/models
-```
-
-Please refer to [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
+| Version    | Release type | Doc                                  |
+|------------|--------------|--------------------------------------|
+|v0.9.2rc1|Latest release candidate|See [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
+|v0.7.3.post1|Latest stable version|See [QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|

 ## Contributing
-See [CONTRIBUTING](docs/source/developer_guide/contributing.md) for more details, which is a step-by-step guide to help you set up development environment, build and test.
+See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) for more details. It is a step-by-step guide that helps you set up the development environment, build, and test.

 We welcome and value any contributions and collaborations:
-- Please feel free to comment [here](https://github.com/vllm-project/vllm-ascend/issues/19) about your usage of vLLM Ascend Plugin.
-- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
+- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
+- Please use the [User Forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.

 ## Branch

 vllm-ascend has a main branch and dev branches.

 - **main**: main branch, corresponds to the vLLM main branch, and is continuously monitored for quality through Ascend CI.
-- **vX.Y.Z-dev**: development branch, created with part of new releases of vLLM. For example, `v0.7.1-dev` is the dev branch for vLLM `v0.7.1` version.
+- **vX.Y.Z-dev**: development branch, created alongside selected new vLLM releases. For example, `v0.7.3-dev` is the dev branch for vLLM `v0.7.3`.

 The maintained branches are listed below:

 | Branch     | Status       | Note                                 |
 |------------|--------------|--------------------------------------|
-| main       | Maintained   | CI commitment for vLLM main branch   |
-| v0.7.3-dev | Maintained   | CI commitment for vLLM 0.7.3 version |
+| main       | Maintained   | CI commitment for vLLM main branch and vLLM 0.9.x branch   |
+| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
+| v0.7.3-dev | Maintained   | CI commitment for vLLM 0.7.3 version; only bug fixes are allowed, and no new release tags will be published. |
+| v0.9.1-dev | Maintained   | CI commitment for vLLM 0.9.1 version |

-Please refer to [Versioning policy](docs/source/developer_guide/versioning_policy.md) for more details.
+Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
+
+## Weekly Meeting
+
+- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
+- Wednesday, 15:00 - 16:00 (UTC+8, [Convert to your timezone](https://dateful.com/convert/gmt8?t=15))

 ## License

--- a/README.zh.md
+++ b/README.zh.md
@ -10,7 +10,7 @@ vLLM Ascend Plugin
 </h3>

 <p align="center">
-| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack (#sig-ascend)</b></a> |
+| <a href="https://www.hiascend.com/en/"><b>About Ascend</b></a> | <a href="https://vllm-ascend.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://slack.vllm.ai"><b>#sig-ascend</b></a> | <a href="https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support"><b>Users Forum</b></a> | <a href="https://tinyurl.com/vllm-ascend-meeting"><b>Weekly Meeting</b></a> |
 </p>

 <p align="center">
@ -20,11 +20,16 @@ vLLM Ascend Plugin
 ---
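 *Latest News* 🔥

+- [2025/06] [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It presents user stories such as LLaMA-Factory/verl/TRL/GPUStack, showing how vLLM Ascend helps Ascend users improve their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
+- [2025/06] [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded; thanks to all contributors.
+- [2025/05] We released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
+- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/CGDuMoB301Uytnrkc2oyjg) with the vLLM team! You can find the slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
+- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo so that vLLM can run seamlessly on the Ascend NPU.
 - [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
 ---
 ## Overview

-vLLM Ascend plugin (`vllm-ascend`) is a backend plugin that lets vLLM run seamlessly on the Ascend NPU.
+vLLM Ascend plugin (`vllm-ascend`) is a community-maintained backend plugin that lets vLLM run seamlessly on the Ascend NPU.

 This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It follows the principles described in [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162) and provides vLLM support for the Ascend NPU in a decoupled way.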

@ -33,67 +38,50 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个让vLLM在Ascend NPU无缝运行的
 ## Prerequisites

 - Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series
+- OS: Linux
 - Software:
-  * Python >= 3.9
-  * CANN >= 8.0.RC2
-  * PyTorch >= 2.4.0, torch-npu >= 2.4.0
+  * Python >= 3.9, < 3.12
+  * CANN >= 8.1.RC1
+  * PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
  * vLLM (the same version as vllm-ascend)

-See [here](docs/source/installation.md) for a step-by-step guide to setting up your environment.
-
 ## Getting Started

-> [!NOTE]
-> Currently, we are actively collaborating with the vLLM community to support the Ascend backend plugin. Once it is supported, you can complete the installation with a single command: `pip install vllm vllm-ascend`.
+We recommend the following versions to get started quickly:

-Installation from source code:
-```bash
-# Install the vllm main branch, see:
-# https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#build-wheel-from-source
-git clone --depth 1 https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -r requirements-build.txt
-VLLM_TARGET_DEVICE=empty pip install .
+| Version    | Release type | Doc                                  |
+|------------|--------------|--------------------------------------|
+|v0.9.2rc1|Latest release candidate|See [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
+|v0.7.3.post1|Latest official/stable version|See [QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|

-# Install the vllm-ascend main branch
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-pip install -e .
-```
+## Contributing
+See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) to learn more about setting up the development environment, functional testing, and PR submission conventions.

-Run the following command to start the server with the [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
-
-```bash
-# Set the environment variable VLLM_USE_MODELSCOPE=true to speed up downloads
-vllm serve Qwen/Qwen2.5-0.5B-Instruct
-curl http://localhost:8000/v1/models
-```
-
-See [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
-
-## Branch
+We welcome and value any contributions and collaborations:
+- Please let us know about any bug you encounter by filing an [issue](https://github.com/vllm-project/vllm-ascend/issues).
+- Please use the [user forum](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support) for usage questions and help.

+## Branch Policy
 vllm-ascend has a main branch and development branches.

 - **main**: main branch, corresponds to the vLLM main branch, and is continuously monitored for quality through Ascend CI.
-- **vX.Y.Z-dev**: development branch, created alongside selected new vLLM releases. For example, `v0.7.1-dev` is the vllm-ascend development branch for vLLM `v0.7.1`.
+- **vX.Y.Z-dev**: development branch, created alongside selected new vLLM releases. For example, `v0.7.3-dev` is the vllm-ascend development branch for vLLM `v0.7.3`.

 The maintained branches are listed below:

 | Branch     | Status     | Note                |
 |------------|------------|---------------------|
 | main       | Maintained | CI commitment for vLLM main branch |
-| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version |
+| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
+| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version; only bug fixes are allowed, and no new versions will be released |
+| v0.9.1-dev | Maintained | CI commitment for vLLM v0.9.1 version |

-Please refer to [Versioning policy](docs/source/developer_guide/versioning_policy.zh.md) for more details.
+Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.

-## Contributing
-See [CONTRIBUTING](docs/source/developer_guide/contributing.zh.md) for more details. It helps you set up the development environment, build, and test.
+## Weekly Meeting

-We welcome and value any contributions and collaborations:
-- You can share feedback about your usage [here](https://github.com/vllm-project/vllm-ascend/issues/19).
-- Please let us know about any bug you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
+- vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
+- Wednesday afternoon, 15:00 - 16:00 (UTC+8, [Convert to your timezone](https://dateful.com/convert/gmt8?t=15))

 ## License
-
-Apache License 2.0, as found in the [LICENSE](./LICENSE) file.
+Apache License 2.0, as found in the [LICENSE](./LICENSE) file.
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@ -0,0 +1,166 @@
+# Introduction
+This document outlines the benchmarking methodology for vllm-ascend, which evaluates performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
+
+# Overview
+**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas 800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (more coming soon).
+- Latency tests
+    - Input length: 32 tokens.
+    - Output length: 128 tokens.
+    - Batch size: fixed (8).
+    - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: end-to-end latency (mean, median, p99).
+
+- Throughput tests
+    - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+    - Output length: the corresponding output length of these 200 prompts.
+    - Batch size: dynamically determined by vllm to achieve maximum throughput.
+    - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: throughput.
+- Serving tests
+    - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+    - Output length: the corresponding output length of these 200 prompts.
+    - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
+    - **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
+    - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+    - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
+
+**Benchmarking Duration**: about 800 seconds for a single model.
+
+
+# Quick Use
+## Prerequisites
+Before running the benchmarks, ensure the following:
+
+- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
+
+- Install necessary dependencies for benchmarks:
+    ```
+    pip install -r benchmarks/requirements-bench.txt
+    ```
+    
+- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading the weights from the internet, which can greatly reduce the benchmark time.
+- To run a customized benchmark, add your own models and parameters to the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) test files. Take `Qwen2.5-VL-7B-Instruct` as an example:
+
+  ```json
+  [
+  {
+    "test_name": "serving_qwen2_5vl_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "trust_remote_code": "",
+      "max_model_len": 16384
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "backend": "openai-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "endpoint": "/v1/chat/completions",
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  }
+  ]
+  ```
+  The benchmark script parses this JSON into server parameters and client parameters. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. For details on more parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
+
+  - **Test Overview**
+     - Test Name: serving_qwen2_5vl_7B_tp1
+
+     - Queries Per Second (QPS): the test runs at four QPS levels: 1, 4, 16, and inf (all requests are sent at once, which stresses the server).
+
+  - **Server Parameters**
+     - Model: Qwen/Qwen2.5-VL-7B-Instruct
+
+     - Tensor Parallelism: 1 (the model is not sharded; it runs on a single device)
+
+     - Swap Space: 16 GiB of CPU swap space per device (host memory, not disk)
+
+     - disable_log_stats: disables logging of performance statistics.
+
+     - disable_log_requests: disables logging of individual requests.
+
+     - Trust Remote Code: enabled (allows execution of model-specific custom code)
+
+     - Max Model Length: 16,384 tokens (maximum context length the server is configured to accept)
+
+  - **Client Parameters**
+
+     - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
+
+     - Backend: openai-chat (the client uses the OpenAI-compatible chat API format)
+
+     - Dataset Source: Hugging Face (hf)
+
+     - Dataset Split: train
+
+     - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
+
+     - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
+
+     - Number of Prompts: 200 (the total number of prompts used during the test)
+
+
+
+## Run benchmarks
+
+### Use benchmark script
+The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command from the vllm-ascend root directory:
+```
+bash benchmarks/scripts/run-performance-benchmarks.sh
+```
+Once the script completes, you can find the results in the `benchmarks/results` folder. The output files may resemble the following:
+```
+.
+|-- serving_qwen2_5_7B_tp1_qps_1.json
+|-- serving_qwen2_5_7B_tp1_qps_16.json
+|-- serving_qwen2_5_7B_tp1_qps_4.json
+|-- serving_qwen2_5_7B_tp1_qps_inf.json
+|-- latency_qwen2_5_7B_tp1.json
+|-- throughput_qwen2_5_7B_tp1.json
+```
+These files contain detailed benchmarking results for further analysis.
+
+### Use benchmark cli
+
+For more flexible and customized use, a benchmark CLI is also provided to run online and offline benchmarks.
+As before, take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
+#### Online serving
+1. Launch the server:
+   ```shell
+   vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
+   ```
+2. Run performance tests using the CLI:
+   ```shell
+    vllm bench serve --model Qwen2.5-VL-7B-Instruct \
+    --endpoint-type "openai-chat" --dataset-name hf \
+    --hf-split train --endpoint "/v1/chat/completions" \
+    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
+    --num-prompts 200 \
+    --request-rate 16
+   ```
+
+#### Offline
+- **Throughput**
+    ```shell
+    vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
+    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
+    --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
+    --num-prompts 200 --backend vllm
+    ```
+- **Latency**
+    ```shell
+    vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
+    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
+    --load-format dummy --num-iters-warmup 5 --num-iters 15
+    ```
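The JSON test cases described in this README map naturally onto command-line flags. Below is a minimal Python sketch of one way such a mapping could work, assuming that underscores in keys become dashes and that an empty-string value marks a boolean switch. `params_to_cli_args` is a hypothetical helper written for illustration; it is not taken from the benchmark scripts, which may do this differently.

```python
import shlex


def params_to_cli_args(params):
    """Turn a parameter mapping into a flat list of CLI arguments.

    Assumption (for illustration only): "tensor_parallel_size": 1 becomes
    "--tensor-parallel-size 1", and an empty-string value such as
    "trust_remote_code": "" becomes the bare switch "--trust-remote-code".
    """
    args = []
    for key, value in params.items():
        args.append("--" + key.replace("_", "-"))
        if value != "":
            args.append(str(value))
    return args


# Server parameters copied from the example test case in this README.
server_parameters = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "tensor_parallel_size": 1,
    "swap_space": 16,
    "disable_log_stats": "",
    "disable_log_requests": "",
    "trust_remote_code": "",
    "max_model_len": 16384,
}

print(shlex.join(params_to_cli_args(server_parameters)))
```

Keeping the test matrix in JSON and deriving flags mechanically means a new model or request rate can be added without editing the runner script.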
--- a/benchmarks/ops/ben_vocabparallelembedding.py
+++ b/benchmarks/ops/ben_vocabparallelembedding.py
@ -0,0 +1,158 @@
+from typing import Tuple
+
+import numpy as np
+import pytest
+import torch
+import torch_npu  # noqa: F401
+import vllm  # noqa: F401
+
+import vllm_ascend.platform  # noqa: F401
+
+
+def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
+    """
+    Benchmark function for NPU operations
+
+    Args:
+        fn: Function to benchmark
+        num_iterations: Number of timing iterations
+        num_warmup_iterations: Number of warmup iterations
+
+    Returns:
+        float: Minimum elapsed time in seconds
+    """
+    start = torch.npu.Event(enable_timing=True)
+    end = torch.npu.Event(enable_timing=True)
+    times = np.zeros(num_iterations + num_warmup_iterations)
+
+    # Run iterations
+    for i in range(num_warmup_iterations + num_iterations):
+        with torch.no_grad():
+            start.record()
+            fn()  # Execute the function
+            end.record()
+        torch.npu.synchronize()
+        times[i] = start.elapsed_time(end)
+
+    # Remove warmup iterations and convert to seconds
+    times = times[num_warmup_iterations:]
+    elapsed_time = np.amin(times) / 1000
+    return elapsed_time
+
+
+def get_masked_input_and_mask_ref(
+    input_: torch.Tensor,
+    org_vocab_start_index: int,
+    org_vocab_end_index: int,
+    num_org_vocab_padding: int,
+    added_vocab_start_index: int,
+    added_vocab_end_index: int,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Reference implementation for verification"""
+    org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
+    added_vocab_mask = (input_ >= added_vocab_start_index) & (
+        input_ < added_vocab_end_index
+    )
+    added_offset = (
+        added_vocab_start_index
+        - (org_vocab_end_index - org_vocab_start_index)
+        - num_org_vocab_padding
+    )
+    valid_offset = (org_vocab_start_index * org_vocab_mask) + (
+        added_offset * added_vocab_mask
+    )
+    vocab_mask = org_vocab_mask | added_vocab_mask
+    masked_input = vocab_mask * (input_ - valid_offset)
+    return masked_input, ~vocab_mask
+
+
+DTYPES = [torch.int32]
+SHAPES = [(3, 4, 5)]
+DEVICES = ["npu:0"]
+SEEDS = [0]
+
+
+@pytest.mark.parametrize("shape", SHAPES)
+@pytest.mark.parametrize("dtype", DTYPES)
+@pytest.mark.parametrize("device", DEVICES)
+@pytest.mark.parametrize("seed", SEEDS)
+@torch.inference_mode()
+def test_get_masked_input_and_mask(
+    shape: Tuple[int, ...],
+    dtype: torch.dtype,
+    device: str,
+    seed: int,
+) -> None:
+    # Set random seed and device
+    torch.manual_seed(seed)
+    torch.set_default_device(device)
+
+    # Generate random input tensor
+    input_tensor = torch.randint(0, 1000, shape, dtype=dtype)
+
+    # Test parameters
+    test_case = {
+        "org_start": 100,
+        "org_end": 200,
+        "padding": 0,
+        "added_start": 300,
+        "added_end": 400,
+    }
+
+    # Define reference function
+    def ref_fn():
+        return get_masked_input_and_mask_ref(
+            input_tensor,
+            test_case["org_start"],
+            test_case["org_end"],
+            test_case["padding"],
+            test_case["added_start"],
+            test_case["added_end"],
+        )
+
+    # Define custom function
+    def custom_fn():
+        return torch.ops._C.get_masked_input_and_mask(
+            input_tensor,
+            test_case["org_start"],
+            test_case["org_end"],
+            test_case["padding"],
+            test_case["added_start"],
+            test_case["added_end"],
+        )
+
+    # Get results for correctness testing
+    ref_masked_input, ref_mask = ref_fn()
+    custom_masked_input, custom_mask = custom_fn()
+
+    # Benchmark both implementations
+    ref_time = benchmark_npu(ref_fn)
+    custom_time = benchmark_npu(custom_fn)
+
+    # Print performance results
+    print("\nPerformance Results:")
+    print(f"Reference implementation: {ref_time * 1000:.3f} ms")
+    print(f"Custom implementation: {custom_time * 1000:.3f} ms")
+    print(f"Speedup: {ref_time / custom_time:.2f}x")
+
+    # Compare results for correctness
+    ref_masked_input = ref_masked_input.to(dtype)
+    print("\nResults comparison:")
+    print("custom_masked_input:", custom_masked_input)
+    print("ref_masked_input:", ref_masked_input)
+    print("custom_mask:", custom_mask)
+    print("ref_mask:", ref_mask)
+    torch.testing.assert_close(
+        custom_masked_input,
+        ref_masked_input,
+        rtol=1e-5,
+        atol=1e-5,
+        msg=f"Masked input mismatch for case: {test_case}",
+    )
+    torch.testing.assert_close(
+        custom_mask,
+        ref_mask,
+        rtol=1e-5,
+        atol=1e-5,
+        msg=f"Mask mismatch for case: {test_case}",
+    )
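The reference function in this file maps token ids that fall in the original or added vocabulary range to shard-local indices and masks out everything else. As a rough illustration, the same arithmetic can be traced on single integers in plain Python, using the ranges from the test case (original `[100, 200)`, added `[300, 400)`, no padding). `mask_token` is a hypothetical scalar helper, not part of the PR.

```python
def mask_token(token, org_start, org_end, padding, added_start, added_end):
    """Scalar version of the tensor reference: returns (local_index, is_masked)."""
    in_org = org_start <= token < org_end
    in_added = added_start <= token < added_end
    # Added-vocab ids are placed right after the (padded) original shard.
    added_offset = added_start - (org_end - org_start) - padding
    if in_org:
        return token - org_start, False
    if in_added:
        return token - added_offset, False
    # Out-of-range ids map to index 0 and are flagged as masked.
    return 0, True


for token in (150, 350, 250):
    print(token, mask_token(token, 100, 200, 0, 300, 400))
# 150 -> (50, False): offset into the original shard
# 350 -> (150, False): 100 original slots, then offset 50 into the added range
# 250 -> (0, True): belongs to neither range
```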
--- a/benchmarks/requirements-bench.txt
+++ b/benchmarks/requirements-bench.txt
@ -0,0 +1,4 @@
+pandas
+datasets
+modelscope
+tabulate
--- a/benchmarks/scripts/convert_json_to_markdown.py
+++ b/benchmarks/scripts/convert_json_to_markdown.py
@ -0,0 +1,188 @@
+import argparse
+import json
+import os
+from pathlib import Path
+
+import pandas as pd
+from tabulate import tabulate
+
+CUR_PATH = Path(__file__).parent.resolve()
+# latency results and the keys that will be printed into markdown
+latency_results = []
+latency_column_mapping = {
+    "test_name": "Test name",
+    "avg_latency": "Mean latency (ms)",
+    "P50": "Median latency (ms)",
+    "P99": "P99 latency (ms)",
+}
+
+# throughput tests and the keys that will be printed into markdown
+throughput_results = []
+throughput_results_column_mapping = {
+    "test_name": "Test name",
+    "num_requests": "Num of reqs",
+    "total_num_tokens": "Total num of tokens",
+    "elapsed_time": "Elapsed time (s)",
+    "requests_per_second": "Tput (req/s)",
+    "tokens_per_second": "Tput (tok/s)",
+}
+
+# serving results and the keys that will be printed into markdown
+serving_results = []
+serving_column_mapping = {
+    "test_name": "Test name",
+    "request_rate": "Request rate (req/s)",
+    "request_throughput": "Tput (req/s)",
+    "output_throughput": "Output Tput (tok/s)",
+    "median_ttft_ms": "TTFT (ms)",
+    "median_tpot_ms": "TPOT (ms)",
+    "median_itl_ms": "ITL (ms)",
+}
+
+
+def read_markdown(file):
+    if os.path.exists(file):
+        with open(file) as f:
+            return f.read() + "\n"
+    else:
+        return f"{file} not found.\n"
+
+
+def results_to_json(latency, throughput, serving):
+    return json.dumps(
+        {
+            "latency": latency.to_dict(),
+            "throughput": throughput.to_dict(),
+            "serving": serving.to_dict(),
+        }
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Process the results of the benchmark tests."
+    )
+    parser.add_argument(
+        "--results_folder",
+        type=str,
+        default="../results/",
+        help="The folder where the benchmark results are stored.",
+    )
+    parser.add_argument(
+        "--output_folder",
+        type=str,
+        default="../results/",
+        help="The folder where the markdown report is written.",
+    )
+    parser.add_argument(
+        "--markdown_template",
+        type=str,
+        default="./perf_result_template.md",
+        help="The template file for the markdown report.",
+    )
+    parser.add_argument(
+        "--tag", default="main", help="Tag to be used for release message."
+    )
+    parser.add_argument(
+        "--commit_id", default="", help="Commit ID to be used for release message."
+    )
+
+    args = parser.parse_args()
+    results_folder = (CUR_PATH / args.results_folder).resolve()
+    output_folder = (CUR_PATH / args.output_folder).resolve()
+    markdown_template = (CUR_PATH / args.markdown_template).resolve()
+
+    # collect results
+    for test_file in results_folder.glob("*.json"):
+        with open(test_file) as f:
+            raw_result = json.loads(f.read())
+
+        if "serving" in test_file.name:
+            # this result is generated via `benchmark_serving.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # add the result to serving_results
+            serving_results.append(raw_result)
+            continue
+
+        elif "latency" in test_file.name:
+            # this result is generated via `benchmark_latency.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # get different percentiles
+            for perc in [10, 25, 50, 75, 90, 99]:
+                # Multiply by 1000 to convert the time unit from s to ms
+                raw_result.update(
+                    {f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
+                )
+            raw_result["avg_latency"] = raw_result["avg_latency"] * 1000
+
+            # add the result to latency_results
+            latency_results.append(raw_result)
+            continue
+
+        elif "throughput" in test_file.name:
+            # this result is generated via `benchmark_throughput.py`
+
+            # update the test name of this result
+            raw_result.update({"test_name": test_file.stem})
+
+            # add the result to throughput_results
+            throughput_results.append(raw_result)
+            continue
+
+        print(f"Skipping {test_file}")
+    serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))
+
+    latency_results = pd.DataFrame.from_dict(latency_results)
+    serving_results = pd.DataFrame.from_dict(serving_results)
+    throughput_results = pd.DataFrame.from_dict(throughput_results)
+
+    raw_results_json = results_to_json(
+        latency_results, throughput_results, serving_results
+    )
+
+    # remapping the key, for visualization purpose
+    if not latency_results.empty:
+        latency_results = latency_results[list(latency_column_mapping.keys())].rename(
+            columns=latency_column_mapping
+        )
+    if not serving_results.empty:
+        serving_results = serving_results[list(serving_column_mapping.keys())].rename(
+            columns=serving_column_mapping
+        )
+    if not throughput_results.empty:
+        throughput_results = throughput_results[
+            list(throughput_results_column_mapping.keys())
+        ].rename(columns=throughput_results_column_mapping)
+
+    processed_results_json = results_to_json(
+        latency_results, throughput_results, serving_results
+    )
+
+    # get markdown tables
+    latency_md_table = tabulate(
+        latency_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+    serving_md_table = tabulate(
+        serving_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+    throughput_md_table = tabulate(
+        throughput_results, headers="keys", tablefmt="pipe", showindex=False
+    )
+
+    # document the result
+    print(output_folder)
+    with open(output_folder / "benchmark_results.md", "w") as f:
+        results = read_markdown(markdown_template)
+        results = results.format(
+            latency_tests_markdown_table=latency_md_table,
+            throughput_tests_markdown_table=throughput_md_table,
+            serving_tests_markdown_table=serving_md_table,
+            benchmarking_results_in_json_string=processed_results_json,
+        )
+        f.write(results)
--- a/benchmarks/scripts/perf_result_template.md
+++ b/benchmarks/scripts/perf_result_template.md
@ -0,0 +1,31 @@
+## Online serving tests
+
+- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with a fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
+- **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
+- Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), TPOT (median time per output token).
+
+{serving_tests_markdown_table}
+
+## Offline tests
+### Latency tests
+
+- Input length: 32 tokens.
+- Output length: 128 tokens.
+- Batch size: fixed (8).
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-7B-Instruct
+- Evaluation metrics: end-to-end latency.
+
+{latency_tests_markdown_table}
+
+### Throughput tests
+
+- Input length: randomly sample 200 prompts from the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main) (multi-modal) datasets (with a fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vLLM to achieve maximum throughput.
+- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
+- Evaluation metrics: throughput.
+
+{throughput_tests_markdown_table}
--- a/benchmarks/scripts/run-performance-benchmarks.sh
+++ b/benchmarks/scripts/run-performance-benchmarks.sh
@ -0,0 +1,321 @@
+#!/bin/bash
+set -e
+
+check_npus() {
+  # shellcheck disable=SC2155
+  declare -g npu_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | tr -d ' ')
+  
+  if [[ -z "$npu_count" || "$npu_count" -eq 0 ]]; then
+    echo "Need at least 1 NPU to run benchmarking."
+    exit 1
+  else
+    echo "Found NPU count: $npu_count"
+  fi
+
+  npu_type=$(npu-smi info | grep -E "^\| [0-9]+" | awk -F '|' '{print $2}' | awk '{$1=$1;print}' | awk '{print $2}')
+
+  echo "NPU type is: $npu_type"
+}
+
+ensure_sharegpt_downloaded() {
+  local FILE="/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
+  local DIR
+  DIR=$(dirname "$FILE")
+
+  if [ ! -f "$FILE" ]; then
+    echo "$FILE not found, downloading from hf-mirror ..."
+    mkdir -p "$DIR"
+    # Test wget directly: under `set -e` a separate `$?` check would never run.
+    if ! wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json; then
+      echo "Download failed!" >&2
+      return 1
+    fi
+    echo "Download completed and saved to $FILE"
+  else
+    echo "$FILE already exists."
+  fi
+}
+
+json2args() {
+  # transforms the JSON string into command line args, replacing '_' with '-'
+  # example:
+  # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
+  # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
+  local json_string=$1
+  local args
+  args=$(
+    echo "$json_string" | jq -r '
+      to_entries |
+      map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
+      join(" ")
+    '
+  )
+  echo "$args"
+}
+
+wait_for_server() {
+  local waited=0
+  local timeout_sec=1200
+
+  while (( waited < timeout_sec )); do
+    if curl -s -X GET localhost:8000/health > /dev/null; then
+      return 0
+    fi
+    echo "Waiting for vllm server to start..."
+    sleep 1
+    ((waited++))
+  done
+
+  echo "Timeout waiting for server"
+  return 1
+}
+
+get_cur_npu_id() {
+    npu-smi info -l | awk -F ':' '/NPU ID/ {print $2+0; exit}'
+}
+
+kill_npu_processes() {
+  ps -aux
+  lsof -t -i:8000 | xargs -r kill -9
+  pgrep python3 | xargs -r kill -9
+  
+  sleep 4
+  rm -rf ~/.config/vllm
+
+}
+
+update_json_field() {
+  local json_file="$1"
+  local field_name="$2"
+  local field_value="$3"
+
+  jq --arg value "$field_value" \
+     --arg key "$field_name" \
+     '.[$key] = $value' "$json_file" > "${json_file}.tmp" && \
+     mv "${json_file}.tmp" "$json_file"
+}
+
+run_latency_tests() {
+  # run latency tests using `vllm bench latency`
+  # $1: a json file specifying latency test cases
+
+  local latency_test_file
+  latency_test_file=$1
+
+  # Iterate over latency tests
+  jq -c '.[]' "$latency_test_file" | while read -r params; do
+    # get the test name and check that it has the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^latency_ ]]; then
+      echo "In latency-test.json, test_name must start with \"latency_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get arguments
+    latency_params=$(echo "$params" | jq -r '.parameters')
+    latency_args=$(json2args "$latency_params")
+
+    latency_command="vllm bench latency \
+      --output-json $RESULTS_FOLDER/${test_name}.json \
+      $latency_args"
+
+    echo "Running test case $test_name"
+    echo "Latency command: $latency_command"
+
+    # run the benchmark
+    eval "$latency_command"
+    # record model_name in the result file
+    model_name=$(echo "$latency_params" | jq -r '.model')
+    update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
+    kill_npu_processes
+
+  done
+}
+
+run_throughput_tests() {
+  # run throughput tests using `vllm bench throughput`
+  # $1: a json file specifying throughput test cases
+
+  local throughput_test_file
+  throughput_test_file=$1
+
+  # Iterate over throughput tests
+  jq -c '.[]' "$throughput_test_file" | while read -r params; do
+    # get the test name and check that it has the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^throughput_ ]]; then
+      echo "In throughput-test.json, test_name must start with \"throughput_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get arguments
+    throughput_params=$(echo "$params" | jq -r '.parameters')
+    throughput_args=$(json2args "$throughput_params")
+
+    throughput_command="vllm bench throughput \
+      --output-json $RESULTS_FOLDER/${test_name}.json \
+      $throughput_args"
+
+    echo "Running test case $test_name"
+    echo "Throughput command: $throughput_command"
+
+    # run the benchmark
+    eval "$throughput_command"
+    # record model_name in the result file
+    model_name=$(echo "$throughput_params" | jq -r '.model')
+    update_json_field "$RESULTS_FOLDER/${test_name}.json" "model_name" "$model_name"
+    kill_npu_processes
+
+  done
+}
+
+run_serving_tests() {
+  # run serving tests using `vllm bench serve`
+  # $1: a json file specifying serving test cases
+
+  local serving_test_file
+  serving_test_file=$1
+
+  # Iterate over serving tests
+  jq -c '.[]' "$serving_test_file" | while read -r params; do
+    # get the test name and check that it has the expected prefix
+    test_name=$(echo "$params" | jq -r '.test_name')
+    if [[ ! "$test_name" =~ ^serving_ ]]; then
+      echo "In serving-test.json, test_name must start with \"serving_\"."
+      exit 1
+    fi
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # get client and server arguments
+    server_params=$(echo "$params" | jq -r '.server_parameters')
+    client_params=$(echo "$params" | jq -r '.client_parameters')
+    server_args=$(json2args "$server_params")
+    client_args=$(json2args "$client_params")
+    qps_list=$(echo "$params" | jq -r '.qps_list')
+    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
+    echo "Running over qps list $qps_list"
+
+    # check that the server model and client model are the same
+    server_model=$(echo "$server_params" | jq -r '.model')
+    client_model=$(echo "$client_params" | jq -r '.model')
+    if [[ $server_model != "$client_model" ]]; then
+      echo "Server model and client model must be the same. Skip testcase $test_name."
+      continue
+    fi
+
+    server_command="python3 \
+      -m vllm.entrypoints.openai.api_server \
+      $server_args"
+
+    # run the server
+    echo "Running test case $test_name"
+    echo "Server command: $server_command"
+    bash -c "$server_command" &
+    server_pid=$!
+
+    # wait until the server is alive
+    if wait_for_server; then
+      echo ""
+      echo "vllm server is up and running."
+    else
+      echo ""
+      echo "vllm failed to start within the timeout period."
+    fi
+
+    # iterate over different QPS
+    for qps in $qps_list; do
+      # remove the surrounding single quote from qps
+      if [[ "$qps" == *"inf"* ]]; then
+        echo "qps was $qps"
+        qps="inf"
+        echo "now qps is $qps"
+      fi
+
+      new_test_name=$test_name"_qps_"$qps
+
+      client_command="vllm bench serve \
+        --save-result \
+        --result-dir $RESULTS_FOLDER \
+        --result-filename ${new_test_name}.json \
+        --request-rate $qps \
+        $client_args"
+
+      echo "Running test case $test_name with qps $qps"
+      echo "Client command: $client_command"
+
+      bash -c "$client_command"
+    done
+
+    # clean up
+    kill -9 $server_pid
+    kill_npu_processes
+  done
+}
+
+cleanup() {
+  rm -rf ./vllm_benchmarks
+}
+
+cleanup_on_error() {
+  echo "An error occurred. Cleaning up results folder..."
+  rm -rf "$RESULTS_FOLDER"
+}
+
+main() {
+  START_TIME=$(date +%s)
+  check_npus
+  
+  # dependencies
+  (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
+  (which jq) || (apt-get update && apt-get -y install jq)
+  (which lsof) || (apt-get update && apt-get install -y lsof)
+
+  # get the current IP address, required by benchmark_serving.py
+  # shellcheck disable=SC2155
+  export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
+  # turn off per-request status reporting to keep the terminal output clean
+  export VLLM_LOG_LEVEL="WARNING"
+  
+  # set env
+  export VLLM_USE_MODELSCOPE=True
+
+  # prepare for benchmarking
+  cd benchmarks || exit 1
+  trap cleanup EXIT
+
+  QUICK_BENCHMARK_ROOT=./
+
+  declare -g RESULTS_FOLDER=results
+  mkdir -p $RESULTS_FOLDER
+
+  trap cleanup_on_error ERR
+  ensure_sharegpt_downloaded
+  # benchmarks
+  run_serving_tests $QUICK_BENCHMARK_ROOT/tests/serving-tests.json
+  run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
+  run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json
+
+  END_TIME=$(date +%s)
+  ELAPSED_TIME=$((END_TIME - START_TIME))
+  echo "Total execution time: $ELAPSED_TIME seconds"
+
+}
+
+main "$@"
--- a/benchmarks/scripts/run_accuracy.py
+++ b/benchmarks/scripts/run_accuracy.py
@ -0,0 +1,313 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+# Copyright 2023 The vLLM team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+import argparse
+import gc
+import json
+import multiprocessing
+import sys
+import time
+from multiprocessing import Queue
+
+import lm_eval
+import torch
+
+# URLs for version information in Markdown report
+VLLM_URL = "https://github.com/vllm-project/vllm/commit/"
+VLLM_ASCEND_URL = "https://github.com/vllm-project/vllm-ascend/commit/"
+
+# Model and task configurations
+UNIMODAL_MODEL_NAME = ["Qwen/Qwen3-8B-Base", "Qwen/Qwen3-30B-A3B"]
+UNIMODAL_TASK = ["ceval-valid", "gsm8k"]
+MULTIMODAL_NAME = ["Qwen/Qwen2.5-VL-7B-Instruct"]
+MULTIMODAL_TASK = ["mmmu_val"]
+
+# Batch size configurations per task
+BATCH_SIZE = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
+
+# Model type mapping (vllm for text, vllm-vlm for vision-language)
+MODEL_TYPE = {
+    "Qwen/Qwen3-8B-Base": "vllm",
+    "Qwen/Qwen3-30B-A3B": "vllm",
+    "Qwen/Qwen2.5-VL-7B-Instruct": "vllm-vlm",
+}
+
+# Command templates for running evaluations
+MODEL_RUN_INFO = {
+    "Qwen/Qwen3-30B-A3B": (
+        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
+        "lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \\\n"
+        "--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
+    ),
+    "Qwen/Qwen3-8B-Base": (
+        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
+        "lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \\\n"
+        "--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
+    ),
+    "Qwen/Qwen2.5-VL-7B-Instruct": (
+        "export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2'\n"
+        "lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \\\n"
+        "--apply_chat_template --fewshot_as_multiturn  --batch_size 1"
+    ),
+}
+
+# Evaluation metric filters per task
+FILTER = {
+    "gsm8k": "exact_match,flexible-extract",
+    "ceval-valid": "acc,none",
+    "mmmu_val": "acc,none",
+}
+
+# Expected accuracy values for models
+EXPECTED_VALUE = {
+    "Qwen/Qwen3-30B-A3B": {"ceval-valid": 0.83, "gsm8k": 0.85},
+    "Qwen/Qwen3-8B-Base": {"ceval-valid": 0.82, "gsm8k": 0.83},
+    "Qwen/Qwen2.5-VL-7B-Instruct": {"mmmu_val": 0.51},
+}
+PARALLEL_MODE = {
+    "Qwen/Qwen3-8B-Base": "TP",
+    "Qwen/Qwen2.5-VL-7B-Instruct": "TP",
+    "Qwen/Qwen3-30B-A3B": "EP",
+}
+
+# Execution backend configuration
+EXECUTION_MODE = {
+    "Qwen/Qwen3-8B-Base": "ACLGraph",
+    "Qwen/Qwen2.5-VL-7B-Instruct": "ACLGraph",
+    "Qwen/Qwen3-30B-A3B": "ACLGraph",
+}
+
+# Model arguments for evaluation
+MODEL_ARGS = {
+    "Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6",
+    "Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2",
+    "Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True",
+}
+
+# Whether to apply chat template formatting
+APPLY_CHAT_TEMPLATE = {
+    "Qwen/Qwen3-8B-Base": True,
+    "Qwen/Qwen2.5-VL-7B-Instruct": True,
+    "Qwen/Qwen3-30B-A3B": False,
+}
+# Few-shot examples handling as multi-turn dialogues.
+FEWSHOT_AS_MULTITURN = {
+    "Qwen/Qwen3-8B-Base": True,
+    "Qwen/Qwen2.5-VL-7B-Instruct": True,
+    "Qwen/Qwen3-30B-A3B": False,
+}
+
+# Tolerance for accuracy checks (applied as an absolute +/- band around the expected value)
+RTOL = 0.03
+ACCURACY_FLAG = {}
+
+
+def run_accuracy_test(queue, model, dataset):
+    """Run accuracy evaluation for a model on a dataset in a separate process."""
+    try:
+        eval_params = {
+            "model": MODEL_TYPE[model],
+            "model_args": MODEL_ARGS[model],
+            "tasks": dataset,
+            "apply_chat_template": APPLY_CHAT_TEMPLATE[model],
+            "fewshot_as_multiturn": FEWSHOT_AS_MULTITURN[model],
+            "batch_size": BATCH_SIZE[dataset],
+        }
+
+        if MODEL_TYPE[model] == "vllm":
+            eval_params["num_fewshot"] = 5
+
+        results = lm_eval.simple_evaluate(**eval_params)
+        print(f"Success: {model} on {dataset} ")
+        measured_value = results["results"]
+        queue.put(measured_value)
+    except Exception as e:
+        print(f"Error in run_accuracy_test: {e}")
+        queue.put(e)
+        sys.exit(1)
+    finally:
+        if "results" in locals():
+            del results
+        gc.collect()
+        torch.npu.empty_cache()
+        time.sleep(5)
+
+
+def generate_md(model_name, tasks_list, args, datasets):
+    """Generate Markdown report with evaluation results"""
+    # Format the run command
+    run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name, datasets=datasets)
+    model = model_name.split("/")[1]
+
+    # Version information section
+    version_info = (
+        f"**vLLM Version**: vLLM: {args.vllm_version} "
+        f"([{args.vllm_commit}]({VLLM_URL + args.vllm_commit})), "
+        f"vLLM Ascend: {args.vllm_ascend_version} "
+        f"([{args.vllm_ascend_commit}]({VLLM_ASCEND_URL + args.vllm_ascend_commit}))  "
+    )
+
+    # Report header with system info
+    preamble = f"""# {model}
+{version_info}
+**Software Environment**: CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version}  
+**Hardware Environment**: Atlas A2 Series  
+**Datasets**: {datasets}  
+**Parallel Mode**: {PARALLEL_MODE[model_name]}  
+**Execution Mode**: {EXECUTION_MODE[model_name]}  
+**Command**:  
+```bash
+{run_cmd}
+```
+  """
+
+    header = (
+        "| Task                  | Filter | n-shot | Metric   | Value   | Stderr |\n"
+        "|-----------------------|-------:|-------:|----------|--------:|-------:|"
+    )
+    rows = []
+    rows_sub = []
+    # Process results for each task
+    for task_dict in tasks_list:
+        for key, stats in task_dict.items():
+            alias = stats.get("alias", key)
+            task_name = alias.strip()
+            if "exact_match,flexible-extract" in stats:
+                metric_key = "exact_match,flexible-extract"
+            else:
+                metric_key = None
+                for k in stats:
+                    if "," in k and not k.startswith("acc_stderr"):
+                        metric_key = k
+                        break
+            if metric_key is None:
+                continue
+            metric, flt = metric_key.split(",", 1)
+
+            value = stats[metric_key]
+            stderr = stats.get(f"{metric}_stderr,{flt}", 0)
+            if model_name in UNIMODAL_MODEL_NAME:
+                n_shot = "5"
+            else:
+                n_shot = "0"
+            flag = ACCURACY_FLAG.get(task_name, "")
+            row = (
+                f"| {task_name:<37} "
+                f"| {flt:<6} "
+                f"| {n_shot:6} "
+                f"| {metric:<6} "
+                f"| {flag}{value:>5.4f} "
+                f"| ± {stderr:>5.4f} |"
+            )
+            if not task_name.startswith("-"):
+                rows.append(row)
+                rows_sub.append(
+                    "<details>"
+                    + "\n"
+                    + "<summary>"
+                    + task_name
+                    + " details"
+                    + "</summary>"
+                    + "\n" * 2
+                    + header
+                )
+            rows_sub.append(row)
+        rows_sub.append("</details>")
+    # Combine all Markdown sections
+    md = (
+        preamble
+        + "\n"
+        + header
+        + "\n"
+        + "\n".join(rows)
+        + "\n"
+        + "\n".join(rows_sub)
+        + "\n"
+    )
+    print(md)
+    return md
+
+
+def safe_md(args, accuracy, datasets):
+    """
+    Safely generate and save Markdown report from accuracy results.
+    """
+    data = json.loads(json.dumps(accuracy))
+    for model_key, tasks_list in data.items():
+        md_content = generate_md(model_key, tasks_list, args, datasets)
+        with open(args.output, "w", encoding="utf-8") as f:
+            f.write(md_content)
+        print(f"Created Markdown file: {args.output}")
+
+
+def main(args):
+    """Main evaluation workflow"""
+    accuracy = {}
+    accuracy[args.model] = []
+    result_queue: Queue[float] = multiprocessing.Queue()
+    if args.model in UNIMODAL_MODEL_NAME:
+        datasets = UNIMODAL_TASK
+    else:
+        datasets = MULTIMODAL_TASK
+    datasets_str = ",".join(datasets)
+    # Evaluate model on each dataset
+    for dataset in datasets:
+        accuracy_expected = EXPECTED_VALUE[args.model][dataset]
+        p = multiprocessing.Process(
+            target=run_accuracy_test, args=(result_queue, args.model, dataset)
+        )
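+        p.start()
+        # Read the result before joining: joining a child process that still
+        # holds unflushed queue data can deadlock.
+        result = result_queue.get()
+        p.join()
+        gc.collect()
+        torch.npu.empty_cache()
+        time.sleep(10)
+        # The worker puts the exception on the queue when evaluation fails.
+        if isinstance(result, Exception):
+            raise result
+        print(result)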
+        if (
+            accuracy_expected - RTOL
+            < result[dataset][FILTER[dataset]]
+            < accuracy_expected + RTOL
+        ):
+            ACCURACY_FLAG[dataset] = "✅"
+        else:
+            ACCURACY_FLAG[dataset] = "❌"
+        accuracy[args.model].append(result)
+    print(accuracy)
+    safe_md(args, accuracy, datasets_str)
+
+
+if __name__ == "__main__":
+    multiprocessing.set_start_method("spawn", force=True)
+    # Initialize argument parser
+    parser = argparse.ArgumentParser(
+        description="Run model accuracy evaluation and generate report"
+    )
+    parser.add_argument("--output", type=str, required=True)
+    parser.add_argument("--model", type=str, required=True)
+    parser.add_argument("--vllm_ascend_version", type=str, required=False)
+    parser.add_argument("--torch_version", type=str, required=False)
+    parser.add_argument("--torch_npu_version", type=str, required=False)
+    parser.add_argument("--vllm_version", type=str, required=False)
+    parser.add_argument("--cann_version", type=str, required=False)
+    parser.add_argument("--vllm_commit", type=str, required=False)
+    parser.add_argument("--vllm_ascend_commit", type=str, required=False)
+    args = parser.parse_args()
+    main(args)
--- a/benchmarks/tests/latency-tests.json
+++ b/benchmarks/tests/latency-tests.json
@ -0,0 +1,23 @@
+[
+  {
+    "test_name": "latency_qwen3_8B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "max_model_len": 16384,
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  },
+  {
+    "test_name": "latency_qwen2_5_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
--- a/benchmarks/tests/serving-tests.json
+++ b/benchmarks/tests/serving-tests.json
@ -0,0 +1,77 @@
+[
+  {
+    "test_name": "serving_qwen2_5vl_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "trust_remote_code": "",
+      "max_model_len": 16384
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "endpoint_type": "openai-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "endpoint": "/v1/chat/completions",
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "serving_qwen3_8B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "load_format": "dummy"
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "endpoint_type": "vllm",
+      "dataset_name": "sharegpt",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "serving_qwen2_5_7B_tp1",
+    "qps_list": [
+      1,
+      4,
+      16,
+      "inf"
+    ],
+    "server_parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "swap_space": 16,
+      "disable_log_stats": "",
+      "disable_log_requests": "",
+      "load_format": "dummy"
+    },
+    "client_parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "endpoint_type": "vllm",
+      "dataset_name": "sharegpt",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200
+    }
+  }
+]
--- a/benchmarks/tests/throughput-tests.json
+++ b/benchmarks/tests/throughput-tests.json
@ -0,0 +1,38 @@
+[
+  {
+    "test_name": "throughput_qwen3_8B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen3-8B",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200,
+      "backend": "vllm"
+    }
+  },
+  {
+    "test_name": "throughput_qwen2_5vl_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "backend": "vllm-chat",
+      "dataset_name": "hf",
+      "hf_split": "train",
+      "max_model_len": 16384,
+      "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
+      "num_prompts": 200
+    }
+  },
+  {
+    "test_name": "throughput_qwen2_5_7B_tp1",
+    "parameters": {
+      "model": "Qwen/Qwen2.5-7B-Instruct",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
+      "num_prompts": 200,
+      "backend": "vllm"
+    }
+  }
+]
+
--- a/cmake/utils.cmake
+++ b/cmake/utils.cmake
@ -0,0 +1,133 @@
+#
+# Attempt to find the python package that uses the same python executable as
+# `EXECUTABLE` and is one of the `SUPPORTED_VERSIONS`.
+#
+macro (find_python_from_executable EXECUTABLE SUPPORTED_VERSIONS)
+  file(REAL_PATH ${EXECUTABLE} EXECUTABLE)
+  set(Python_EXECUTABLE ${EXECUTABLE})
+  find_package(Python COMPONENTS Interpreter Development.Module Development.SABIModule)
+  if (NOT Python_FOUND)
+    message(FATAL_ERROR "Unable to find python matching: ${EXECUTABLE}.")
+  endif()
+  set(_VER "${Python_VERSION_MAJOR}.${Python_VERSION_MINOR}")
+  set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN})
+  if (NOT _VER IN_LIST _SUPPORTED_VERSIONS_LIST)
+    message(FATAL_ERROR
+      "Python version (${_VER}) is not one of the supported versions: "
+      "${_SUPPORTED_VERSIONS_LIST}.")
+  endif()
+  message(STATUS "Found python matching: ${EXECUTABLE}.")
+endmacro()
+
+#
+# Run `EXPR` in python.  The standard output of python is stored in `OUT` and
+# has trailing whitespace stripped.  If an error is encountered when running
+# python, a fatal message `ERR_MSG` is issued.
+#
+function (run_python OUT EXPR ERR_MSG)
+  execute_process(
+    COMMAND
+    "${PYTHON_EXECUTABLE}" "-c" "${EXPR}"
+    OUTPUT_VARIABLE PYTHON_OUT
+    RESULT_VARIABLE PYTHON_ERROR_CODE
+    ERROR_VARIABLE PYTHON_STDERR
+    OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+  if(NOT PYTHON_ERROR_CODE EQUAL 0)
+    message(FATAL_ERROR "${ERR_MSG}: ${PYTHON_STDERR}")
+  endif()
+  set(${OUT} ${PYTHON_OUT} PARENT_SCOPE)
+endfunction()
+
+# Run `EXPR` in python after importing `PKG`. Use the result of this to extend
+# `CMAKE_PREFIX_PATH` so the torch cmake configuration can be imported.
+macro (append_cmake_prefix_path PKG EXPR)
+  run_python(_PREFIX_PATH
+    "import ${PKG}; print(${EXPR})" "Failed to locate ${PKG} path")
+  list(APPEND CMAKE_PREFIX_PATH ${_PREFIX_PATH})
+endmacro()
+
+
+# This cmake function is adapted from vllm (cmake/utils.cmake).
+# Define a target named `GPU_MOD_NAME` for a single extension. The
+# arguments are:
+#
+# DESTINATION <dest>         - Module destination directory.
+# LANGUAGE <lang>            - The GPU language for this module, e.g. CUDA, HIP,
+#                              etc.
+# SOURCES <sources>          - List of source files relative to CMakeLists.txt
+#                              directory.
+#
+# Optional arguments:
+#
+# ARCHITECTURES <arches>     - A list of target GPU architectures in cmake
+#                              format.
+#                              Refer to the `CMAKE_CUDA_ARCHITECTURES` documentation
+#                              and `CMAKE_HIP_ARCHITECTURES` for more info.
+#                              ARCHITECTURES will use cmake's defaults if
+#                              not provided.
+# COMPILE_FLAGS <flags>      - Extra compiler flags passed to NVCC/hip.
+# INCLUDE_DIRECTORIES <dirs> - Extra include directories.
+# LIBRARIES <libraries>      - Extra link libraries.
+# WITH_SOABI                 - Generate library with python SOABI suffix name.
+# USE_SABI <version>         - Use python stable api <version>
+#
+# Note: optimization level/debug info is set via cmake build type.
+#
+function (define_gpu_extension_target GPU_MOD_NAME)
+  cmake_parse_arguments(PARSE_ARGV 1
+    GPU
+    "WITH_SOABI"
+    "DESTINATION;LANGUAGE;USE_SABI"
+    "SOURCES;ARCHITECTURES;COMPILE_FLAGS;INCLUDE_DIRECTORIES;LIBRARIES")
+
+  # Add hipify preprocessing step when building with HIP/ROCm.
+  if (GPU_LANGUAGE STREQUAL "HIP")
+    hipify_sources_target(GPU_SOURCES ${GPU_MOD_NAME} "${GPU_SOURCES}")
+  endif()
+
+  if (GPU_WITH_SOABI)
+    set(GPU_WITH_SOABI WITH_SOABI)
+  else()
+    set(GPU_WITH_SOABI)
+  endif()
+
+  if (GPU_USE_SABI)
+    Python_add_library(${GPU_MOD_NAME} MODULE USE_SABI ${GPU_USE_SABI} ${GPU_WITH_SOABI} "${GPU_SOURCES}")
+  else()
+    Python_add_library(${GPU_MOD_NAME} MODULE ${GPU_WITH_SOABI} "${GPU_SOURCES}")
+  endif()
+
+  if (GPU_LANGUAGE STREQUAL "HIP")
+    # Make this target dependent on the hipify preprocessor step.
+    add_dependencies(${GPU_MOD_NAME} hipify${GPU_MOD_NAME})
+  endif()
+
+  if (GPU_ARCHITECTURES)
+    set_target_properties(${GPU_MOD_NAME} PROPERTIES
+      ${GPU_LANGUAGE}_ARCHITECTURES "${GPU_ARCHITECTURES}")
+  endif()
+
+  set_property(TARGET ${GPU_MOD_NAME} PROPERTY CXX_STANDARD 17)
+
+  target_compile_options(${GPU_MOD_NAME} PRIVATE
+    $<$<COMPILE_LANGUAGE:${GPU_LANGUAGE}>:${GPU_COMPILE_FLAGS}>)
+
+  target_compile_definitions(${GPU_MOD_NAME} PRIVATE
+    "-DTORCH_EXTENSION_NAME=${GPU_MOD_NAME}")
+
+  target_include_directories(${GPU_MOD_NAME} PRIVATE csrc
+    ${GPU_INCLUDE_DIRECTORIES})
+
+  target_link_libraries(${GPU_MOD_NAME} PRIVATE torch ${GPU_LIBRARIES})
+
+  # Don't use `TORCH_LIBRARIES` for CUDA since it pulls in a bunch of
+  # dependencies that are not necessary and may not be installed.
+  if (GPU_LANGUAGE STREQUAL "CUDA")
+    target_link_libraries(${GPU_MOD_NAME} PRIVATE CUDA::cudart CUDA::cuda_driver)
+  else()
+    target_link_libraries(${GPU_MOD_NAME} PRIVATE ${TORCH_LIBRARIES})
+  endif()
+
+  install(TARGETS ${GPU_MOD_NAME} LIBRARY DESTINATION ${GPU_DESTINATION} COMPONENT ${GPU_MOD_NAME})
+endfunction()
--- a/codecov.yml
+++ b/codecov.yml
@ -0,0 +1,30 @@
+#
+# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is a part of the vllm-ascend project.
+#
+
+coverage:
+  status:
+    # non-voting, new code must be fully tested
+    patch:
+      default:
+        target: 100%
+        # non-voting
+        informational: true
+    # non-voting
+    project:
+      default:
+        # non-voting
+        informational: true
--- a/collect_env.py
+++ b/collect_env.py
@ -0,0 +1,489 @@
+#
+# Copyright 2023 The vLLM team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/blob/main/collect_env.py
+#
+
+import datetime
+import locale
+import os
+import re
+import subprocess
+import sys
+from collections import namedtuple
+
+from vllm.envs import environment_variables
+
+try:
+    import torch
+    TORCH_AVAILABLE = True
+except (ImportError, NameError, AttributeError, OSError):
+    TORCH_AVAILABLE = False
+
+# System Environment Information
+SystemEnv = namedtuple(
+    'SystemEnv',
+    [
+        'torch_version',
+        'is_debug_build',
+        'gcc_version',
+        'clang_version',
+        'cmake_version',
+        'os',
+        'libc_version',
+        'python_version',
+        'python_platform',
+        'pip_version',  # 'pip' or 'pip3'
+        'pip_packages',
+        'conda_packages',
+        'cpu_info',
+        'vllm_version',  # vllm specific field
+        'vllm_ascend_version',  # vllm ascend specific field
+        'env_vars',
+        'npu_info',  # ascend specific field
+        'cann_info',  # ascend specific field
+    ])
+
+DEFAULT_CONDA_PATTERNS = {
+    "torch",
+    "numpy",
+    "soumith",
+    "mkl",
+    "magma",
+    "optree",
+    "transformers",
+    "zmq",
+    "pynvml",
+}
+
+DEFAULT_PIP_PATTERNS = {
+    "torch",
+    "numpy",
+    "mypy",
+    "flake8",
+    "optree",
+    "onnx",
+    "transformers",
+    "zmq",
+    "pynvml",
+}
+
+
+def run(command):
+    """Return (return-code, stdout, stderr)."""
+    shell = isinstance(command, str)
+    p = subprocess.Popen(command,
+                         stdout=subprocess.PIPE,
+                         stderr=subprocess.PIPE,
+                         shell=shell)
+    raw_output, raw_err = p.communicate()
+    rc = p.returncode
+    if get_platform() == 'win32':
+        enc = 'oem'
+    else:
+        enc = locale.getpreferredencoding()
+    output = raw_output.decode(enc)
+    err = raw_err.decode(enc)
+    return rc, output.strip(), err.strip()
+
+
+def run_and_read_all(run_lambda, command):
+    """Run command using run_lambda; reads and returns entire output if rc is 0."""
+    rc, out, _ = run_lambda(command)
+    if rc != 0:
+        return None
+    return out
+
+
+def run_and_parse_first_match(run_lambda, command, regex):
+    """Run command using run_lambda, returns the first regex match if it exists."""
+    rc, out, _ = run_lambda(command)
+    if rc != 0:
+        return None
+    match = re.search(regex, out)
+    if match is None:
+        return None
+    return match.group(1)
+
+
+def run_and_return_first_line(run_lambda, command):
+    """Run command using run_lambda and returns first line if output is not empty."""
+    rc, out, _ = run_lambda(command)
+    if rc != 0:
+        return None
+    return out.split('\n')[0]
+
+
+def get_conda_packages(run_lambda, patterns=None):
+    if patterns is None:
+        patterns = DEFAULT_CONDA_PATTERNS
+    conda = os.environ.get('CONDA_EXE', 'conda')
+    out = run_and_read_all(run_lambda, "{} list".format(conda))
+    if out is None:
+        return out
+
+    return "\n".join(line for line in out.splitlines()
+                     if not line.startswith("#") and any(name in line
+                                                         for name in patterns))
+
+
+def get_gcc_version(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'gcc --version', r'gcc (.*)')
+
+
+def get_clang_version(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'clang --version',
+                                     r'clang version (.*)')
+
+
+def get_cmake_version(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'cmake --version',
+                                     r'cmake (.*)')
+
+
+def _parse_version(version, version_tuple):
+    version_str = version_tuple[-1]
+    if isinstance(version_str, str) and version_str.startswith('g'):
+        if '.' in version_str:
+            git_sha = version_str.split('.')[0][1:]
+            date = version_str.split('.')[-1][1:]
+            return f"{version} (git sha: {git_sha}, date: {date})"
+        else:
+            git_sha = version_str[1:]  # type: ignore
+            return f"{version} (git sha: {git_sha})"
+    return version
+
+
+def get_vllm_version():
+    from vllm import __version__, __version_tuple__
+    return _parse_version(__version__, __version_tuple__)
+
+
+def get_vllm_ascend_version():
+    from vllm_ascend._version import __version__, __version_tuple__
+    return _parse_version(__version__, __version_tuple__)
+
+
+def get_cpu_info(run_lambda):
+    rc, out, err = 0, '', ''
+    if get_platform() == 'linux':
+        rc, out, err = run_lambda('lscpu')
+    elif get_platform() == 'win32':
+        rc, out, err = run_lambda(
+            'wmic cpu get Name,Manufacturer,Family,Architecture,ProcessorType,DeviceID, \
+        CurrentClockSpeed,MaxClockSpeed,L2CacheSize,L2CacheSpeed,Revision /VALUE'
+        )
+    elif get_platform() == 'darwin':
+        rc, out, err = run_lambda("sysctl -n machdep.cpu.brand_string")
+    cpu_info = 'None'
+    if rc == 0:
+        cpu_info = out
+    else:
+        cpu_info = err
+    return cpu_info
+
+
+def get_platform():
+    if sys.platform.startswith('linux'):
+        return 'linux'
+    elif sys.platform.startswith('win32'):
+        return 'win32'
+    elif sys.platform.startswith('cygwin'):
+        return 'cygwin'
+    elif sys.platform.startswith('darwin'):
+        return 'darwin'
+    else:
+        return sys.platform
+
+
+def get_mac_version(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'sw_vers -productVersion',
+                                     r'(.*)')
+
+
+def get_windows_version(run_lambda):
+    system_root = os.environ.get('SYSTEMROOT', 'C:\\Windows')
+    wmic_cmd = os.path.join(system_root, 'System32', 'Wbem', 'wmic')
+    findstr_cmd = os.path.join(system_root, 'System32', 'findstr')
+    return run_and_read_all(
+        run_lambda,
+        '{} os get Caption | {} /v Caption'.format(wmic_cmd, findstr_cmd))
+
+
+def get_lsb_version(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'lsb_release -a',
+                                     r'Description:\t(.*)')
+
+
+def check_release_file(run_lambda):
+    return run_and_parse_first_match(run_lambda, 'cat /etc/*-release',
+                                     r'PRETTY_NAME="(.*)"')
+
+
+def get_os(run_lambda):
+    from platform import machine
+    platform = get_platform()
+
+    if platform == 'win32' or platform == 'cygwin':
+        return get_windows_version(run_lambda)
+
+    if platform == 'darwin':
+        version = get_mac_version(run_lambda)
+        if version is None:
+            return None
+        return 'macOS {} ({})'.format(version, machine())
+
+    if platform == 'linux':
+        # Ubuntu/Debian based
+        desc = get_lsb_version(run_lambda)
+        if desc is not None:
+            return '{} ({})'.format(desc, machine())
+
+        # Try reading /etc/*-release
+        desc = check_release_file(run_lambda)
+        if desc is not None:
+            return '{} ({})'.format(desc, machine())
+
+        return '{} ({})'.format(platform, machine())
+
+    # Unknown platform
+    return platform
+
+
+def get_python_platform():
+    import platform
+    return platform.platform()
+
+
+def get_libc_version():
+    import platform
+    if get_platform() != 'linux':
+        return 'N/A'
+    return '-'.join(platform.libc_ver())
+
+
+def get_pip_packages(run_lambda, patterns=None):
+    """Return `pip list` output. Note: will also find conda-installed pytorch and numpy packages."""
+    if patterns is None:
+        patterns = DEFAULT_PIP_PATTERNS
+
+    # People generally have `pip` as `pip` or `pip3`
+    # But here it is invoked as `python -mpip`
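+    def run_with_pip(pip):
+        out = run_and_read_all(run_lambda, pip + ["list", "--format=freeze"])
+        if out is None:
+            # `pip list` failed; pretty_str reports None as 'Could not collect'.
+            return None
+        return "\n".join(line for line in out.splitlines()
+                         if any(name in line for name in patterns))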
+
+    pip_version = 'pip3' if sys.version[0] == '3' else 'pip'
+    out = run_with_pip([sys.executable, '-mpip'])
+
+    return pip_version, out
+
+
+def get_npu_info(run_lambda):
+    return run_and_read_all(run_lambda, 'npu-smi info')
+
+
+def get_cann_info(run_lambda):
+    out = run_and_read_all(run_lambda, 'lscpu | grep Architecture:')
+    cpu_arch = str(out).split()[-1]
+    return run_and_read_all(
+        run_lambda,
+        'cat /usr/local/Ascend/ascend-toolkit/latest/{}-linux/ascend_toolkit_install.info'
+        .format(cpu_arch))
+
+
+def get_env_vars():
+    env_vars = ''
+    secret_terms = ('secret', 'token', 'api', 'access', 'password')
+    report_prefix = ("TORCH", "PYTORCH", "ASCEND_", "ATB_")
+    for k, v in os.environ.items():
+        if any(term in k.lower() for term in secret_terms):
+            continue
+        # Report each variable once, even if it matches both conditions.
+        if k in environment_variables or k.startswith(report_prefix):
+            env_vars = env_vars + "{}={}".format(k, v) + "\n"
+
+    return env_vars
+
+
+def get_env_info():
+    run_lambda = run
+    pip_version, pip_list_output = get_pip_packages(run_lambda)
+
+    if TORCH_AVAILABLE:
+        version_str = torch.__version__
+        debug_mode_str = str(torch.version.debug)
+    else:
+        version_str = debug_mode_str = 'N/A'
+
+    sys_version = sys.version.replace("\n", " ")
+
+    conda_packages = get_conda_packages(run_lambda)
+
+    return SystemEnv(
+        torch_version=version_str,
+        is_debug_build=debug_mode_str,
+        python_version='{} ({}-bit runtime)'.format(
+            sys_version,
+            sys.maxsize.bit_length() + 1),
+        python_platform=get_python_platform(),
+        pip_version=pip_version,
+        pip_packages=pip_list_output,
+        conda_packages=conda_packages,
+        os=get_os(run_lambda),
+        libc_version=get_libc_version(),
+        gcc_version=get_gcc_version(run_lambda),
+        clang_version=get_clang_version(run_lambda),
+        cmake_version=get_cmake_version(run_lambda),
+        cpu_info=get_cpu_info(run_lambda),
+        vllm_version=get_vllm_version(),
+        vllm_ascend_version=get_vllm_ascend_version(),
+        env_vars=get_env_vars(),
+        npu_info=get_npu_info(run_lambda),
+        cann_info=get_cann_info(run_lambda),
+    )
+
+
+env_info_fmt = """
+PyTorch version: {torch_version}
+Is debug build: {is_debug_build}
+
+OS: {os}
+GCC version: {gcc_version}
+Clang version: {clang_version}
+CMake version: {cmake_version}
+Libc version: {libc_version}
+
+Python version: {python_version}
+Python platform: {python_platform}
+
+CPU:
+{cpu_info}
+
+Versions of relevant libraries:
+{pip_packages}
+{conda_packages}
+""".strip()
+
+# both the above code and the following code use `strip()` to
+# remove leading/trailing whitespaces, so we need to add a newline
+# in between to separate the two sections
+env_info_fmt += "\n"
+
+env_info_fmt += """
+vLLM Version: {vllm_version}
+vLLM Ascend Version: {vllm_ascend_version}
+
+ENV Variables:
+{env_vars}
+
+NPU:
+{npu_info}
+
+CANN:
+{cann_info}
+""".strip()
+
+
+def pretty_str(envinfo):
+
+    def replace_nones(dct, replacement='Could not collect'):
+        for key in dct.keys():
+            if dct[key] is not None:
+                continue
+            dct[key] = replacement
+        return dct
+
+    def replace_bools(dct, true='Yes', false='No'):
+        for key in dct.keys():
+            if dct[key] is True:
+                dct[key] = true
+            elif dct[key] is False:
+                dct[key] = false
+        return dct
+
+    def prepend(text, tag='[prepend]'):
+        lines = text.split('\n')
+        updated_lines = [tag + line for line in lines]
+        return '\n'.join(updated_lines)
+
+    def replace_if_empty(text, replacement='No relevant packages'):
+        if text is not None and len(text) == 0:
+            return replacement
+        return text
+
+    def maybe_start_on_next_line(string):
+        # If `string` is multiline, prepend a \n to it.
+        if string is not None and len(string.split('\n')) > 1:
+            return '\n{}\n'.format(string)
+        return string
+
+    mutable_dict = envinfo._asdict()
+
+    # Replace True with Yes, False with No
+    mutable_dict = replace_bools(mutable_dict)
+
+    # Replace all None objects with 'Could not collect'
+    mutable_dict = replace_nones(mutable_dict)
+
+    # If either of these are '', replace with 'No relevant packages'
+    mutable_dict['pip_packages'] = replace_if_empty(
+        mutable_dict['pip_packages'])
+    mutable_dict['conda_packages'] = replace_if_empty(
+        mutable_dict['conda_packages'])
+
+    # Tag conda and pip packages with a prefix
+    # If they were previously None, they'll show up as e.g. '[conda] Could not collect'
+    if mutable_dict['pip_packages']:
+        mutable_dict['pip_packages'] = prepend(
+            mutable_dict['pip_packages'], '[{}] '.format(envinfo.pip_version))
+    if mutable_dict['conda_packages']:
+        mutable_dict['conda_packages'] = prepend(
+            mutable_dict['conda_packages'], '[conda] ')
+    mutable_dict['cpu_info'] = envinfo.cpu_info
+    mutable_dict['npu_info'] = envinfo.npu_info
+    mutable_dict['cann_info'] = envinfo.cann_info
+    return env_info_fmt.format(**mutable_dict)
+
+
+def get_pretty_env_info():
+    return pretty_str(get_env_info())
+
+
+def main():
+    print("Collecting environment information...")
+    output = get_pretty_env_info()
+    print(output)
+
+    if TORCH_AVAILABLE and hasattr(torch, 'utils') and hasattr(
+            torch.utils, '_crash_handler'):
+        minidump_dir = torch.utils._crash_handler.DEFAULT_MINIDUMP_DIR
+        if sys.platform == "linux" and os.path.exists(minidump_dir):
+            dumps = [
+                os.path.join(minidump_dir, dump)
+                for dump in os.listdir(minidump_dir)
+            ]
+            # The directory can exist without containing any minidump.
+            if dumps:
+                latest = max(dumps, key=os.path.getctime)
+                ctime = os.path.getctime(latest)
+                creation_time = datetime.datetime.fromtimestamp(
+                    ctime).strftime('%Y-%m-%d %H:%M:%S')
+                msg = "\n*** Detected a minidump at {} created on {}, ".format(latest, creation_time) + \
+                      "if this is related to your bug please include it when you file a report ***"
+                print(msg, file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
--- a/csrc/camem_allocator.cpp
+++ b/csrc/camem_allocator.cpp
@ -0,0 +1,338 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <iostream>
+
+extern "C" {
+
+#define PY_SSIZE_T_CLEAN
+#include <Python.h>
+
+#include <sys/types.h>
+#include "acl/acl.h"
+
+// Global references to Python callables
+// NOTE: these are borrowed references, so we don't need to DECREF them.
+// This brings the limitation that the allocator needs to be singleton.
+static PyObject* g_python_malloc_callback = nullptr;
+static PyObject* g_python_free_callback = nullptr;
+
+
+// ---------------------------------------------------------------------------
+// Helper functions:
+
+void ensure_context(unsigned long long device) {
+  aclrtContext pctx;
+  aclrtGetCurrentContext(&pctx);
+  if (!pctx) {
+    // Ensure device context.
+    aclrtCreateContext(&pctx, device);
+    aclrtSetCurrentContext(pctx);
+  }
+}
+
+void create_and_map(unsigned long long device, ssize_t size, void* d_mem,
+                    aclrtDrvMemHandle* p_memHandle) {
+  ensure_context(device);
+  // Define memory allocation properties
+  aclrtPhysicalMemProp prop = {};
+  prop.handleType = ACL_MEM_HANDLE_TYPE_NONE ;
+  prop.allocationType = ACL_MEM_ALLOCATION_TYPE_PINNED;
+  prop.memAttr = ACL_HBM_MEM_HUGE;
+  prop.location.id = device;
+  prop.location.type = ACL_MEM_LOCATION_TYPE_DEVICE;
+  prop.reserve = 0;
+
+  // Allocate memory using aclrtMallocPhysical
+  aclError error_code = aclrtMallocPhysical(p_memHandle, size, &prop, 0);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+            << __LINE__ << std::endl;  
+    return;
+  }
+  error_code = aclrtMapMem(d_mem, size, 0, *p_memHandle, 0);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+            << __LINE__ << std::endl;  
+    return;
+  }
+}
+
+void unmap_and_release(unsigned long long device, ssize_t size,
+                       void* d_mem,
+                       aclrtDrvMemHandle* p_memHandle) {
+  // std::cout << "unmap_and_release: device=" << device << ", size=" << size <<
+  // ", d_mem=" << d_mem << ", p_memHandle=" << p_memHandle << std::endl;
+  ensure_context(device);
+  aclError error_code = aclrtUnmapMem(d_mem);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+            << __LINE__ << std::endl;  
+    return;
+  }
+  error_code = aclrtFreePhysical(*p_memHandle);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+            << __LINE__ << std::endl;  
+    return;
+  }
+}
+
+PyObject* create_tuple_from_c_integers(unsigned long long a,
+                                       unsigned long long b,
+                                       unsigned long long c,
+                                       unsigned long long d) {
+  // Create a new tuple of size 4
+  PyObject* tuple = PyTuple_New(4);
+  if (!tuple) {
+    return NULL;  // Return NULL on failure
+  }
+
+  // Convert integers to Python objects and set them in the tuple
+  PyTuple_SetItem(
+      tuple, 0,
+      PyLong_FromUnsignedLongLong(a));  // Steals reference to the PyLong
+  PyTuple_SetItem(tuple, 1, PyLong_FromUnsignedLongLong(b));
+  PyTuple_SetItem(tuple, 2, PyLong_FromUnsignedLongLong(c));
+  PyTuple_SetItem(tuple, 3, PyLong_FromUnsignedLongLong(d));
+
+  // Note: PyTuple_SetItem "steals" a reference to each object,
+  // so we do not need to Py_DECREF the PyLong objects explicitly.
+
+  return tuple;  // Return the created tuple
+}
+
+// ---------------------------------------------------------------------------
+// Our exported C functions that call Python:
+
+__attribute__ ((visibility("default"))) void* my_malloc(ssize_t size, int device, aclrtStream stream) {
+  ensure_context(device);
+
+  // First allocation: align the size, reserve an address range, and
+  // allocate an aclrtDrvMemHandle.
+
+  // Define memory allocation properties
+  aclrtPhysicalMemProp prop = {};
+  prop.handleType = ACL_MEM_HANDLE_TYPE_NONE ;
+  prop.allocationType = ACL_MEM_ALLOCATION_TYPE_PINNED;
+  prop.memAttr = ACL_HBM_MEM_HUGE;
+  prop.location.id = device;
+  prop.location.type = ACL_MEM_LOCATION_TYPE_DEVICE;
+  prop.reserve = 0;
+
+  // Query the minimum allocation granularity used to align the size
+  size_t granularity;
+  aclError error_code = aclrtMemGetAllocationGranularity(&prop,
+                                   ACL_RT_MEM_ALLOC_GRANULARITY_MINIMUM,
+                                   &granularity);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+            << __LINE__ << std::endl;  
+    return nullptr;
+  }
+  size_t alignedSize = ((size + granularity - 1) / granularity) * granularity;
+  void *d_mem;
+  error_code = aclrtReserveMemAddress(&d_mem, alignedSize, 0, nullptr, 0);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;  
+    return nullptr;
+  }
+  if (!g_python_malloc_callback) {
+    std::cerr << "ERROR: g_python_malloc_callback not set.\n";
+    // Release the reserved address range so it is not leaked.
+    aclrtReleaseMemAddress(d_mem);
+    return nullptr;
+  }
+
+  // allocate the aclrtDrvMemHandle
+  aclrtDrvMemHandle* p_memHandle =
+      (aclrtDrvMemHandle*)malloc(sizeof(aclrtDrvMemHandle));
+
+  // Acquire the GIL before touching any Python object.
+  PyGILState_STATE gstate = PyGILState_Ensure();
+
+  PyObject* arg_tuple = create_tuple_from_c_integers(
+      (unsigned long long)device, (unsigned long long)alignedSize,
+      (unsigned long long)d_mem, (unsigned long long)p_memHandle);
+
+  // Call g_python_malloc_callback
+  PyObject* py_result =
+      PyObject_CallFunctionObjArgs(g_python_malloc_callback, arg_tuple, NULL);
+  Py_DECREF(arg_tuple);
+
+  if (!py_result) {
+    PyErr_Print();
+    PyGILState_Release(gstate);
+    return nullptr;
+  }
+
+  // The callback's return value is unused; drop our reference to it.
+  Py_DECREF(py_result);
+  PyGILState_Release(gstate);
+
+  // do the final mapping
+  create_and_map(device, alignedSize, d_mem, p_memHandle);
+
+  return (void*)d_mem;
+}
+
+__attribute__ ((visibility("default"))) void my_free(void* ptr, ssize_t size, int device, aclrtStream stream) {
+  // get memory handle from the pointer
+  if (!g_python_free_callback) {
+    std::cerr << "ERROR: g_python_free_callback not set.\n";
+    return;
+  }
+
+  // Acquire the GIL before touching any Python object.
+  PyGILState_STATE gstate = PyGILState_Ensure();
+
+  PyObject* py_ptr =
+      PyLong_FromUnsignedLongLong(reinterpret_cast<unsigned long long>(ptr));
+
+  PyObject* py_result =
+      PyObject_CallFunctionObjArgs(g_python_free_callback, py_ptr, NULL);
+  Py_XDECREF(py_ptr);
+
+  if (!py_result || !PyTuple_Check(py_result) || PyTuple_Size(py_result) != 4) {
+    PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
+    // Drop the result and release the GIL on the error path as well.
+    Py_XDECREF(py_result);
+    PyGILState_Release(gstate);
+    return;
+  }
+
+  unsigned long long recv_device, recv_size;
+  unsigned long long recv_d_mem, recv_p_memHandle;
+  // Unpack the tuple into four C integers
+  if (!PyArg_ParseTuple(py_result, "KKKK", &recv_device, &recv_size,
+                        &recv_d_mem, &recv_p_memHandle)) {
+    // PyArg_ParseTuple sets an error if it fails
+    Py_DECREF(py_result);
+    PyGILState_Release(gstate);
+    return;
+  }
+
+  Py_DECREF(py_result);
+  PyGILState_Release(gstate);
+
+  // recv_size == size
+  // recv_device == device
+
+  // Free memory
+
+  void *d_mem = (void*)recv_d_mem;
+  // recover the aclrtDrvMemHandle pointer returned by the Python callback
+  aclrtDrvMemHandle* p_memHandle =
+      (aclrtDrvMemHandle*)recv_p_memHandle;
+  unmap_and_release(device, size, d_mem, p_memHandle);
+
+  // free address and the handle
+  aclError error_code = aclrtReleaseMemAddress(d_mem);
+  if (error_code != 0) {
+    std::cerr << "acl Error, code: " << error_code << " at " << __FILE__ << ":" \
+        << __LINE__ << std::endl;  
+    return;
+  }
+  free(p_memHandle);
+}
+
+// ---------------------------------------------------------------------------
+// Python extension boilerplate:
+
+// Python-exposed function: init_module(python_malloc, python_free)
+static PyObject* py_init_module(PyObject* self, PyObject* args) {
+  PyObject* malloc_callback = nullptr;
+  PyObject* free_callback = nullptr;
+
+  if (!PyArg_ParseTuple(args, "OO", &malloc_callback, &free_callback)) {
+    return nullptr;
+  }
+
+  if (!PyCallable_Check(malloc_callback) || !PyCallable_Check(free_callback)) {
+    PyErr_SetString(PyExc_TypeError, "Both arguments must be callables");
+    return nullptr;
+  }
+
+  // Save the Python callables
+  // This module does not handle GC of these objects, so they must be kept alive
+  // outside of this module.
+  g_python_malloc_callback = malloc_callback;
+  g_python_free_callback = free_callback;
+
+  Py_RETURN_NONE;
+}
+
+static PyObject* python_unmap_and_release(PyObject* self, PyObject* args) {
+  if (!args || !PyTuple_Check(args) || PyTuple_Size(args) != 4) {
+    PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
+    return nullptr;
+  }
+
+  unsigned long long recv_device, recv_size;
+  unsigned long long recv_d_mem, recv_p_memHandle;
+  // Unpack the tuple into four C integers
+  if (!PyArg_ParseTuple(args, "KKKK", &recv_device, &recv_size, &recv_d_mem,
+                        &recv_p_memHandle)) {
+    // PyArg_ParseTuple sets an error if it fails
+    return nullptr;
+  }
+
+  void *d_mem_ptr = (void*)recv_d_mem;
+  aclrtDrvMemHandle* p_memHandle =
+      (aclrtDrvMemHandle*)recv_p_memHandle;
+
+  unmap_and_release(recv_device, recv_size, d_mem_ptr, p_memHandle);
+
+  Py_RETURN_NONE;
+}
+
+static PyObject* python_create_and_map(PyObject* self, PyObject* args) {
+  if (!args || !PyTuple_Check(args) || PyTuple_Size(args) != 4) {
+    PyErr_SetString(PyExc_TypeError, "Expected a tuple of size 4");
+    return nullptr;
+  }
+
+  unsigned long long recv_device, recv_size;
+  unsigned long long recv_d_mem, recv_p_memHandle;
+  // Unpack the tuple into four C integers
+  if (!PyArg_ParseTuple(args, "KKKK", &recv_device, &recv_size, &recv_d_mem,
+                        &recv_p_memHandle)) {
+    // PyArg_ParseTuple sets an error if it fails
+    return nullptr;
+  }
+
+  void *d_mem_ptr = (void*)recv_d_mem;
+  aclrtDrvMemHandle* p_memHandle =
+      (aclrtDrvMemHandle*)recv_p_memHandle;
+
+  create_and_map(recv_device, recv_size, d_mem_ptr, p_memHandle);
+
+  Py_RETURN_NONE;
+}
+
+static PyMethodDef module_methods[] = {
+    {"init_module", (PyCFunction)py_init_module, METH_VARARGS,
+     "Initialize module with python_malloc and python_free callables."},
+    {"python_create_and_map", (PyCFunction)python_create_and_map, METH_VARARGS,
+     "Create and map memory on the device."},
+    {"python_unmap_and_release", (PyCFunction)python_unmap_and_release,
+     METH_VARARGS, "Unmap and release memory on the device."},
+    {NULL, NULL, 0, NULL}  // sentinel
+};
+
+static struct PyModuleDef camem_allocator_module = {
+    PyModuleDef_HEAD_INIT, "camem_allocator",
+    "CANN-mem-based allocator for NPUPluggableAllocator", -1, module_methods};
+
+PyMODINIT_FUNC PyInit_vllm_ascend_C(void) {
+  // Initialize the module
+  PyObject* module = PyModule_Create(&camem_allocator_module);
+  if (!module) {
+    return NULL;
+  }
+  return module;
+}
+}  // extern "C"
--- a/csrc/kernels/get_masked_input_and_mask_kernel.cpp
+++ b/csrc/kernels/get_masked_input_and_mask_kernel.cpp
@ -0,0 +1,378 @@
+/* 
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ */
+
+#include "kernel_operator.h"
+#include "kernel_tensor_impl.h"
+#include "kernel_type.h"
+#include "types.h"
+#include "utils.h"
+using vllm_ascend::AccType;
+
+template<typename scalar_t>
+class GetMaskedInputAndMask {
+public:
+    __aicore__ inline GetMaskedInputAndMask() {}
+    
+    __aicore__ inline ~GetMaskedInputAndMask() {
+        pipe.Reset();
+    }
+
+    
+    __aicore__ inline void Init(
+        __gm__ scalar_t* input,
+        __gm__ scalar_t* masked_input, 
+        __gm__ bool* mask_out,
+        const int64_t org_vocab_start_index,
+        const int64_t org_vocab_end_index,
+        const int64_t num_org_vocab_padding,
+        const int64_t added_vocab_start_index,
+        const int64_t added_vocab_end_index,
+        const int64_t size)
+    {
+        // Initialize basic parameters
+        input_ = input;
+        masked_input_ = masked_input;
+        mask_out_ = mask_out;
+        org_vocab_start_index_ = org_vocab_start_index;
+        org_vocab_end_index_ = org_vocab_end_index;
+        size_ = ((size + 31) / 32) * 32;
+        added_offset_ = added_vocab_start_index - 
+            (org_vocab_end_index - org_vocab_start_index) - 
+            num_org_vocab_padding;
+        added_vocab_start_index_ = added_vocab_start_index;
+        added_vocab_end_index_ = added_vocab_end_index;
+
+        // Initialize global tensors
+        inputGlobal.SetGlobalBuffer(input);
+        maskedOutputGlobal.SetGlobalBuffer(masked_input); 
+        maskOutGlobal.SetGlobalBuffer(mask_out);
+
+        // Initialize queues
+        pipe.InitBuffer(inQueue, 1, size_ * sizeof(scalar_t));
+        pipe.InitBuffer(outQueue, 1, size_ * sizeof(scalar_t));
+        pipe.InitBuffer(maskQueue, 1, size_ * sizeof(bool));
+        
+        // Initialize calculation buffers
+        // NOTE: calc_buf_1 and calc_buf_2 are also used for int16 casting on older archs.
+        pipe.InitBuffer(calc_buf_1, size_ * sizeof(float));
+        pipe.InitBuffer(calc_buf_2, size_ * sizeof(float));
+        
+        // Initialize result queues
+        pipe.InitBuffer(result_ge_que, BUFFER_NUM, size_ * sizeof(float));
+        pipe.InitBuffer(result_le_que, BUFFER_NUM, size_ * sizeof(float));
+        pipe.InitBuffer(result_org_mask_que, BUFFER_NUM, size_ * sizeof(float));
+        pipe.InitBuffer(result_add_mask_que, BUFFER_NUM, size_ * sizeof(float));
+
+        // Initialize temporary buffers
+        pipe.InitBuffer(start_buf, size_ * sizeof(float));
+        pipe.InitBuffer(end_buf, size_ * sizeof(float));
+        pipe.InitBuffer(inputFloat_buf, size_ * sizeof(float)); // Also used for half intermediate in casting
+        pipe.InitBuffer(validOffset_buf, size_ * sizeof(float));
+        pipe.InitBuffer(vocabMask_buf_, size_ * sizeof(int8_t));
+        pipe.InitBuffer(ones_buf_, size_ * sizeof(float));
+    }
+
+    __aicore__ inline void Process()
+    {
+        CopyIn();
+        Compute();
+        CopyOut();
+    }
+
+private:
+    __aicore__ inline void CopyIn()
+    {
+        AscendC::LocalTensor<scalar_t> inputLocal = inQueue.AllocTensor<scalar_t>();
+        AscendC::DataCopy(inputLocal, inputGlobal, size_);
+        inQueue.EnQue(inputLocal);
+    }
+
+    __aicore__ inline void CompareWithValue(
+        AscendC::LocalTensor<int8_t>& result,
+        const AscendC::LocalTensor<float>& input,
+        const AscendC::LocalTensor<float>& compare_value,
+        bool is_greater_equal) {
+
+        AscendC::LocalTensor<float> compute_buf = calc_buf_1.Get<float>();
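+        // Both branches yield a difference that is zero exactly when
+        // compare_value >= input; only the sign differs, and Abs below drops it.
+        if (is_greater_equal) {
+            AscendC::Max(compute_buf, input, compare_value, size_);
+            AscendC::Sub(compute_buf, compare_value, compute_buf, size_);
+        } else {
+            AscendC::Max(compute_buf, input, compare_value, size_);
+            AscendC::Sub(compute_buf, compute_buf, compare_value, size_);
+        }
+
+        // Turn the difference d into a 0/1 flag without a compare instruction:
+        // min(|d|, 2^-126) * 2^50 * 2^50 * 2^26 is 0 for d == 0 and 1 for |d| >= 2^-126.
+        AscendC::Abs(compute_buf, compute_buf, size_);
+        AscendC::Mins(compute_buf, compute_buf, MIN_ACCURACY_FP32, size_);
+        AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
+        AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
+        AscendC::Muls(compute_buf, compute_buf, MAX_MUL_2_FP32, size_);
+        // Invert the flag: |flag - 1| is 1 where d == 0 and 0 elsewhere.
+        AscendC::Adds(compute_buf, compute_buf, NEGATIVE_ONE_FP32, size_);
+        AscendC::Abs(compute_buf, compute_buf, size_);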
+
+        AscendC::LocalTensor<half> compute_buf_fp16 = calc_buf_2.Get<half>();
+        AscendC::Cast(compute_buf_fp16, compute_buf, AscendC::RoundMode::CAST_NONE, size_);
+        AscendC::Cast(result, compute_buf_fp16, AscendC::RoundMode::CAST_NONE, size_);
+    }
+
+    __aicore__ inline void ComputeRangeMask(
+        AscendC::LocalTensor<int8_t>& range_mask,
+        const AscendC::LocalTensor<float>& input,
+        const float start_value, 
+        const float end_value) {
+        
+        AscendC::LocalTensor<float> start_value_tensor = start_buf.Get<float>();
+        AscendC::LocalTensor<float> end_value_tensor = end_buf.Get<float>();
+
+        AscendC::Duplicate(start_value_tensor, start_value, size_);
+        AscendC::Duplicate(end_value_tensor, end_value, size_);
+        
+        AscendC::LocalTensor<int8_t> ge_result = result_ge_que.AllocTensor<int8_t>();
+        AscendC::LocalTensor<int8_t> lt_result = result_le_que.AllocTensor<int8_t>();
+
+        CompareWithValue(ge_result, start_value_tensor, input, true);
+        CompareWithValue(lt_result, input, end_value_tensor, false);
+        
+#if (__CCE_AICORE__ >= 220) 
+        AscendC::And(range_mask, ge_result, lt_result, size_);
+#else
+        {
+            // WORKAROUND for older arch
+            // No direct int8->int16 cast. Use half as intermediate.
+            // No direct int8 And. Use int16 And.
+            AscendC::LocalTensor<int16_t> ge_result_i16 = calc_buf_1.Get<int16_t>();
+            AscendC::LocalTensor<int16_t> lt_result_i16 = calc_buf_2.Get<int16_t>();
+            AscendC::LocalTensor<int16_t> range_mask_i16 = ge_result_i16; 
+            
+            // Use a temporary buffer for half type
+            AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
+
+            // 1. Cast inputs: int8_t -> half -> int16_t
+            AscendC::Cast(tmp_half, ge_result, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(ge_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+            
+            AscendC::Cast(tmp_half, lt_result, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(lt_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+
+            // 2. Perform And on int16_t tensors
+            AscendC::And(range_mask_i16, ge_result_i16, lt_result_i16, size_);
+
+            // 3. Cast result back: int16_t -> half -> int8_t
+            AscendC::Cast(tmp_half, range_mask_i16, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(range_mask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+        }
+#endif
+    }
+
+    __aicore__ inline void Compute() {
+        AscendC::LocalTensor<scalar_t> inputLocal = inQueue.DeQue<scalar_t>();
+        AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.AllocTensor<scalar_t>();
+        AscendC::LocalTensor<int8_t> maskLocal = maskQueue.AllocTensor<int8_t>();
+
+        AscendC::LocalTensor<float> inputFloat = inputFloat_buf.Get<float>();
+        AscendC::Cast(inputFloat, inputLocal, AscendC::RoundMode::CAST_NONE, size_);
+
+        AscendC::LocalTensor<int8_t> orgVocabMask = result_org_mask_que.AllocTensor<int8_t>();
+        ComputeRangeMask(orgVocabMask, 
+                        inputFloat,
+                        static_cast<float>(org_vocab_start_index_),
+                        static_cast<float>(org_vocab_end_index_));
+
+        AscendC::LocalTensor<int8_t> addedVocabMask = result_add_mask_que.AllocTensor<int8_t>();
+        ComputeRangeMask(addedVocabMask,
+                        inputFloat,
+                        static_cast<float>(added_vocab_start_index_),
+                        static_cast<float>(added_vocab_end_index_));
+
+        AscendC::LocalTensor<float> validOffset = validOffset_buf.Get<float>();
+        AscendC::LocalTensor<float> constOrgStartIndex = start_buf.Get<float>();
+        
+        AscendC::Duplicate(constOrgStartIndex, float(org_vocab_start_index_), size_);
+        
+        AscendC::LocalTensor<half> orgVocabMask_fp16;
+        AscendC::LocalTensor<float> orgVocabMask_fp32;
+        AscendC::Cast(orgVocabMask_fp16, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
+        AscendC::Cast(orgVocabMask_fp32, orgVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
+
+        AscendC::Mul(validOffset, constOrgStartIndex, orgVocabMask_fp32, size_);
+
+        AscendC::LocalTensor<float> addedOffset;
+        AscendC::LocalTensor<float> addedOffsetTensor = end_buf.Get<float>();
+        AscendC::Duplicate(addedOffsetTensor, float(added_offset_), size_);
+
+        AscendC::LocalTensor<half> addedVocabMask_fp16;
+        AscendC::LocalTensor<float> addedVocabMask_fp32;
+        AscendC::Cast(addedVocabMask_fp16, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
+        AscendC::Cast(addedVocabMask_fp32, addedVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
+
+        AscendC::Mul(addedOffset, addedOffsetTensor, addedVocabMask_fp32, size_);
+        AscendC::Add(validOffset, validOffset, addedOffset, size_);
+
+        AscendC::LocalTensor<int8_t> vocabMask = vocabMask_buf_.Get<int8_t>();
+        
+#if (__CCE_AICORE__ >= 220)
+        AscendC::Or(vocabMask,
+                    orgVocabMask,
+                    addedVocabMask,
+                    size_);
+#else
+        {
+            // WORKAROUND for older arch 
+            // No direct int8->int16 cast. Use half as intermediate.
+            // No direct int8 Or. Use int16 Or.
+            AscendC::LocalTensor<int16_t> orgVocabMask_i16 = calc_buf_1.Get<int16_t>();
+            AscendC::LocalTensor<int16_t> addedVocabMask_i16 = calc_buf_2.Get<int16_t>();
+            AscendC::LocalTensor<int16_t> vocabMask_i16 = orgVocabMask_i16; 
+
+            // Use inputFloat_buf as a temporary half buffer. NOTE: it still backs inputFloat,
+            // which is read again by the Sub after this block, so this aliasing needs care.
+            AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
+
+            // 1. Cast inputs: int8_t -> half -> int16_t
+            AscendC::Cast(tmp_half, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(orgVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+
+            AscendC::Cast(tmp_half, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(addedVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+
+            // 2. Perform Or on int16_t tensors
+            AscendC::Or(vocabMask_i16, orgVocabMask_i16, addedVocabMask_i16, size_);
+
+            // 3. Cast result back: int16_t -> half -> int8_t
+            AscendC::Cast(tmp_half, vocabMask_i16, AscendC::RoundMode::CAST_NONE, size_);
+            AscendC::Cast(vocabMask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
+        }
+#endif
+
+        AscendC::Sub(inputFloat, inputFloat, validOffset, size_);
+
+        AscendC::LocalTensor<half> vocabMask_fp16;
+        AscendC::LocalTensor<float> vocabMask_fp32;
+        AscendC::Cast(vocabMask_fp16, vocabMask, AscendC::RoundMode::CAST_NONE, size_);
+        AscendC::Cast(vocabMask_fp32, vocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
+        
+        AscendC::Mul(inputFloat, inputFloat, vocabMask_fp32, size_);
+
+        AscendC::Cast(maskedLocal, inputFloat, AscendC::RoundMode::CAST_CEIL, size_);  
+        outQueue.EnQue(maskedLocal);
+
+        AscendC::LocalTensor<float> ones_tensor = ones_buf_.Get<float>();
+        AscendC::Duplicate(ones_tensor, (float)1, size_);
+        AscendC::LocalTensor<float> maskLocal_fp32;
+
+        AscendC::Sub(maskLocal_fp32, ones_tensor, vocabMask_fp32, size_);
+
+        AscendC::LocalTensor<half> maskLocal_fp16;
+        AscendC::Cast(maskLocal_fp16, maskLocal_fp32, AscendC::RoundMode::CAST_NONE, size_);
+        AscendC::Cast(maskLocal, maskLocal_fp16, AscendC::RoundMode::CAST_NONE, size_);
+        maskQueue.EnQue(maskLocal);
+        inQueue.FreeTensor(inputLocal);
+    }
+
+    __aicore__ inline void CopyOut()
+    {
+        AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.DeQue<scalar_t>();
+        AscendC::LocalTensor<bool> maskLocal = maskQueue.DeQue<bool>();
+        
+        AscendC::DataCopy(maskedOutputGlobal, maskedLocal, size_);
+        AscendC::DataCopy(maskOutGlobal, maskLocal, size_);
+        
+        outQueue.FreeTensor(maskedLocal);
+        maskQueue.FreeTensor(maskLocal);
+    }
+
+private:
+    static constexpr int32_t BUFFER_NUM = 2;
+    AscendC::TPipe pipe;
+    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueue;
+    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue, maskQueue;
+    AscendC::GlobalTensor<scalar_t> inputGlobal, maskedOutputGlobal;
+    AscendC::GlobalTensor<bool> maskOutGlobal;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_1;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_2;
+    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_ge_que;
+    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_le_que;
+    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_org_mask_que;
+    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_add_mask_que;
+
+    // Temporary buffers
+    AscendC::TBuf<AscendC::TPosition::VECCALC> start_buf;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> end_buf; 
+    AscendC::TBuf<AscendC::TPosition::VECCALC> inputFloat_buf;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> validOffset_buf;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> vocabMask_buf_;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> ones_buf_;
+    
+    __gm__ scalar_t *input_, *masked_input_;
+    __gm__ bool *mask_out_;
+    int64_t size_;
+    int64_t org_vocab_start_index_, org_vocab_end_index_;
+    int64_t added_vocab_start_index_, added_vocab_end_index_;
+    int64_t added_offset_;
+
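+    // Constants for the branch-free zero test in CompareWithValue:
+    // 2^-126 * 2^50 * 2^50 * 2^26 == 1.
+    static constexpr float MIN_ACCURACY_FP32 = 1.1754943508222875e-38; // 2^-126, smallest normal fp32
+    static constexpr float MAX_MUL_1_FP32 = 1125899906842624;          // 2^50
+    static constexpr float MAX_MUL_2_FP32 = 67108864;                  // 2^26
+    static constexpr float NEGATIVE_ONE_FP32 = -1.0f;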
+};
+
+extern "C" __global__ __aicore__ void get_masked_input_and_mask_kernel(
+    __gm__ int32_t* input,
+    __gm__ int32_t* masked_input,
+    __gm__ bool* mask_out, 
+    const int64_t org_vocab_start_index,
+    const int64_t org_vocab_end_index,
+    const int64_t num_org_vocab_padding,
+    const int64_t added_vocab_start_index,
+    const int64_t added_vocab_end_index,
+    const int64_t size,
+    const uint32_t loop_cnt,
+    const uint32_t aiv_num)
+{
+    {
+        GetMaskedInputAndMask<int32_t> op{};
+
+        for (int64_t i = AscendC::GetBlockIdx(); i < loop_cnt; i += aiv_num) {
+            op.Init(input + i * size/loop_cnt, 
+                   masked_input + i * size/loop_cnt,
+                   mask_out + i * size/loop_cnt,
+                   org_vocab_start_index, org_vocab_end_index,
+                   num_org_vocab_padding, added_vocab_start_index,
+                   added_vocab_end_index, size/loop_cnt);
+                
+            op.Process();
+        }
+    } // op destructor called here
+}
+
+namespace vllm_ascend {
+
+void get_masked_input_and_mask_impl(
+    void* stream,
+    void* input,
+    void* masked_input,
+    void* mask_out,
+    const int64_t org_vocab_start_index,
+    const int64_t org_vocab_end_index,
+    const int64_t num_org_vocab_padding, 
+    const int64_t added_vocab_start_index,
+    const int64_t added_vocab_end_index,
+    const int64_t size,
+    const uint32_t loop_cnt,
+    const uint32_t aiv_num)
+{
+    get_masked_input_and_mask_kernel<<<aiv_num, nullptr, stream>>>(
+        static_cast<int32_t*>(input),
+        static_cast<int32_t*>(masked_input),
+        static_cast<bool*>(mask_out),
+        org_vocab_start_index,
+        org_vocab_end_index,
+        num_org_vocab_padding,
+        added_vocab_start_index,
+        added_vocab_end_index,
+        size,
+        loop_cnt,
+        aiv_num);
+}
+
+} // namespace vllm_ascend
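Reviewer note on the `GetMaskedInputAndMask` kernel above: the `Abs`/`Mins`/`Muls`/`Adds` sequence in `CompareWithValue` can be hard to follow in vector form. The following is a minimal host-side scalar sketch of what that sequence appears to compute, assuming plain IEEE fp32 arithmetic. The helper names `is_zero_flag`, `ge_flag` and `upper_flag` are hypothetical and are not part of the PR; only the three constants are copied from the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Values copied from the kernel constants: 2^-126, 2^50 and 2^26.
constexpr float kMinAccuracy = 1.1754943508222875e-38f;
constexpr float kMaxMul1 = 1125899906842624.0f;
constexpr float kMaxMul2 = 67108864.0f;

// Hypothetical helper: 1.0f when diff == 0, 0.0f when |diff| >= 2^-126.
inline float is_zero_flag(float diff) {
    assert(!std::isnan(diff));
    float v = std::fabs(diff);
    v = std::min(v, kMinAccuracy);  // clamp: 0 stays 0, larger values become 2^-126
    v *= kMaxMul1;                  // 2^-126 * 2^50 = 2^-76
    v *= kMaxMul1;                  // 2^-76  * 2^50 = 2^-26
    v *= kMaxMul2;                  // 2^-26  * 2^26 = 1
    return std::fabs(v - 1.0f);     // invert: 1 for a zero diff, 0 otherwise
}

// Mirrors CompareWithValue(ge_result, start, x, true): 1 when x >= start.
inline float ge_flag(float x, float start) {
    return is_zero_flag(x - std::max(start, x));
}

// Mirrors CompareWithValue(lt_result, x, end, false) as written: 1 when x <= end.
inline float upper_flag(float x, float end) {
    return is_zero_flag(std::max(x, end) - end);
}
```

Mirrored literally, the upper-bound check returns 1 when `x` equals the bound. Whether that is intended for a half-open vocab range `[start, end)` is worth confirming against the reference implementation.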
--- a/csrc/kernels/pos_encoding_kernels.cpp
+++ b/csrc/kernels/pos_encoding_kernels.cpp
@ -0,0 +1,377 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "kernel_operator.h"
+#include "kernel_tpipe_impl.h"
+#include "kernel_tensor_impl.h"
+#include "kernel_type.h"
+#include "kernel_operator_intf.h"
+#include "inner_interface/inner_kernel_operator_intf.h"
+#include <stdio.h>
+#include "types.h"
+#include "utils.h"
+
+
+using vllm_ascend::AccType;
+using vllm_ascend::local_mem_copy;
+template <typename scalar_t, bool isNeox> class RotaryEmbedding {
+    // NOTE(ganyi): we use 512B as the load stride for the pipe; we need to find another way to
+    // retrieve this size from the runtime to support more SoCs
+    #if (__CCE_AICORE__ >= 220)
+        static int constexpr loadSize = 512;
+    #else
+        static int constexpr loadSize = 1024 * 4;
+    #endif
+    using dst_t = scalar_t;
+    using acc_t = typename AccType<scalar_t>::type;
+    // only half tensors have a cast instruction to int8, so acc_dst_t is hardcoded as half
+    using local_scalar_t = AscendC::LocalTensor<scalar_t>;
+    using local_acc_t = AscendC::LocalTensor<acc_t>;
+    using local_dst_t = AscendC::LocalTensor<dst_t>;
+
+public:
+    __aicore__ inline RotaryEmbedding()
+    {
+    }
+
+    // Allocate buffers for the input and output queues and the temp buffer used during kernel compute.
+    // This init process happens only in the kernel compute on a single vector core.
+    __aicore__ inline void init(__gm__ int64_t *positions, __gm__ void *queryDst, __gm__ void *keyDst,
+                                __gm__ scalar_t *query, __gm__ scalar_t *key, __gm__ scalar_t *cosSinCache,
+                                const int rotDim, const int64_t dstQueryStride,
+                                const int64_t dstKeyStride, const int64_t queryStride, const int64_t keyStride,
+                                const int numHeads, const int numKvHeads, const int headSize, AscendC::TPipe *pipe)
+    {
+        pipe_ = pipe;
+        rotDim_ = rotDim;
+        // queryStride and keyStride handle strided tensors that are not contiguous along the num_tokens dim
+        queryStride_ = queryStride;
+        keyStride_ = keyStride;
+        dstQueryStride_ = dstQueryStride;
+        dstKeyStride_ = dstKeyStride;
+        numHeads_ = numHeads;
+        numKvHeads_ = numKvHeads;
+        headSize_ = headSize;
+        embedDim_ = rotDim / 2;
+
+        pipe_->InitBuffer(inQue_, 1 /* buffer_num */, loadSize /* buffer_size */);
+        pipe_->InitBuffer(inQueSinCos_, 1 /* buffer_num */, rotDim_ * sizeof(scalar_t) /* buffer_size */);
+        pipe_->InitBuffer(outQue_, 1 /* buffer_num */, loadSize /* buffer_size */);
+        // 2 temporary calculation buffer
+        calcTmpBufferOffset_ = 0;
+        // 1 upcast buffer for bf16 (headSize)
+        upcastInputBufferOffset_ = calcTmpBufferOffset_ + sizeof(acc_t) * embedDim_ * 2;
+        // 1 upcast temp buffer for bf16 (2 * embed_dim)
+        upcastTempBufferOffset_ = upcastInputBufferOffset_ + sizeof(acc_t) * headSize_;
+        // 2 sin cos upcast buffer for bf16
+        cosSinUpcastBufferOffset_ = upcastTempBufferOffset_ + sizeof(acc_t) * 2 * embedDim_;
+        // bf16 path: needs the 2 cos/sin upcast buffers
+        // fp16 path: needs the 2 temporary calculation buffers
+        tempBufferSize_ = cosSinUpcastBufferOffset_ + 2 * embedDim_ * sizeof(acc_t);
+        // need to consider upcast the bf16 to fp32, so we might need 4 buffer just in case
+        // 2 temporary buffer, 2 input buffer, 1 cos buffer, 1 sin buffer, 2 scale buffer (headSize), 2 zp
+        // buffer(headSize int8), 1 dst_temp buffer(headSize, int32)
+        pipe_->InitBuffer(calcBuf_, tempBufferSize_ /* buffer_size */);
+        if constexpr (!std::is_same_v<scalar_t, acc_t>) {
+            pipe_->InitBuffer(copyBuf_, loadSize);
+        }
+    }
+    __aicore__ inline void update_mem_offset(__gm__ int64_t *positions, __gm__ void *queryDst, __gm__ void *keyDst,
+                                  __gm__ scalar_t *query, __gm__ scalar_t *key, __gm__ scalar_t *cosSinCache,
+                                  const int rotDim, const int64_t dstQueryStride, const int64_t dstKeyStride,
+                                  const int64_t queryStride, const int64_t keyStride, const int numHeads,
+                                  const int numKvHeads, const int headSize, const int64_t idx)
+    {
+        int64_t pos = positions[idx];
+        cosSin_.SetGlobalBuffer(cosSinCache + pos * rotDim_, rotDim_);
+        query_.SetGlobalBuffer(query + queryStride * idx, headSize * numHeads_);
+        key_.SetGlobalBuffer(key + keyStride * idx, headSize * numKvHeads_);
+        queryDst_.SetGlobalBuffer(reinterpret_cast<__gm__ dst_t *>(queryDst) + dstQueryStride * idx,
+                                  headSize * numHeads_);
+        keyDst_.SetGlobalBuffer(reinterpret_cast<__gm__ dst_t *>(keyDst) + dstKeyStride * idx, headSize * numKvHeads_);
+    }
+
+    // compute per head for neox on bf16
+    template <typename acc_t_, typename std::enable_if<!std::is_same_v<acc_t_, scalar_t>, void>::type * = nullptr>
+    __aicore__ inline void
+    neox_compute(local_scalar_t src, local_dst_t dst, AscendC::LocalTensor<acc_t_> sin, AscendC::LocalTensor<acc_t_> cos,
+                 AscendC::LocalTensor<acc_t_> upcastInputBuffer, AscendC::LocalTensor<acc_t_> calcTmpBuffer)
+    {
+        // slice dst
+        local_dst_t dstX = dst;
+        local_dst_t dstY = dst[embedDim_];
+
+        // slice src
+        local_scalar_t srcX = src;
+        local_scalar_t srcY = src[embedDim_];
+
+        // slice temp buffer
+        local_acc_t calcTmpBufferX = calcTmpBuffer;
+        local_acc_t calcTmpBufferY = calcTmpBuffer[embedDim_];
+
+        // slice upcast input buffer
+        local_acc_t upcastBufferX = upcastInputBuffer;
+        local_acc_t upcastBufferY = upcastBufferX[embedDim_];
+
+        // dst x calc
+        Cast(upcastInputBuffer, src, AscendC::RoundMode::CAST_NONE, headSize_);
+        Mul(calcTmpBufferX, upcastBufferX, cos, embedDim_);
+        Mul(calcTmpBufferY, upcastBufferY, sin, embedDim_);
+        Sub(calcTmpBufferX, calcTmpBufferX, calcTmpBufferY, embedDim_);
+        Cast(dstX, calcTmpBufferX, AscendC::RoundMode::CAST_TRUNC, embedDim_);
+
+        // dst y calc
+        Mul(calcTmpBufferX, upcastBufferX, sin, embedDim_);
+        Mul(calcTmpBufferY, upcastBufferY, cos, embedDim_);
+        Add(calcTmpBufferX, calcTmpBufferX, calcTmpBufferY, embedDim_);
+        Cast(dstY, calcTmpBufferX, AscendC::RoundMode::CAST_TRUNC, embedDim_);
+    }
+
+    // compute per head output for neox
+    template <typename acc_t_, typename std::enable_if<std::is_same_v<acc_t_, scalar_t>, void>::type * = nullptr>
+    __aicore__ inline void
+    neox_compute(local_scalar_t src, local_dst_t dst, AscendC::LocalTensor<acc_t_> sin, AscendC::LocalTensor<acc_t_> cos,
+                 AscendC::LocalTensor<acc_t_> upcastInputBuffer, AscendC::LocalTensor<acc_t_> calcTmpBuffer)
+    {
+        // slice dst buffer
+        local_dst_t dstX = dst;
+        local_dst_t dstY = dst[embedDim_];
+        // slice src buffer
+        local_scalar_t srcX = src;
+        local_scalar_t srcY = src[embedDim_];
+        // slice temp buffer
+        local_acc_t calcTmpBufferX = calcTmpBuffer;
+        local_acc_t calcTmpBufferY = calcTmpBuffer[embedDim_];
+
+        // dst x calc
+        Mul(calcTmpBufferX, srcX, cos, embedDim_);
+        Mul(calcTmpBufferY, srcY, sin, embedDim_);
+        Sub(dstX, calcTmpBufferX, calcTmpBufferY, embedDim_);
+
+        // dst y calc
+        Mul(calcTmpBufferX, srcX, sin, embedDim_);
+        Mul(calcTmpBufferY, srcY, cos, embedDim_);
+        Add(dstY, calcTmpBufferX, calcTmpBufferY, embedDim_);
+    }
+
+    __aicore__ inline void compute_qk(AscendC::GlobalTensor<scalar_t> srcG, AscendC::GlobalTensor<dst_t> dstG,
+                                          local_acc_t localCos, local_acc_t localSin, local_acc_t upcastInputBuffer,
+                                          local_acc_t calcTmpBuffer, int loopCnt, int tailHeads, int loadStride,
+                                          int headNumPerLoad)
+    {
+        for (int loopNum = 0; loopNum < loopCnt; ++loopNum) {
+            local_scalar_t src = inQue_.AllocTensor<scalar_t>();
+            local_dst_t dst = outQue_.AllocTensor<dst_t>();
+            AscendC::DataCopy(src, srcG[loopNum * loadStride], loadStride);
+            inQue_.EnQue(src);
+
+            local_scalar_t srcDeque = inQue_.DeQue<scalar_t>();
+            if constexpr (!std::is_same_v<scalar_t, acc_t>) {
+                int elem_num = loadStride / sizeof(scalar_t);
+                AscendC::LocalTensor<acc_t> upBuffer = copyBuf_.GetWithOffset<acc_t>(elem_num, 0);
+                Cast(upBuffer, srcDeque, AscendC::RoundMode::CAST_TRUNC, elem_num);
+                Cast(dst, upBuffer, AscendC::RoundMode::CAST_TRUNC, elem_num);
+            } else {
+                local_mem_copy(dst, srcDeque, loadStride);
+            }
+            for (int i = 0; i < headNumPerLoad; ++i) {
+                neox_compute(srcDeque[i * headSize_], dst[i * headSize_], localSin, localCos, upcastInputBuffer,
+                             calcTmpBuffer);
+            }
+            outQue_.EnQue(dst);
+            local_dst_t dstDeque = outQue_.DeQue<dst_t>();
+            AscendC::DataCopy(dstG[loopNum * loadStride], dstDeque, loadStride);
+            outQue_.FreeTensor(dstDeque);
+            inQue_.FreeTensor(srcDeque);
+        }
+        // process tail
+        {
+            local_scalar_t src = inQue_.AllocTensor<scalar_t>();
+            local_dst_t dst = outQue_.AllocTensor<dst_t>();
+
+            AscendC::DataCopy(src, srcG[loopCnt * loadStride], tailHeads * headSize_);
+            inQue_.EnQue(src);
+            local_scalar_t srcDeque = inQue_.DeQue<scalar_t>();
+
+            if constexpr (!std::is_same_v<scalar_t, acc_t>) {
+                int elem_num = tailHeads * headSize_ / sizeof(scalar_t);
+                AscendC::LocalTensor<acc_t> upBuffer = copyBuf_.GetWithOffset<acc_t>(elem_num, 0);
+                Cast(upBuffer, srcDeque, AscendC::RoundMode::CAST_TRUNC, elem_num);
+                Cast(dst, upBuffer, AscendC::RoundMode::CAST_TRUNC, elem_num);
+            } else {
+                local_mem_copy(dst, srcDeque, tailHeads * headSize_);
+            }
+
+            for (int i = 0; i < tailHeads; ++i) {
+                neox_compute(srcDeque[i * headSize_], dst[i * headSize_], localSin, localCos, upcastInputBuffer,
+                             calcTmpBuffer);
+            }
+            outQue_.EnQue(dst);
+            local_dst_t dstDeque = outQue_.DeQue<dst_t>();
+            AscendC::DataCopy(dstG[loopCnt * loadStride], dstDeque, tailHeads * headSize_);
+            outQue_.FreeTensor(dstDeque);
+            inQue_.FreeTensor(srcDeque);
+        }
+    }
+
+    __aicore__ inline void compute_function()
+    {
+        local_scalar_t cosSinLocal = inQueSinCos_.AllocTensor<scalar_t>();
+
+        AscendC::DataCopy(cosSinLocal, cosSin_, embedDim_ * 2);
+
+        inQueSinCos_.EnQue(cosSinLocal);
+        local_scalar_t localSinCosDeque = inQueSinCos_.DeQue<scalar_t>();
+        local_scalar_t localCos = localSinCosDeque;
+        local_scalar_t localSin = localSinCosDeque[embedDim_];
+
+        local_acc_t calcTmpBuffer;
+        local_acc_t upcastInputBuffer;
+        local_acc_t upcastTempBuffer;
+        local_acc_t cosSinUpcastBuffer;
+        local_acc_t scaleBuffer;
+        local_acc_t offsetBuffer;
+        calcTmpBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, calcTmpBufferOffset_);
+        upcastInputBuffer = calcBuf_.GetWithOffset<acc_t>(headSize_, upcastInputBufferOffset_);
+        upcastTempBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, upcastTempBufferOffset_);
+        cosSinUpcastBuffer = calcBuf_.GetWithOffset<acc_t>(embedDim_ * 2, cosSinUpcastBufferOffset_);
+
+        local_acc_t cosAccBuffer;
+        local_acc_t sinAccBuffer;
+
+        if constexpr (!std::is_same_v<scalar_t, acc_t>) {
+            Cast(cosSinUpcastBuffer, localSinCosDeque, AscendC::RoundMode::CAST_NONE, 2 * embedDim_);
+            cosAccBuffer = cosSinUpcastBuffer;
+            sinAccBuffer = cosSinUpcastBuffer[embedDim_];
+        } else {
+            cosAccBuffer = localCos;
+            sinAccBuffer = localSin;
+        }
+
+        constexpr const int loadSizeByElem = loadSize / sizeof(scalar_t);
+        int64_t headNumPerLoad = loadSizeByElem / headSize_;
+        int64_t loopCnt = numHeads_ / headNumPerLoad;
+        int64_t tailHeads = numHeads_ - loopCnt * headNumPerLoad;
+        int64_t loadStride = headNumPerLoad * headSize_;
+        int64_t loopCntKv = numKvHeads_ / headNumPerLoad;
+        int64_t tailHeadsKv = numKvHeads_ - loopCntKv * headNumPerLoad;
+        compute_qk(query_, queryDst_, cosAccBuffer, sinAccBuffer, upcastInputBuffer,
+                       calcTmpBuffer, loopCnt, tailHeads, loadStride, headNumPerLoad);
+
+        compute_qk(key_, keyDst_, cosAccBuffer, sinAccBuffer, upcastInputBuffer, calcTmpBuffer,
+                       loopCntKv, tailHeadsKv, loadStride, headNumPerLoad);
+
+        inQueSinCos_.FreeTensor(localSinCosDeque);
+    }
+
+private:
+    AscendC::TPipe *pipe_;
+    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQue_, inQueSinCos_;
+    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQue_;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> calcBuf_;
+    AscendC::TBuf<AscendC::TPosition::VECCALC> copyBuf_;
+    AscendC::GlobalTensor<dst_t> queryDst_;
+    AscendC::GlobalTensor<dst_t> keyDst_;
+    AscendC::GlobalTensor<scalar_t> query_;
+    AscendC::GlobalTensor<scalar_t> key_;
+    AscendC::GlobalTensor<scalar_t> cosSin_;
+    int rotDim_;
+    int embedDim_;
+    int64_t queryStride_;
+    int64_t keyStride_;
+    int64_t dstQueryStride_;
+    int64_t dstKeyStride_;
+    int numHeads_;
+    int numKvHeads_;
+    int headSize_;
+    int calcTmpBufferOffset_;
+    int upcastInputBufferOffset_;
+    int upcastTempBufferOffset_;
+    int cosSinUpcastBufferOffset_;
+    int tempBufferSize_;
+};
+
+// Note: we use a macro to instantiate all the target functions here, because the current build system does not support template calls in cpp.
+// We use C-style symbols here for kernel compilation; a C++-style kernel entry may lead to compilation failure.
+#define ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, NEOX)                                                                            \
+    extern "C" __global__ __aicore__ void rope_custom_##NEOX##_##TYPE(                                                          \
+        __gm__ int64_t* positions, __gm__ void* queryDst, __gm__ void* keyDst, __gm__ TYPE* query, __gm__ TYPE* key,            \
+        __gm__ TYPE* cosSinCache, const int rotDim, const int64_t queryStride, const int64_t keyStride,                         \
+        const int64_t dstQueryStride, const int64_t dstKeyStride, const int numHeads, const int numKvHeads,                     \
+        const int headSize, const int64_t numTokens, const int loopNum, const int coreNum)                                      \
+    {                                                                                                                           \
+        AscendC::TPipe pipe;                                                                                                    \
+        RotaryEmbedding<TYPE, NEOX> op{};                                                                                       \
+        op.init(positions, queryDst, keyDst, query, key, cosSinCache, rotDim, dstQueryStride, dstKeyStride,                     \
+                queryStride, keyStride, numHeads, numKvHeads, headSize, &pipe);                                                 \
+        for (int64_t i = AscendC::GetBlockIdx(); i < numTokens; i += coreNum) {                                                 \
+            op.update_mem_offset(positions, queryDst, keyDst, query, key, cosSinCache, rotDim, dstQueryStride, dstKeyStride,    \
+                      queryStride, keyStride, numHeads, numKvHeads, headSize, i);                                               \
+            op.compute_function();                                                                                              \
+        }                                                                                                                       \
+    }
+
+#define ROPE_CUSTOM_KERNEL_DECLARE(TYPE)    \
+    ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, true); \
+    ROPE_CUSTOM_KERNEL_TYPE_DECLARE(TYPE, false);
+
+// Declare all the kernel entries here
+ROPE_CUSTOM_KERNEL_DECLARE(half)
+#if (__CCE_AICORE__ >= 220)
+    ROPE_CUSTOM_KERNEL_DECLARE(bfloat16_t)
+#endif
+
+namespace vllm_ascend {
+
+#define ROTARY_EMBEDDING_KERNEL_CALL(TYPE)                                                                       \
+    if (isNeox)                                                                                                  \
+        rope_custom_true_##TYPE<<<blockDim, nullptr, stream>>>(                                                  \
+            positions, queryDst, keyDst, reinterpret_cast<TYPE *>(query), reinterpret_cast<TYPE *>(key),         \
+            reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
+            numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim);                                       \
+    else                                                                                                         \
+        rope_custom_false_##TYPE<<<blockDim, nullptr, stream>>>(                                                 \
+            positions, queryDst, keyDst, reinterpret_cast<TYPE *>(query), reinterpret_cast<TYPE *>(key),         \
+            reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
+            numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim);
+
+// Maximum number of blocks the runtime can launch for an AscendC kernel.
+// We use it to cap the block dimension (blockDim).
+static const int64_t maxParallelSize = 65535;
+
+extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, int64_t *positions, void *queryDst,
+                                    void *keyDst, void *query, void *key, void *cosSinCache, const int rotDim,
+                                    const int64_t queryStride, const int64_t keyStride, const int64_t dstQueryStride,
+                                    const int64_t dstKeyStride, const int numHeads, const int numKvHeads,
+                                    const int headSize, const int64_t numTokens, const uint32_t loopCnt,
+                                    uint32_t aivNum)
+{
+
+    int blockDim = maxParallelSize > numTokens ? numTokens : maxParallelSize;
+    if (type == AscendType::FP16) {
+        ROTARY_EMBEDDING_KERNEL_CALL(half);
+    }
+    #if (__CCE_AICORE__ >= 220)
+    else if (type == AscendType::BF16) {
+        ROTARY_EMBEDDING_KERNEL_CALL(bfloat16_t);
+    }
+    #endif
+    else {
+        return;
+    }
+}
+
+} // namespace vllm_ascend
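
Note on the launch logic above: `blockDim` is capped at `maxParallelSize`, and each block then walks the tokens with a stride of `coreNum`. The host-side sketch below illustrates that distribution. It assumes `coreNum` equals the launched `blockDim` (the kernel call passes `blockDim` as its last argument); `clamp_block_dim` and `tokens_for_block` are hypothetical helper names used only for illustration and are not part of this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mirrors `maxParallelSize > numTokens ? numTokens : maxParallelSize`.
constexpr int64_t kMaxParallelSize = 65535;

int64_t clamp_block_dim(int64_t num_tokens) {
    return std::min<int64_t>(num_tokens, kMaxParallelSize);
}

// Tokens visited by one block in the strided loop
// `for (i = blockIdx; i < numTokens; i += coreNum)`.
std::vector<int64_t> tokens_for_block(int64_t block_idx, int64_t num_tokens,
                                      int64_t core_num) {
    std::vector<int64_t> tokens;
    for (int64_t i = block_idx; i < num_tokens; i += core_num) {
        tokens.push_back(i);
    }
    return tokens;
}
```

With 10 tokens and 4 blocks, block 1 would handle tokens 1, 5 and 9; when there are fewer tokens than `maxParallelSize`, every block handles exactly one token.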
--- a/csrc/kernels/types.h
+++ b/csrc/kernels/types.h
@ -0,0 +1,25 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+namespace vllm_ascend {
+enum struct AscendType {
+    FP16 = 0,
+    BF16 = 1,
+    FP32 = 2,
+};
+}
--- a/csrc/kernels/utils.h
+++ b/csrc/kernels/utils.h
@ -0,0 +1,51 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+#include "kernel_type.h"
+namespace vllm_ascend {
+
+template <typename scalar_t> struct AccType;
+
+#if (__CCE_AICORE__ >= 220)
+template <> struct AccType<bfloat16_t> {
+  using type = float;
+};
+#endif
+
+template <> struct AccType<half> {
+    using type = half;
+};
+
+template <> struct AccType<float> {
+    using type = float;
+};
+
+template <> struct AccType<int8_t> {
+    using type = int;
+};
+
+template <typename scalar_t>
+__aicore__ inline void local_mem_copy(AscendC::LocalTensor<scalar_t> dst, AscendC::LocalTensor<scalar_t> src, int size)
+{
+    constexpr int loadSize = 256 / sizeof(scalar_t);
+    int loopCnt = size / loadSize;
+    int tailSize = size % loadSize;
+    if (loopCnt)
+        AscendC::Copy(dst, src, loadSize, loopCnt, {1, 1, 8, 8});
+    AscendC::Copy(dst[loopCnt * loadSize], src[loopCnt * loadSize], tailSize, 1, {1, 1, 8, 8});
+}
+} // namespace vllm_ascend
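
Note on `local_mem_copy` above: the copy is split into full 256-byte repeats plus one tail copy. The sketch below reproduces only that arithmetic on the host so the split can be checked without an NPU. `CopyPlan` and `plan_copy` are hypothetical names, and `uint16_t` stands in for the 2-byte `half` type.

```cpp
#include <cstdint>

struct CopyPlan {
    int load_size;  // elements moved per 256-byte repeat
    int loop_cnt;   // number of full repeats (first Copy call)
    int tail_size;  // leftover elements (second Copy call)
};

template <typename T>
constexpr CopyPlan plan_copy(int size) {
    // Same formula as `256 / sizeof(scalar_t)` in the kernel helper.
    constexpr int load_size = 256 / static_cast<int>(sizeof(T));
    return CopyPlan{load_size, size / load_size, size % load_size};
}
```

For 300 two-byte elements this gives 2 full repeats of 128 elements and a tail of 44. When `size` is an exact multiple of `loadSize`, the tail is 0, and the helper in the PR still issues the second `Copy`; whether that call is a no-op depends on AscendC behavior that this sketch does not verify.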
--- a/csrc/ops.h
+++ b/csrc/ops.h
@ -0,0 +1,63 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <optional>
+#include <torch/library.h>
+
+#include <vector>
+#include "kernels/types.h"
+#include "torch_npu/csrc/aten/common/from_blob.h"
+
+namespace vllm_ascend {
+  extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, int64_t *positions, void *queryDst,
+    void *keyDst, void *query, void *key, void *cosSinCache, const int rotDim,
+    const int64_t queryStride, const int64_t keyStride, const int64_t dstQueryStride,
+    const int64_t dstKeyStride, const int numHeads, const int numKvHeads,
+    const int headSize, const int64_t numTokens, const uint32_t loopCnt,
+    uint32_t aivNum);
+
+  extern void get_masked_input_and_mask_impl(
+    void* stream,
+    void* input,
+    void* masked_input,
+    void* mask_out,
+    const int64_t org_vocab_start_index,
+    const int64_t org_vocab_end_index,
+    const int64_t num_org_vocab_padding, 
+    const int64_t added_vocab_start_index,
+    const int64_t added_vocab_end_index,
+    const int64_t size,
+    const uint32_t loop_cnt,
+    const uint32_t aiv_num);
+    
+  inline torch::Tensor weak_ref_tensor(torch::Tensor& tensor) {
+    if (!tensor.is_privateuseone()) {
+      throw std::runtime_error("Tensor must be on NPU device");
+    }
+    // Get the raw data pointer
+    void* data_ptr = tensor.data_ptr();
+    // Get tensor sizes and strides
+    std::vector<int64_t> sizes = tensor.sizes().vec();
+    std::vector<int64_t> strides = tensor.strides().vec();
+    // Get tensor options (dtype, device)
+    auto options = tensor.options();
+    // Create a new tensor from the raw data pointer
+    auto new_tensor = at_npu::native::from_blob(data_ptr, sizes, strides, options);
+    return new_tensor;
+  }
+}
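
Note on `rotary_embedding_impl` declared above: the kernel body is not part of this header, so the following is only a host-side sketch of what a GPT-NeoX style rotation computes for a single head. It assumes the cache row for a position stores `rot_dim / 2` cosines followed by `rot_dim / 2` sines (the vLLM convention); `rope_neox_reference` is a hypothetical name and this is not the AscendC implementation.

```cpp
#include <cstddef>
#include <vector>

// Rotates the first rot_dim elements of one head in place, NeoX style:
// the head is split into two halves that are rotated against each other.
// cos_sin: rot_dim values for this token's position (assumed layout:
// first half cosines, second half sines).
void rope_neox_reference(std::vector<float>& head,
                         const std::vector<float>& cos_sin,
                         std::size_t rot_dim) {
    const std::size_t half = rot_dim / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float c = cos_sin[i];
        const float s = cos_sin[half + i];
        const float x1 = head[i];
        const float x2 = head[half + i];
        head[i] = x1 * c - x2 * s;
        head[half + i] = x2 * c + x1 * s;
    }
}
```

A reference like this can be useful in a unit test that compares the custom op's output against a CPU result for a few positions.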
--- a/csrc/torch_binding.cpp
+++ b/csrc/torch_binding.cpp
@ -0,0 +1,233 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <torch/extension.h>
+#include <torch/library.h>
+#include <torch/version.h>
+#include <torch_npu/csrc/core/npu/NPUStream.h>
+#include <torch_npu/csrc/framework/OpCommand.h>
+#include <torch_npu/csrc/npu/Module.h>
+#include <pybind11/pybind11.h>
+#include "acl/acl.h"
+#include "tiling/platform/platform_ascendc.h"
+#include "aclnn/opdev/platform.h"
+#include "ops.h"
+#include "utils.h"
+
+namespace vllm_ascend {
+
+std::tuple<at::Tensor, at::Tensor> rotary_embedding(at::Tensor &positions, at::Tensor &query, at::Tensor &key,
+    int64_t head_size, at::Tensor &cos_sin_cache,  bool is_neox)
+{
+    int32_t deviceId = 0;
+    int64_t num_tokens = positions.numel();
+    int positions_ndim = positions.dim();
+    TORCH_CHECK(
+        positions_ndim == 1 || positions_ndim == 2,
+        "positions must have shape [num_tokens] or [batch_size, seq_len]");
+    if (positions_ndim == 1) {
+      TORCH_CHECK(
+          query.size(0) == positions.size(0) && key.size(0) == positions.size(0),
+          "query, key and positions must have the same number of tokens");
+    }
+    if (positions_ndim == 2) {
+      TORCH_CHECK(
+          query.size(0) == positions.size(0) &&
+              key.size(0) == positions.size(0) &&
+              query.size(1) == positions.size(1) &&
+              key.size(1) == positions.size(1),
+          "query, key and positions must have the same batch_size and seq_len");
+    }
+    TORCH_CHECK(head_size % 32 == 0, "rotary_embedding: headSize should be divisible by 32");
+    int query_hidden_size = query.numel() / num_tokens;
+    int key_hidden_size = key.numel() / num_tokens;
+    TORCH_CHECK(query_hidden_size % head_size == 0);
+    TORCH_CHECK(key_hidden_size % head_size == 0);
+    TORCH_CHECK(is_neox == true, "rotary_embedding: neox=false is not supported as custom kernel in vllm-ascend");
+
+    // Make sure query and key have consistent number of heads
+    int num_heads = query_hidden_size / head_size;
+    int num_kv_heads = key_hidden_size / head_size;
+    TORCH_CHECK(num_heads % num_kv_heads == 0);
+    at::Tensor query_dst = at::empty({num_tokens, num_heads, head_size}, query.options());
+    at::Tensor key_dst = at::empty({num_tokens, num_kv_heads, head_size}, key.options());
+
+    int rot_dim = cos_sin_cache.size(1);
+    int seq_dim_idx = positions_ndim - 1;
+    int64_t *position_ids_ptr = positions.data_ptr<int64_t>();
+    void *query_dst_ptr = query_dst.data_ptr();
+    void *key_dst_ptr = key_dst.data_ptr();
+    void *query_ptr = query.data_ptr();
+    void *key_ptr = key.data_ptr();
+    void *cos_sin_cache_ptr = cos_sin_cache.data_ptr();
+    int64_t query_stride = query.stride(seq_dim_idx);
+    int64_t key_stride = key.stride(seq_dim_idx);
+    int64_t dst_query_stride = query_dst.stride(0);
+    int64_t dst_key_stride = key_dst.stride(0);
+    at::ScalarType scalar_type = query.scalar_type();
+    aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
+    at_npu::native::OpCommand cmd;
+    cmd.Name("rotary_embedding");
+    cmd.SetCustomHandler([scalar_type, is_neox, num_tokens, stream, position_ids_ptr, query_dst_ptr, key_dst_ptr,
+                          query_ptr, key_ptr, cos_sin_cache_ptr, rot_dim, query_stride, key_stride,
+                          dst_query_stride, dst_key_stride, num_heads, num_kv_heads, head_size]() -> int {
+        auto dtype_num = get_dtype_from_torch(scalar_type);
+        fe::PlatFormInfos platform_infos;
+        int device_id = 0;
+        fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
+        uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
+        uint32_t loop_cnt = (num_tokens + aivNum - 1) / aivNum;
+        rotary_embedding_impl(dtype_num, is_neox, stream, position_ids_ptr, query_dst_ptr, key_dst_ptr, query_ptr,
+                                key_ptr, cos_sin_cache_ptr, rot_dim, query_stride, key_stride, dst_query_stride,
+                                dst_key_stride, num_heads, num_kv_heads, head_size, num_tokens, loop_cnt, aivNum);
+        return 0;
+    });
+    cmd.Run();
+    return {query_dst, key_dst};
+}
+
+std::tuple<at::Tensor, at::Tensor> get_masked_input_and_mask(
+    at::Tensor &input,
+    const int64_t org_vocab_start_index,
+    const int64_t org_vocab_end_index,
+    const int64_t num_org_vocab_padding,
+    const int64_t added_vocab_start_index,
+    const int64_t added_vocab_end_index)
+    /*
+    https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L161-L198
+    Embedding parallelized in the vocabulary dimension.
+
+    Adapted from torch.nn.Embedding, note that we pad the vocabulary size to
+    make sure it is divisible by the number of model parallel GPUs.
+
+    In order to support various loading methods, we ensure that LoRA-added
+    embeddings are always at the end of TP-sharded tensors. In other words,
+    we shard base embeddings and LoRA embeddings separately (both padded),
+    and place them in the same tensor.
+    In this example, we will have the original vocab size = 1010,
+    added vocab size = 16 and padding to 64. Therefore, the total
+    vocab size with padding will be 1088 (because we first pad 1010 to
+    1024, add 16, and then pad to 1088).
+    Therefore, the tensor format looks like the following:
+    TP1, rank 0 (no sharding):
+                            |< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
+    corresponding token_id: |  0  |  1  | ... | 1009 |  -1  | ... |  -1  | 1010 | ... | 1015 |  -1  | ... |  -1  |
+                     index: |  0  |  1  | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |
+
+    TP2, rank 0:
+                            |< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
+    corresponding token_id: |  0  |  1  |  2  | ... | 497  | 498 | ...  | 511 | 1000 | ... | 1015 |  -1  | ... |  -1 |
+                     index: |  0  |  1  |  2  | ... | 497  | 498 | ...  | 511 | 512  | ... | 519  |  520 | ... | 543 |
+    TP2, rank 1:
+                            |< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
+    corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1  | ...  | -1  |  -1  | ... |  -1  | -1  | ... |   -1 |
+                     index: |  0  |  1  |  2  | ... | 497  | 498 | ...  | 511 | 512  | ... | 519  | 520 | ... |  543 | 
+    Parameters:
+        org_vocab_start_index //base embeddings start
+        org_vocab_end_index //base embeddings end
+        num_org_vocab_padding //base embeddings padding
+        added_vocab_start_index //LoRA embeddings start
+        added_vocab_end_index //LoRA embeddings end
+    */
+{
+    // Input validation
+    TORCH_CHECK(input.dim() >= 1, "input must have at least 1 dimension");
+    TORCH_CHECK(org_vocab_start_index >= 0, "org_vocab_start_index must be non-negative");
+    TORCH_CHECK(org_vocab_end_index >= org_vocab_start_index, "org_vocab_end_index must be greater than or equal to org_vocab_start_index");
+    TORCH_CHECK(num_org_vocab_padding >= 0, "num_org_vocab_padding must be non-negative");
+    TORCH_CHECK(added_vocab_start_index >= org_vocab_end_index, "added_vocab_start_index must be greater than or equal to org_vocab_end_index");
+    TORCH_CHECK(added_vocab_end_index >= added_vocab_start_index, "added_vocab_end_index must be greater than or equal to added_vocab_start_index");
+
+    // Get total number of elements
+    int64_t size = input.numel();
+
+    // Create output tensors
+    at::Tensor masked_input = at::empty_like(input);
+    at::Tensor mask = at::empty_like(input).to(at::kBool);
+    
+    // Get data pointers
+    void *input_ptr = input.data_ptr();
+    void *masked_input_ptr = masked_input.data_ptr();
+    void *mask_ptr = mask.data_ptr();
+    
+    // Get current stream
+    aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
+    
+    // Get scalar type
+    at::ScalarType scalar_type = input.scalar_type();
+    
+    // Create and configure OpCommand
+    at_npu::native::OpCommand cmd;
+    cmd.Name("get_masked_input_and_mask");
+    cmd.SetCustomHandler([scalar_type, size, stream, 
+                         input_ptr, masked_input_ptr, mask_ptr,
+                         org_vocab_start_index, org_vocab_end_index,
+                         num_org_vocab_padding, added_vocab_start_index,
+                         added_vocab_end_index]() -> int {
+        // Get platform info
+        fe::PlatFormInfos platform_infos;
+        int device_id = 0;
+        fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
+        uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
+        uint32_t loop_cnt = (size + aivNum - 1) / aivNum;
+        
+        // Call implementation
+        get_masked_input_and_mask_impl(
+            stream,
+            input_ptr,
+            masked_input_ptr, 
+            mask_ptr,
+            org_vocab_start_index,
+            org_vocab_end_index,
+            num_org_vocab_padding,
+            added_vocab_start_index,
+            added_vocab_end_index,
+            size,
+            loop_cnt,
+            aivNum);
+            
+        return 0;
+    });
+    cmd.Run();
+    return {masked_input, mask};
+}
+} // namespace vllm_ascend
+
+TORCH_LIBRARY_EXPAND(_C, ops)
+{
+    // vLLM-Ascend custom ops
+    ops.def("weak_ref_tensor(Tensor input) -> Tensor");
+    ops.impl("weak_ref_tensor", torch::kPrivateUse1, &vllm_ascend::weak_ref_tensor);
+
+    // Rotary embedding
+    // Apply GPT-NeoX style rotary embedding to query and key.
+    ops.def(
+        "rotary_embedding(Tensor positions, Tensor! query,"
+        "                 Tensor! key, int head_size,"
+        "                 Tensor cos_sin_cache, bool is_neox) -> (Tensor query, Tensor key)");
+    ops.impl("rotary_embedding", torch::kPrivateUse1, &vllm_ascend::rotary_embedding);
+
+    ops.def(
+        "get_masked_input_and_mask(Tensor input, "
+        "                         int org_vocab_start_index, "
+        "                         int org_vocab_end_index, "
+        "                         int num_org_vocab_padding, "
+        "                         int added_vocab_start_index, "
+        "                         int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)");
+    ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask);
+}
+
+REGISTER_EXTENSION(_C)
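
Note on `get_masked_input_and_mask` above: the device kernel is not shown in this file, so the sketch below is a host-side reference that assumes the kernel mirrors the vLLM Python function linked in the comment. Token ids in the base range are shifted to start at 0, ids in the added (LoRA) range are shifted to sit right after the padded base shard, and everything else is zeroed and flagged in the mask. `masked_input_reference` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {masked_input, mask}. mask[i] is true for ids that fall in neither
// the base range nor the added (LoRA) range of this shard.
std::pair<std::vector<int64_t>, std::vector<bool>> masked_input_reference(
    const std::vector<int64_t>& input,
    int64_t org_vocab_start_index, int64_t org_vocab_end_index,
    int64_t num_org_vocab_padding,
    int64_t added_vocab_start_index, int64_t added_vocab_end_index) {
    // Offset that maps an added-vocab id to its slot after the padded base shard.
    const int64_t added_offset = added_vocab_start_index -
        (org_vocab_end_index - org_vocab_start_index) - num_org_vocab_padding;
    std::vector<int64_t> masked(input.size(), 0);
    std::vector<bool> mask(input.size(), true);
    for (std::size_t i = 0; i < input.size(); ++i) {
        const int64_t id = input[i];
        if (id >= org_vocab_start_index && id < org_vocab_end_index) {
            masked[i] = id - org_vocab_start_index;
            mask[i] = false;
        } else if (id >= added_vocab_start_index && id < added_vocab_end_index) {
            masked[i] = id - added_offset;
            mask[i] = false;
        }
    }
    return {masked, mask};
}
```

Using the TP1 layout from the comment (base ids 0 to 1009, 14 padding slots, added ids starting at 1010), id 1010 maps to index 1024, which agrees with the index row of that diagram.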
--- a/csrc/utils.h
+++ b/csrc/utils.h
@ -0,0 +1,43 @@
+#pragma once
+
+#include "kernels/types.h"
+#include <c10/core/ScalarType.h>
+#include <Python.h>
+
+#define _CONCAT(A, B) A##B
+#define CONCAT(A, B) _CONCAT(A, B)
+
+#define _STRINGIFY(A) #A
+#define STRINGIFY(A) _STRINGIFY(A)
+
+// A version of the TORCH_LIBRARY macro that expands the NAME, i.e. so NAME
+// could be a macro instead of a literal token.
+#define TORCH_LIBRARY_EXPAND(NAME, MODULE) TORCH_LIBRARY(NAME, MODULE)
+
+// A version of the TORCH_LIBRARY_IMPL macro that expands the NAME, i.e. so NAME
+// could be a macro instead of a literal token.
+#define TORCH_LIBRARY_IMPL_EXPAND(NAME, DEVICE, MODULE) \
+  TORCH_LIBRARY_IMPL(NAME, DEVICE, MODULE)
+
+// REGISTER_EXTENSION allows the shared library to be loaded and initialized
+// via python's import statement.
+#define REGISTER_EXTENSION(NAME)                                               \
+  PyMODINIT_FUNC CONCAT(PyInit_, NAME)() {                                     \
+    static struct PyModuleDef module = {PyModuleDef_HEAD_INIT,                 \
+                                        STRINGIFY(NAME), nullptr, 0, nullptr}; \
+    return PyModule_Create(&module);                                           \
+  }
+
+
+namespace vllm_ascend {
+inline AscendType get_dtype_from_torch(at::ScalarType scalarType)
+{
+    if (scalarType == at::ScalarType::Float) {
+        return AscendType::FP32;
+    } else if (scalarType == at::ScalarType::BFloat16) {
+        return AscendType::BF16;
+    } else {
+        return AscendType::FP16;
+    }
+}
+} // namespace vllm_ascend
--- a/docs/README.md
+++ b/docs/README.md
@ -16,7 +16,7 @@ make html
 ## Open the docs with your browser

 ```bash
-python -m http.server -d build/html/
+python -m http.server -d _build/html/
 ```

 Launch your browser and open http://localhost:8000/.
--- a/docs/requirements-test.txt
+++ b/docs/requirements-test.txt
@ -1,2 +1,2 @@
 pytest-asyncio
-
+pytest-mock
--- a/docs/source/_templates/sections/header.html
+++ b/docs/source/_templates/sections/header.html
@ -0,0 +1,58 @@
+<!--
+  **********************************************************************
+  * Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+  * Copyright 2023 The vLLM team.
+  *
+  * Licensed under the Apache License, Version 2.0 (the "License");
+  * you may not use this file except in compliance with the License.
+  * You may obtain a copy of the License at
+  *
+  *     http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing, software
+  * distributed under the License is distributed on an "AS IS" BASIS,
+  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  * See the License for the specific language governing permissions and
+  * limitations under the License.
+  * This file is a part of the vllm-ascend project.
+  * Adapted from https://github.com/vllm-project/vllm/blob/main/docs/source/_templates/sections/header.html
+  **********************************************************************
+-->
+<style>
+    .notification-bar {
+      width: 100vw;
+      display: flex;
+      justify-content: center;
+      align-items: center;
+      font-size: 16px;
+    }
+    .notification-bar p {
+      margin: 0;
+    }
+    .notification-bar a {
+      font-weight: bold;
+      text-decoration: none;
+    }
+  
+    /* Light mode styles (default) */
+    .notification-bar {
+      background-color: #fff3cd;
+      color: #856404;
+    }
+    .notification-bar a {
+      color: #d97706;
+    }
+  
+    /* Dark mode styles */
+    html[data-theme=dark] .notification-bar {
+      background-color: #333;
+      color: #ddd;
+    }
+    html[data-theme=dark] .notification-bar a {
+      color: #ffa500; /* Brighter color for visibility */
+    }
+  </style>
+  
+  <div class="notification-bar">
+    <p>You are viewing the latest developer preview docs. <a href="https://vllm-ascend.readthedocs.io/en/v0.7.3-dev">Click here</a> to view docs for the latest stable release (v0.7.3.post1).</p>
+  </div>
--- a/docs/source/assets/multi_node_dp.png
+++ b/docs/source/assets/multi_node_dp.png
--- a/docs/source/community/contributors.md
+++ b/docs/source/community/contributors.md
@ -0,0 +1,102 @@
+# Maintainers and contributors
+
+## Maintainers
+
+| Name | GitHub ID | Date |
+|:-----------:|:-----:|:-----:|
+| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
+| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
+| Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
+| Shoujian Zheng| [@jianzs](https://github.com/jianzs) | 2025/06 |
+
+## Contributors
+
+Every vLLM Ascend release would not have been possible without the following contributors:
+
+Updated on 2025-06-10:
+
+| Number | Contributor | Date | Commit ID |
+|:------:|:-----------:|:-----:|:---------:|
+| 83 | [@ZhengWG](https://github.com/ZhengWG) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
+| 82 | [@wm901115nwpu](https://github.com/wm901115nwpu) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
+| 81 | [@Agonixiaoxiao](https://github.com/Agonixiaoxiao) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
+| 80 | [@zhanghw0354](https://github.com/zhanghw0354) | 2025/7/2 | [d3df9a5](https://github.com/vllm-project/vllm-ascend/commit/9fb3d558e5b57a3c97ee5e11b9f5dba6ad3df9a5) |
+| 79 | [@GDzhu01](https://github.com/GDzhu01) | 2025/6/28 | [de256ac](https://github.com/vllm-project/vllm-ascend/commit/b308a7a25897b88d4a23a9e3d583f4ec6de256ac) |
+| 78 | [@leo-pony](https://github.com/leo-pony) | 2025/6/26 | [3f2a5f2](https://github.com/vllm-project/vllm-ascend/commit/10253449120307e3b45f99d82218ba53e3f2a5f2) |
+| 77 | [@zeshengzong](https://github.com/zeshengzong) | 2025/6/26 | [3ee25aa](https://github.com/vllm-project/vllm-ascend/commit/192dbbcc6e244a8471d3c00033dc637233ee25aa) |
+| 76 | [@sharonyunyun](https://github.com/sharonyunyun) | 2025/6/25 | [2dd8666](https://github.com/vllm-project/vllm-ascend/commit/941269a6c5bbc79f6c1b6abd4680dc5802dd8666) |
+| 75 | [@Pr0Wh1teGivee](https://github.com/Pr0Wh1teGivee) | 2025/6/25 | [c65dd40](https://github.com/vllm-project/vllm-ascend/commit/2fda60464c287fe456b4a2f27e63996edc65dd40) |
+| 74 | [@xleoken](https://github.com/xleoken) | 2025/6/23 | [c604de0](https://github.com/vllm-project/vllm-ascend/commit/4447e53d7ad5edcda978ca6b0a3a26a73c604de0) |
+| 73 | [@lyj-jjj](https://github.com/lyj-jjj) | 2025/6/23 | [5cbd74e](https://github.com/vllm-project/vllm-ascend/commit/5177bef87a21331dcca11159d3d1438075cbd74e) |
+| 72 | [@farawayboat](https://github.com/farawayboat) | 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392) |
+| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94) |
+| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f) |
+| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12 | [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
+| 68 | [@chenwaner](https://github.com/chenwaner) | 2025/6/11 | [c696169](https://github.com/vllm-project/vllm-ascend/commit/e46dc142bf1180453c64226d76854fc1ec696169) |
+| 67 | [@yzim](https://github.com/yzim) | 2025/6/11 | [aaf701b](https://github.com/vllm-project/vllm-ascend/commit/4153a5091b698c2270d160409e7fee73baaf701b) |
+| 66 | [@Yuxiao-Xu](https://github.com/Yuxiao-Xu) | 2025/6/9 | [6b853f1](https://github.com/vllm-project/vllm-ascend/commit/6b853f15fe69ba335d2745ebcf14a164d0bcc505) |
+| 65 | [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU) | 2025/6/7 | [20dedba](https://github.com/vllm-project/vllm-ascend/commit/20dedba5d1fc84b7ae8b49f9ce3e3649389e2193) |
+| 64 | [@zxdukki](https://github.com/zxdukki) | 2025/6/7 | [87ebaef](https://github.com/vllm-project/vllm-ascend/commit/87ebaef4e4e519988f27a6aa378f614642202ecf) |
+| 63 | [@sdmyzlp](https://github.com/sdmyzlp) | 2025/6/7 | [3640c60](https://github.com/vllm-project/vllm-ascend/commit/3640c60b0eb4d4cb104e20bfa406d3f1d17920a7) |
+| 62 | [@weijinqian0](https://github.com/weijinqian0) | 2025/6/7 | [e9ada68](https://github.com/vllm-project/vllm-ascend/commit/e9ada685ece798f9fe0d4a287e3f5246a8a7207b) |
+| 61 | [@hahazhky](https://github.com/hahazhky) | 2025/6/6 | [0b12c2a](https://github.com/vllm-project/vllm-ascend/commit/0b12c2acf7d9fd192beebebf662298067d9a5435) |
+| 60 | [@depeng1994](https://github.com/depeng1994) | 2025/6/6 | [6b094a2](https://github.com/vllm-project/vllm-ascend/commit/6b094a2bd49a8a41eb3647568b2d9e5b337db81f) |
+| 59 | [@David9857](https://github.com/David9857) | 2025/6/5 | [78431b3](https://github.com/vllm-project/vllm-ascend/commit/78431b34694dfa3c8f54ed7cc626660318557927) |
+| 58 | [@momo609](https://github.com/momo609) | 2025/6/5 | [908a851](https://github.com/vllm-project/vllm-ascend/commit/908a851a776cfd9051cc062119e6ec481561c6f7) |
+| 57 | [@zhangxinyuehfad](https://github.com/zhangxinyuehfad) | 2025/6/5 | [7737aaa](https://github.com/vllm-project/vllm-ascend/commit/7737aaa40f699b233a35fb61e908b687adc1e2e5) |
+| 56 | [@NINGBENZHE](https://github.com/NINGBENZHE) | 2025/6/3 | [6ec64a3](https://github.com/vllm-project/vllm-ascend/commit/6ec64a3f9686df65b5a23a41aa301e669db19099) |
+| 55 | [@XWFAlone](https://github.com/XWFAlone) | 2025/5/30 | [3442fbd](https://github.com/vllm-project/vllm-ascend/commit/3442fbdb235b4c6d72c2bc64a49707a7bd89958e) |
+| 54 | [@YisongJiang](https://github.com/YisongJiang) | 2025/5/29 | [90afaf6](https://github.com/vllm-project/vllm-ascend/commit/90afaf6306f680307462becf3c78585737579851) |
+| 53 | [@ponix-j](https://github.com/ponix-j) | 2025/5/23 | [df58fb8](https://github.com/vllm-project/vllm-ascend/commit/df58fb80eee24139fc61c495be3ce79cf81b3f73) |
+| 52 | [@ttanzhiqiang](https://github.com/ttanzhiqiang) | 2025/5/23 | [dc6172e](https://github.com/vllm-project/vllm-ascend/commit/dc6172efd3860ce95b40a7b3e93611f875f06d40) |
+| 51 | [@yangpuPKU](https://github.com/yangpuPKU) | 2025/5/23 | [46df67a](https://github.com/vllm-project/vllm-ascend/commit/46df67a5e9ab73fade08cbb2d8c0155cee7316d1) |
+| 50 | [@wonderful199082](https://github.com/wonderful199082) | 2025/5/20 | [5cf9ff1](https://github.com/vllm-project/vllm-ascend/commit/5cf9ff18e91b0b7031c258d71a257b8e24689763) |
+| 49 | [@22dimensions](https://github.com/22dimensions) | 2025/5/17 | [a8730e7](https://github.com/vllm-project/vllm-ascend/commit/a8730e7a3c4ac6c4b39a5946c943252fdea6cce5) |
+| 48 | [@cxcxflying](https://github.com/cxcxflying) | 2025/5/13 | [e564470](https://github.com/vllm-project/vllm-ascend/commit/e56447033889ca95df512208cab22ef832bfdf07) |
+| 47 | [@NeverRaR](https://github.com/NeverRaR) | 2025/5/12 | [efabd72](https://github.com/vllm-project/vllm-ascend/commit/efabd722eb757e49aa309c173bbec91ca8c4ced1) |
+| 46 | [@chris668899](https://github.com/chris668899) | 2025/5/8 | [6c02088](https://github.com/vllm-project/vllm-ascend/commit/6c020883a8332b5c519f4f6502733edd9b391c2b) |
+| 45 | [@sunbaosong](https://github.com/sunbaosong) | 2025/5/6 | [d6bfae8](https://github.com/vllm-project/vllm-ascend/commit/d6bfae8eeebedf677b643b712d367a3a69c9cce4) |
+| 44 | [@ApsarasX](https://github.com/ApsarasX) | 2025/4/29 | [87975fa](https://github.com/vllm-project/vllm-ascend/commit/87975fa058fe3f90d204ded42a08989a8dcb413e) |
+| 43 | [@zouyida2052](https://github.com/zouyida2052) | 2025/4/28 | [b9528e6](https://github.com/vllm-project/vllm-ascend/commit/b9528e6ecdc417cf444e55a0ce4a2bafdef0ea3b) |
+| 42 | [@ZhengJun9](https://github.com/ZhengJun9) | 2025/4/28 | [1791113](https://github.com/vllm-project/vllm-ascend/commit/17911138c90d78a76bd691e9dcb56763db35b19f) |
+| 41 | [@linfeng-yuan](https://github.com/linfeng-yuan) | 2025/4/28 | [2204e4d](https://github.com/vllm-project/vllm-ascend/commit/2204e4d08f8e10cf9c30154a14eaa5ca956c2acd) |
+| 40 | [@jianzs](https://github.com/jianzs) | 2025/4/27 | [fa4a5d9](https://github.com/vllm-project/vllm-ascend/commit/fa4a5d980e8845a88b9162cf169f0a5ab230f8a5) |
+| 39 | [@fakeYan](https://github.com/fakeYan) | 2025/4/23 | [05bdcbe](https://github.com/vllm-project/vllm-ascend/commit/05bdcbeae47c7fcb9b1c30cad059abf1d40b5421) |
+| 38 | [@RongRongStudio](https://github.com/RongRongStudio) | 2025/4/22 | [848e041](https://github.com/vllm-project/vllm-ascend/commit/848e041a54732c923660dd02daf8e9bf439736a2) |
+| 37 | [@paulyu12](https://github.com/paulyu12) | 2025/4/17 | [697908f](https://github.com/vllm-project/vllm-ascend/commit/697908f5cd7c65a3a917ec1a962b0886efc98c7e) |
+| 36 | [@heartStrive1998](https://github.com/heartStrive1998) | 2025/4/16 | [2f15503](https://github.com/vllm-project/vllm-ascend/commit/2f155039dc3997640854daef469bbf0cb77dc6ed) |
+| 35 | [@eeethenQ](https://github.com/eeethenQ) | 2025/4/15 | [44a8301](https://github.com/vllm-project/vllm-ascend/commit/44a8301424ded94dae83e13b837f5bfc0a1bfc15) |
+| 34 | [@wxsIcey](https://github.com/wxsIcey) | 2025/4/10 | [d05ea17](https://github.com/vllm-project/vllm-ascend/commit/d05ea17427b82a506b97409a7de8359f18f565f7) |
+| 33 | [@yx0716](https://github.com/yx0716) | 2025/4/8 | [5d62393](https://github.com/vllm-project/vllm-ascend/commit/5d6239306be9b0f5ac6dbaa137048c372a92ff20) |
+| 32 | [@celestialli](https://github.com/celestialli) | 2025/4/7 | [2b765dc](https://github.com/vllm-project/vllm-ascend/commit/2b765dcc4974b1bafc26ff5da817ce7e652f0eb0) |
+| 31 | [@hfadzxy](https://github.com/hfadzxy) | 2025/3/30 | [7beb433](https://github.com/vllm-project/vllm-ascend/commit/7beb4339dc8047af9ef64db1d0a8c59ddbb3709f) |
+| 30 | [@wuhuikx](https://github.com/wuhuikx) | 2025/3/28 | [57a84bb](https://github.com/vllm-project/vllm-ascend/commit/57a84bb7befeaa0dc62aa35fa406e4d6affbfcca) |
+| 29 | [@zzzzwwjj](https://github.com/zzzzwwjj) | 2025/3/28 | [12390af](https://github.com/vllm-project/vllm-ascend/commit/12390af075962456ecc8233d8dcce7064b75f390) |
+| 28 | [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/3/28 | [27e86b9](https://github.com/vllm-project/vllm-ascend/commit/27e86b993a6a810d818143ec9dbfc439a419fa77) |
+| 27 | [@ZhengZhenyu](https://github.com/ZhengZhenyu) | 2025/3/26 | [0b5a964](https://github.com/vllm-project/vllm-ascend/commit/0b5a9643fd6c3240d7ede669e37209d7ff433841) |
+| 26 | [@baifanxxx](https://github.com/baifanxxx) | 2025/3/26 | [1225052](https://github.com/vllm-project/vllm-ascend/commit/122505208ff6284f409846ca7294f4a4b9883285) |
+| 25 | [@rjg-lyh](https://github.com/rjg-lyh) | 2025/3/13 | [6512470](https://github.com/vllm-project/vllm-ascend/commit/65124705fb39d4cc2c94c80254421e067a82fe50) |
+| 24 | [@xiemingda-1002](https://github.com/xiemingda-1002) | 2025/3/12 | [59ea23d](https://github.com/vllm-project/vllm-ascend/commit/59ea23d0d394879d7f33de6fd22242539b9c3cc5) |
+| 23 | [@yiz-liu](https://github.com/yiz-liu) | 2025/3/11 | [0db6670](https://github.com/vllm-project/vllm-ascend/commit/0db6670bfab8cb1d84c9e7270df0a1d42d6ce7ca) |
+| 22 | [@new-TonyWang](https://github.com/new-TonyWang) | 2025/3/11 | [dfb4e23](https://github.com/vllm-project/vllm-ascend/commit/dfb4e23e9d820ac992a071c123bbe983c7b01b2e) |
+| 21 | [@mengwei805](https://github.com/mengwei805) | 2025/3/6 | [8fcf3d1](https://github.com/vllm-project/vllm-ascend/commit/8fcf3d1704084626db35c5dc82ade446508598d4) |
+| 20 | [@baymax591](https://github.com/baymax591) | 2025/2/28 | [e8131b9](https://github.com/vllm-project/vllm-ascend/commit/e8131b99cf199f50a304e6e6fb125a1b95bcc92b) |
+| 19 | [@dependabot](https://github.com/dependabot) | 2025/2/27 | [a5564ed](https://github.com/vllm-project/vllm-ascend/commit/a5564ed5d8fd9818936a22d9ea35951a27513b4c) |
+| 18 | [@shink](https://github.com/shink) | 2025/2/27 | [6aed833](https://github.com/vllm-project/vllm-ascend/commit/6aed83335cbe92fd0b8ef07c28966a753d012ccb) |
+| 17 | [@wwfu109](https://github.com/wwfu109) | 2025/2/27 | [b074047](https://github.com/vllm-project/vllm-ascend/commit/b07404766bdaf6e3cebc5cb0aba89a247501302e) |
+| 16 | [@kunpengW-code](https://github.com/kunpengW-code) | 2025/2/26 | [ca807ce](https://github.com/vllm-project/vllm-ascend/commit/ca807ce49ed64aa89242f5ae29b9862a77648b45) |
+| 15 | [@Yaphets24](https://github.com/Yaphets24) | 2025/2/22 | [d0b3cb4](https://github.com/vllm-project/vllm-ascend/commit/d0b3cb4fa79d5fc7f8245a3c68885ce1fa030ba4) |
+| 14 | [@noemotiovon](https://github.com/noemotiovon) | 2025/2/21 | [202b39a](https://github.com/vllm-project/vllm-ascend/commit/202b39a38c2869b0ecc3df486550fb555a2eb0c0) |
+| 13 | [@SidaoY](https://github.com/SidaoY) | 2025/2/18 | [718c763](https://github.com/vllm-project/vllm-ascend/commit/718c7638555d12cd43ea2a9e497e185778b68595) |
+| 12 | [@ShiyaNiu](https://github.com/ShiyaNiu) | 2025/2/17 | [36ea38f](https://github.com/vllm-project/vllm-ascend/commit/36ea38fde56437ff1745bd95cd8d9e02a6578d38) |
+| 11 | [@ji-huazhong](https://github.com/ji-huazhong) | 2025/2/12 | [c8b57d1](https://github.com/vllm-project/vllm-ascend/commit/c8b57d10b24efcd9b4fadeb66cfbf66aa3dd5f82) |
+| 10 | [@Angazenn](https://github.com/Angazenn) | 2025/2/11 | [7637759](https://github.com/vllm-project/vllm-ascend/commit/7637759056028839c74960d9cfd3ce6275ee5d35) |
+| 9 | [@whx-sjtu](https://github.com/whx-sjtu) | 2025/2/7 | [8fc5dc9](https://github.com/vllm-project/vllm-ascend/commit/8fc5dc966aaf4e174d1ec0d1902c40289411ec0e) |
+| 8 | [@zouyida2002](https://github.com/zouyida2002) | 2025/2/7 | [4495fc6](https://github.com/vllm-project/vllm-ascend/commit/4495fc68389e3fb1ef14534c202948931e38446b) |
+| 7 | [@hw_whx](https://github.com/hw_whx) | 2025/2/7 | [7d16772](https://github.com/vllm-project/vllm-ascend/commit/7d1677263bc6628ade33bb780455e0f6e5b9b27a) |
+| 6 | [@MengqingCao](https://github.com/MengqingCao) | 2025/2/6 | [7d9ae22](https://github.com/vllm-project/vllm-ascend/commit/7d9ae22ecb6dc3ea4e720e5109cf46e1ae7da730) |
+| 5 | [@Potabk](https://github.com/Potabk) | 2025/2/6 | [8cb5615](https://github.com/vllm-project/vllm-ascend/commit/8cb5615fb010b34c2f4f89e03e6257bfee851f86) |
+| 4 | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/2/6 | [a48b9ad](https://github.com/vllm-project/vllm-ascend/commit/a48b9addefd292af523644411d4ff4142dd4bc66) |
+| 3 | [@shen-shanshan](https://github.com/shen-shanshan) | 2025/2/6 | [bfccf73](https://github.com/vllm-project/vllm-ascend/commit/bfccf739e2fe121b54d9b198c2ec205a9379190e) |
+| 2 | [@Yikun](https://github.com/Yikun) | 2025/2/5 | [d5e7756](https://github.com/vllm-project/vllm-ascend/commit/d5e7756028bd5884ade96b654555c375770a2f64) |
+| 1 | [@simon-mo](https://github.com/simon-mo) | 2025/1/29 | [eb28342](https://github.com/vllm-project/vllm-ascend/commit/eb283428ddc17207b6866118f9bc15454b5b8801) |
--- a/docs/source/community/governance.md
+++ b/docs/source/community/governance.md
@ -0,0 +1,48 @@
+# Governance
+
+## Mission
+As a vital component of vLLM, the vLLM Ascend project is dedicated to providing easy, fast, and cheap LLM serving for everyone on Ascend NPU, and to actively contributing to the enrichment of vLLM.
+
+## Principles
+vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
+
+## Governance - Mechanics
+vLLM Ascend is an open-source project under the vLLM community, where the authority to appoint roles is ultimately determined by the vLLM community. It adopts a hierarchical technical governance structure.
+
+- Contributor:
+
+    **Responsibility:** Help onboard new contributors, handle and respond to community questions, and review RFCs and code.
+
+    **Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in the project, including but not limited to issues, reviews, commits, and community involvement.
+
+    Contributors will be granted `Triage` permissions on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo (`Can read and clone this repository. Can also manage issues and pull requests`) to help community developers collaborate more efficiently.
+
+- Maintainer:
+
+    **Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).
+
+    **Requirements:** Deep understanding of the vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.
+    - **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.
+    - **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
+    - **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
+
+    Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.
+
+    Maintainers will be granted write permissions on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo (`Can read, clone, and push to this repository. Can also manage issues and pull requests`).
+
+## Nominating and Removing Maintainers
+
+### The Principles
+
+- Membership in vLLM Ascend is given to individuals on a merit basis after they have demonstrated strong expertise in vLLM / vLLM Ascend through contributions, reviews and discussions.
+
+- For membership in the maintainer group, the individual has to demonstrate strong and continued alignment with the overall vLLM / vLLM Ascend principles.
+
+- Light criteria apply for moving maintainers to 'emeritus' status if they do not actively participate over long periods of time.
+
+- The membership is for an individual, not a company.
+
+### Nomination and Removal
+
+- Nomination: Anyone can nominate someone to become a maintainer (including self-nomination). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information about the nominee's strengths as a maintainer candidate, including but not limited to review quality, quality contribution, and community involvement.
+- Removal: Anyone can nominate a person to be removed from the maintainer position (including self-nomination). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information about the nominee, including but not limited to lack of activity, conflict with the overall direction, and other information that makes them unfit to be a maintainer.
--- a/docs/source/community/user_stories/index.md
+++ b/docs/source/community/user_stories/index.md
@ -0,0 +1,19 @@
+# User Stories
+
+Read case studies on how users and developers solve real, everyday problems with vLLM Ascend.
+
+- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models. It has supported vLLM Ascend for faster inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gaining a 2x inference performance improvement.
+
+- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO. It has used vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPU.
+
+- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plug-in library developed by Huawei on Ascend hardware, which includes self-developed large language model optimization algorithms and optimizations related to the inference engine framework. It has supported vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
+
+- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It has supported vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2). For more GPUStack performance evaluation info, see this [link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
+
+- [verl](https://github.com/volcengine/verl) is a flexible, efficient and production-ready RL training library for large language models (LLMs). It has used vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). For more info, see the [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
+
+:::{toctree}
+:caption: More details
+:maxdepth: 1
+llamafactory
+:::
--- a/docs/source/community/user_stories/llamafactory.md
+++ b/docs/source/community/user_stories/llamafactory.md
@ -0,0 +1,19 @@
+# LLaMA-Factory
+
+**About / Introduction**
+
+[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
+
+LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.
+
+**The Business Challenge**
+
+LLaMA-Factory used transformers to perform inference on Ascend NPU, but the speed was slow.
+
+**Solving Challenges and Benefits with vLLM Ascend**
+
+With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the performance of LLaMA-Factory in the model inference stage has been significantly improved. According to the test results, the inference speed of LLaMA-Factory has been increased to 2x compared to the transformers version.
+
+**Learn more**
+
+See more about LLaMA-Factory and how it uses vLLM Ascend for inference on the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).
--- a/docs/source/community/versioning_policy.md
+++ b/docs/source/community/versioning_policy.md
@ -0,0 +1,110 @@
+# Versioning policy
+
+Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows [PEP 440](https://peps.python.org/pep-0440/) and publishes releases that match vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
+
+## vLLM Ascend Plugin versions
+
+Each vLLM Ascend release is versioned as `v[major].[minor].[micro][rcN][.postN]` (such as
+`v0.7.3rc1`, `v0.7.3`, `v0.7.3.post1`).
+
+- **Final releases**: typically released every **3 months**, taking both the vLLM upstream release plan and the Ascend software product release plan into consideration.
+- **Pre releases**: typically released **on demand**. The version ends with `rcN`, denoting the Nth release candidate, to support early testing by users prior to a final release.
+- **Post releases**: typically released **on demand** to address minor errors in a final release. Unlike the [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) suggestion, a post release contains actual bug fixes, because the final release version must strictly match the vLLM final release version (`v[major].[minor].[micro]`). The post version has to be published as a patch version of the final release.
+
+For example:
+- `v0.7.x`: the first final release to match the vLLM `v0.7.x` version.
+- `v0.7.3rc1`: the first pre-release version of vLLM Ascend.
+- `v0.7.3.post1`: the post release, published if the `v0.7.3` release has some minor errors.
+
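The naming scheme above can be checked mechanically. The following is a minimal sketch, assuming only the `v[major].[minor].[micro][rcN][.postN]` pattern stated in this section; the helper names `classify_release` and `base_version` are hypothetical and are not part of vLLM Ascend.

```python
import re

# Pattern for the documented scheme: v[major].[minor].[micro][rcN][.postN]
VERSION_RE = re.compile(
    r"^v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<micro>\d+)"
    r"(?:rc(?P<rc>\d+))?(?:\.post(?P<post>\d+))?$"
)


def _parse(tag: str) -> "re.Match[str]":
    """Match a tag against the scheme or raise ValueError."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a v[major].[minor].[micro][rcN][.postN] tag: {tag!r}")
    return match


def classify_release(tag: str) -> str:
    """Return 'pre', 'post', or 'final' for a tag that follows the scheme."""
    match = _parse(tag)
    if match.group("rc") is not None:
        return "pre"
    if match.group("post") is not None:
        return "post"
    return "final"


def base_version(tag: str) -> str:
    """Return the v[major].[minor].[micro] part, dropping any rcN/.postN suffix."""
    return "v{major}.{minor}.{micro}".format(**_parse(tag).groupdict())
```

For instance, `classify_release("v0.7.3rc1")` yields `"pre"` and `base_version("v0.7.3.post1")` yields `"v0.7.3"`. A placeholder such as `v0.7.x` is rejected with `ValueError`, because it is a naming shorthand rather than a concrete tag.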
+## Release Compatibility Matrix
+
+Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
+
+| vLLM Ascend | vLLM         | Python           | Stable CANN | PyTorch/torch_npu  | MindIE Turbo |
+|-------------|--------------|------------------|-------------|--------------------|--------------|
+| v0.9.2rc1   | v0.9.2       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1.post1.dev20250619      |              |
+| v0.9.1rc1   | v0.9.1       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1.post1.dev20250528      |              |
+| v0.9.0rc2   | v0.9.0       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1      |              |
+| v0.9.0rc1   | v0.9.0       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1      |              |
+| v0.8.5rc1   | v0.8.5.post1 | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1      |              |
+| v0.8.4rc2   | v0.8.4       | >= 3.9, < 3.12   | 8.0.0       | 2.5.1 / 2.5.1      |              |
+| v0.7.3.post1| v0.7.3       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1      |   2.0rc1     |
+| v0.7.3      | v0.7.3       | >= 3.9, < 3.12   | 8.1.RC1     | 2.5.1 / 2.5.1      |   2.0rc1     |
+
+## Release cadence
+
+### Release window
+
+| Date       | Event                                     |
+|------------|-------------------------------------------|
+| 2025.07.11 | Release candidates, v0.9.2rc1             |
+| 2025.06.22 | Release candidates, v0.9.1rc1             |
+| 2025.06.10 | Release candidates, v0.9.0rc2             |
+| 2025.06.09 | Release candidates, v0.9.0rc1             |
+| 2025.05.29 | v0.7.x post release, v0.7.3.post1         |
+| 2025.05.08 | v0.7.x Final release, v0.7.3              |
+| 2025.05.06 | Release candidates, v0.8.5rc1             |
+| 2025.04.28 | Release candidates, v0.8.4rc2             |
+| 2025.04.18 | Release candidates, v0.8.4rc1             |
+| 2025.03.28 | Release candidates, v0.7.3rc2             |
+| 2025.03.14 | Release candidates, v0.7.3rc1             |
+| 2025.02.19 | Release candidates, v0.7.1rc1             |
+
+## Branch policy
+
+vLLM Ascend has a main branch and dev branches.
+
+- **main**: the main branch, which corresponds to the vLLM main branch and the latest 1 or 2 release versions. It is continuously monitored for quality through Ascend CI.
+- **vX.Y.Z-dev**: a development branch, created for selected new vLLM releases. For example, `v0.7.3-dev` is the dev branch for the vLLM `v0.7.3` version.
+
+Usually, a commit should first be merged into the main branch only, and then backported to the dev branch, to reduce maintenance costs as much as possible.
+
+### Maintenance branch and EOL
+The branch status will be in one of the following states:
+
+| Branch            | Time frame                       | Summary                                                              |
+|-------------------|----------------------------------|----------------------------------------------------------------------|
+| Maintained        | Approximately 2-3 minor versions | All bugfixes are appropriate. Releases produced, CI commitment.      |
+| Unmaintained      | Community interest driven        | All bugfixes are appropriate. No Releases produced, No CI commitment |
+| End of Life (EOL) | N/A                              | Branch no longer accepting changes                                   |
+
+### Branch state
+
+Note that vLLM Ascend is only released for certain vLLM release versions rather than all versions. Hence, you might see that only some versions have dev branches (such as only `0.7.1-dev` / `0.7.3-dev` but no `0.7.2-dev`). This is expected.
+
+Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend version branch that supports its latest version (for example, we plan to support version 0.7.3), as shown below:
+
+| Branch     | Status       | Note                                 |
+|------------|--------------|--------------------------------------|
+| main       | Maintained   | CI commitment for vLLM main branch and vLLM 0.9.2 branch   |
+| v0.9.1-dev | Maintained   | CI commitment for vLLM 0.9.1 version |
+| v0.7.3-dev | Maintained   | CI commitment for vLLM 0.7.3 version |
+| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev               |
+
+### Backward compatibility
+
+For the main branch, vLLM Ascend should work with the vLLM main branch and the latest 1 or 2 release versions. To ensure backward compatibility, we do the following:
+- Both the main branch and the target vLLM release are tested by Ascend E2E CI. For example, the vLLM main branch and vLLM 0.8.4 are currently tested.
+- For code changes, we make sure that the changes are compatible with the latest 1 or 2 vLLM release versions as well. To this end, vLLM Ascend introduced a version check mechanism inside the code. It checks the version of the installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it sometimes means that they have installed a dev/editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of the vLLM package to use.
+- For documentation changes, we make sure that the changes are compatible with the latest 1 or 2 vLLM release versions as well. A note should be added if there are any breaking changes.
+
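The version check with a `VLLM_VERSION` override can be pictured roughly as below. This is only an illustrative sketch, not the project's actual implementation: the local `InvalidVersion` class, the functions `resolve_vllm_version` and `pick_code_path`, the `(0, 9, 2)` threshold, and the `"new_api"` / `"legacy_api"` labels are all assumptions made for the example.

```python
import os
import re
from typing import Mapping, Optional, Tuple

# Strict release pattern; a dev/editable version string such as
# "0.9.3.dev12+gabc123" does not match it.
_RELEASE_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:rc\d+)?(?:\.post\d+)?$")


class InvalidVersion(ValueError):
    """Hypothetical stand-in for the error raised on an unparseable version."""


def resolve_vllm_version(
    installed: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[int, ...]:
    """Prefer the VLLM_VERSION override, else parse the installed version."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_VERSION") or installed
    match = _RELEASE_RE.match(raw)
    if match is None:
        raise InvalidVersion(
            f"cannot parse vLLM version {raw!r}; set VLLM_VERSION to override"
        )
    return tuple(int(part) for part in match.groups())


def pick_code_path(installed: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Choose a code path by version (threshold and labels are illustrative)."""
    version = resolve_vllm_version(installed, env)
    return "new_api" if version >= (0, 9, 2) else "legacy_api"
```

With an editable install reporting something like `0.9.3.dev12+gabc123`, the sketch raises `InvalidVersion` unless `VLLM_VERSION` is set (for example to `0.9.2`), which mirrors the workaround described above.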
+## Document Branch Policy
+To reduce maintenance costs, **all branch documentation content should remain consistent, and version differences can be controlled via variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/conf.py)**. While this is not a simple task, it is a principle we should strive to follow.
+
+| Version | Purpose | Code Branch |
+|-----|-----|---------|
+| latest | Doc for the latest dev branch | vX.Y.Z-dev (Will be `main` after the first final release) |
+| version | Doc for historical released versions | Git tags, like vX.Y.Z[rcN] |
+| stable (not yet released) | Doc for latest final release branch | Will be `vX.Y.Z-dev` after the first official release |
+
+As shown above:
+
+- `latest` documentation: Matches the current maintenance branch `vX.Y.Z-dev` (Will be `main` after the first final release). Continuously updated to ensure usability for the latest release.
+- `version` documentation: Corresponds to specific released versions (e.g., `v0.7.3`, `v0.7.3rc1`). No further updates after release.
+- `stable` documentation (**not yet released**): Official release documentation. Updates are allowed in real-time after release, typically based on vX.Y.Z-dev. Once stable documentation is available, non-stable versions should display a header warning: `You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.`.
+
+## Software Dependency Management
+- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu)
+  every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
+  The PyPI stable version **CAN** be used in a vLLM Ascend final version, the monthly dev version **CAN ONLY** be used in
+  a vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch.
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@ -1,7 +1,5 @@
 #
 # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
-# This file is a part of the vllm-ascend project.
-# Adapted from vllm-project/vllm/docs/source/conf.py
 # Copyright 2023 The vLLM team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@ -15,6 +13,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+# This file is a part of the vllm-ascend project.
+# Adapted from vllm-project/vllm/docs/source/conf.py
 #

 # -- Path setup --------------------------------------------------------------
@ -23,7 +23,9 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
-# import os
+import json
+import os
+
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))

@ -63,15 +65,19 @@ myst_substitutions = {
    # the branch of vllm, used in vllm clone
    # - main branch: 'main'
    # - vX.Y.Z branch: 'vX.Y.Z'
-    'vllm_version': 'main',
+    'vllm_version': 'v0.9.2',
    # the branch of vllm-ascend, used in vllm-ascend clone and image tag
    # - main branch: 'main'
    # - vX.Y.Z branch: latest vllm-ascend release tag
-    'vllm_ascend_version': 'main',
+    'vllm_ascend_version': 'v0.9.2rc1',
    # the newest release version of vllm-ascend and matched vLLM, used in pip install.
    # This value should be updated when cut down release.
-    'pip_vllm_ascend_version': "v0.7.1rc1",
-    'pip_vllm_version': "v0.7.1",
+    'pip_vllm_ascend_version': "0.9.2rc1",
+    'pip_vllm_version': "0.9.2",
+    # CANN image tag
+    'cann_image_tag': "8.1.rc1-910b-ubuntu22.04-py3.10",
+    # vllm version in ci
+    'ci_vllm_version': 'v0.9.2',
 }

 # Add any paths that contain templates here, relative to this directory.
@ -117,6 +123,20 @@ html_theme_options = {
 # so a file named "default.css" will overwrite the builtin "default.css".
 # html_static_path = ['_static']

+READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
+if READTHEDOCS_VERSION_TYPE == "tag":
+    # remove the warning banner if the version is a tagged release
+    header_file = os.path.join(os.path.dirname(__file__),
+                               "_templates/sections/header.html")
+    # The file might be removed already if the build is triggered multiple times
+    # (readthedocs build both HTML and PDF versions separately)
+    if os.path.exists(header_file):
+        os.remove(header_file)
+

 def setup(app):
    pass
+
+
+if __name__ == "__main__":
+    print(json.dumps(myst_substitutions))
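The `myst_substitutions` values set above are what placeholders such as `|vllm_version|` resolve to in code blocks marked `:substitutions:`. The sketch below only illustrates the visible effect of that replacement; it is not how MyST or Sphinx implement it, and the `render` helper and its regular expression are assumptions made for the example.

```python
import re

# Trimmed copy of values set in the conf.py diff above.
myst_substitutions = {
    "vllm_version": "v0.9.2",
    "vllm_ascend_version": "v0.9.2rc1",
    "cann_image_tag": "8.1.rc1-910b-ubuntu22.04-py3.10",
}

# A placeholder is an identifier wrapped in pipes, e.g. |vllm_version|.
_PLACEHOLDER = re.compile(r"\|([A-Za-z_][A-Za-z0-9_]*)\|")


def render(text: str, values: dict) -> str:
    """Replace |name| placeholders; unknown names are left untouched."""
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


line = "git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git"
print(render(line, myst_substitutions))
# prints: git clone --branch v0.9.2 https://github.com/vllm-project/vllm.git
```

Keeping the versions in one dictionary means a release bump only touches `conf.py`, while the documentation pages keep their placeholders unchanged.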
--- a/docs/source/developer_guide/contributing.zh.md
+++ b/docs/source/developer_guide/contributing.zh.md
@ -1,102 +0,0 @@
-# Contributing
-
-## Building and testing
-We recommend building and testing in a local development environment before you submit a PR.
-
-### Prepare environment and build
-Theoretically, the vllm-ascend build is only supported on Linux because the `vllm-ascend` dependency `torch_npu` only supports Linux.
-
-But you can still set up a development environment on Linux/Windows/macOS for linting and basic tests, as shown in the following commands:
-
-```bash
-# Choose a base dir (~/vllm-project/) and create a Python virtual environment
-cd ~/vllm-project/
-python3 -m venv .venv
-source ./.venv/bin/activate
-
-# Clone and install vllm
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -r requirements-build.txt
-VLLM_TARGET_DEVICE="empty" pip install .
-cd ..
-
-# Clone and install vllm-ascend
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-pip install -r requirements-dev.txt
-
-# Run the following script to run the lint and mypy tests
-bash format.sh
-
-# Build:
-# - a full build is currently only supported on Linux (torch_npu limitation)
-# pip install -e .
-# - build without deps for debugging on other OSes
-# pip install -e . --no-deps
-
-# Commit changes using `-s`
-git commit -sm "your commit info"
-```
-
-### Testing
-Although vllm-ascend CI provides integration tests on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can also run them locally. The simplest way to run these integration tests locally is through a container:
-
-```bash
-# Under an Ascend NPU environment
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-
-IMAGE=vllm-ascend-dev-image
-CONTAINER_NAME=vllm-ascend-dev
-DEVICE=/dev/davinci1
-
-# The first build takes about 10 minutes (at 10MB/s) to download the base image and packages
-docker build -t $IMAGE -f ./Dockerfile .
-# You can also specify a mirror repo by setting VLLM_REPO to speed things up
-# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
-
-docker run --name $CONTAINER_NAME --network host --device $DEVICE \
-           --device /dev/davinci_manager --device /dev/devmm_svm \
-           --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-           -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-           -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-           -ti --rm $IMAGE bash
-
-cd vllm-ascend
-pip install -r requirements-dev.txt
-
-pytest tests/
-```
-
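-## Developer Certificate of Origin (DCO)
-
-When contributing changes to this project, you must agree to the DCO. Commits must include a "Signed-off-by:" header, which certifies agreement with the terms of the DCO.
-
-Using `-s` with `git commit` adds this header automatically.
-
-## PR title and classification
-
-Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
-
- `[Attention]` New features or optimizations in `attention`
- `[Communicator]` New features or optimizations in `communicators`
- `[ModelRunner]` New features or optimizations in `model runner`
- `[Platform]` New features or optimizations in `platform`
- `[Worker]` New features or optimizations in `worker`
- `[Core]` New features or optimizations in the core `vllm-ascend` logic (such as `platform, attention, communicators, model runner`)
- `[Kernel]` Changes affecting compute kernels and ops.
- `[Bugfix]` Bug fixes
- `[Doc]` Documentation fixes and updates
- `[Test]` Tests (such as unit tests)
- `[CI]` Build or continuous integration improvements
- `[Misc]` PRs that do not fit the above categories. Please use this prefix sparingly
-
-> [!NOTE]
-> If the PR spans more than one category, please include all relevant prefixes
-
-## Others
-
-You can find more information about contributing to the vLLM Ascend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
-If you find any problem while contributing, feel free to submit a PR to improve the docs and help other developers.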
--- a/docs/source/developer_guide/contribution/index.md
+++ b/docs/source/developer_guide/contribution/index.md
@ -4,7 +4,7 @@
 It's recommended to set up a local development environment to build and test
 before you submit a PR.

-### Prepare environment and build
+### Setup development environment

 Theoretically, the vllm-ascend build is only supported on Linux because
 `vllm-ascend` dependency `torch_npu` only supports Linux.
@ -12,68 +12,64 @@ Theoretically, the vllm-ascend build is only supported on Linux because
 But you can still set up dev env on Linux/Windows/macOS for linting and basic
 test as following commands:

+#### Run lint locally
 ```bash
 # Choose a base dir (~/vllm-project/) and set up venv
 cd ~/vllm-project/
 python3 -m venv .venv
 source ./.venv/bin/activate

-# Clone vllm code and install
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -r requirements-build.txt
-VLLM_TARGET_DEVICE="empty" pip install .
-cd ..
-
 # Clone vllm-ascend and install
 git clone https://github.com/vllm-project/vllm-ascend.git
 cd vllm-ascend
-pip install -r requirements-dev.txt

-# Then you can run lint and mypy test
+# Install lint requirement and enable pre-commit hook
+pip install -r requirements-lint.txt
+
+# Run lint (the first run needs to install the pre-commit deps, possibly via a proxy network)
 bash format.sh
+```

-# Build:
-# - only supported on Linux (torch_npu available)
-# pip install -e .
-# - build without deps for debugging in other OS
-# pip install -e . --no-deps
+#### Run CI locally

+After completing the "Run lint" setup, you can run CI locally:
+
+```{code-block} bash
+   :substitutions:
+
+cd ~/vllm-project/
+
+# Running CI requires vLLM to be installed
+git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -r requirements/build.txt
+VLLM_TARGET_DEVICE="empty" pip install .
+cd ..
+
+# Install requirements
+cd vllm-ascend
+# For Linux:
+pip install -r requirements-dev.txt
+# For non-Linux:
+cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
+cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
+
+# Run ci:
+bash format.sh ci
+```
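The non-Linux branch above installs packages one by one after filtering the requirements files with `grep -Ev '^#|^--|^$|^-r'`. As a minimal sketch of what that filter keeps, the Python below applies the same alternation to a made-up requirements snippet; the `installable_lines` helper and the sample contents are hypothetical and do not reflect the real `requirements-dev.txt`.

```python
import re

# Same alternation as the grep -Ev pattern: drop comment lines, option lines
# starting with "--", empty lines, and "-r" includes.
_SKIP = re.compile(r"^#|^--|^$|^-r")


def installable_lines(requirements_text: str) -> list:
    """Return the lines that the grep filter would pass on to `pip install`."""
    return [
        line
        for line in requirements_text.splitlines()
        if not _SKIP.search(line)
    ]


# Hypothetical sample, for illustration only.
sample = """# dev tools
--extra-index-url https://example.invalid/simple

-r requirements.txt
pytest
mypy
"""

print(installable_lines(sample))
# prints: ['pytest', 'mypy']
```

Installing each remaining line separately lets one unavailable package fail without aborting the rest, which is presumably why the loop is suggested for platforms where `torch_npu` cannot be installed.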
+
+#### Submit the commit
+
+```bash
 # Commit changed files using `-s`
 git commit -sm "your commit info"
 ```

-### Testing
+🎉 Congratulations! You have completed the development environment setup.

-Although vllm-ascend CI provide integration test on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can run it
-locally. The simplest way to run these integration tests locally is through a container:
+### Test locally

-```bash
-# Under Ascend NPU environment
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-
-IMAGE=vllm-ascend-dev-image
-CONTAINER_NAME=vllm-ascend-dev
-DEVICE=/dev/davinci1
-
-# The first build will take about 10 mins (10MB/s) to download the base image and packages
-docker build -t $IMAGE -f ./Dockerfile .
-# You can also specify the mirror repo via setting VLLM_REPO to speedup
-# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
-
-docker run --name $CONTAINER_NAME --network host --device $DEVICE \
-           --device /dev/davinci_manager --device /dev/devmm_svm \
-           --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-           -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-           -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-           -ti --rm $IMAGE bash
-
-cd vllm-ascend
-pip install -r requirements-dev.txt
-
-pytest tests/
-```
+Refer to the [Testing](./testing.md) doc to set up a testing environment and run tests locally.

 ## DCO and Signed-off-by

@ -106,3 +102,10 @@ If the PR spans more than one category, please include all relevant prefixes.

 You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
 If you find any problem when contributing, you can feel free to submit a PR to improve the doc to help other developers.
+
+
+:::{toctree}
+:caption: Index
+:maxdepth: 1
+testing
+:::
--- a/docs/source/developer_guide/contribution/testing.md
+++ b/docs/source/developer_guide/contribution/testing.md
@ -0,0 +1,280 @@
+# Testing
+
+This section explains how to write e2e tests and unit tests to verify the implementation of your feature.
+
+## Setup test environment
+
+The fastest way to set up a test environment is to use the main branch container image:
+
+:::::{tab-set}
+:sync-group: e2e
+
+::::{tab-item} Local (CPU)
+:selected:
+:sync: cpu
+
+You can run the unit tests on CPU with the following steps:
+
+```{code-block} bash
+   :substitutions:
+
+cd ~/vllm-project/
+# ls
+# vllm  vllm-ascend
+
+# Use mirror to speedup download
+# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
+export IMAGE=quay.io/ascend/cann:|cann_image_tag|
+docker run --rm --name vllm-ascend-ut \
+    -v $(pwd):/vllm-project \
+    -v ~/.cache:/root/.cache \
+    -ti $IMAGE bash
+
+# (Optional) Configure mirror to speedup download
+sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
+pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
+
+# For torch-npu dev version or x86 machine
+export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
+
+apt-get update -y
+apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
+
+# Install vllm
+cd /vllm-project/vllm
+VLLM_TARGET_DEVICE=empty python3 -m pip -v install .
+
+# Install vllm-ascend
+cd /vllm-project/vllm-ascend
+# [IMPORTANT] Import LD_LIBRARY_PATH to enumerate the CANN environment under CPU
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
+python3 -m pip install -r requirements-dev.txt
+python3 -m pip install -v .
+```
+
+::::
+
+::::{tab-item} Single card
+:sync: single
+
+```{code-block} bash
+   :substitutions:
+
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci0
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:main
+docker run --rm \
+    --name vllm-ascend \
+    --device $DEVICE \
+    --device /dev/davinci_manager \
+    --device /dev/devmm_svm \
+    --device /dev/hisi_hdc \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /root/.cache:/root/.cache \
+    -p 8000:8000 \
+    -it $IMAGE bash
+```
+
+After starting the container, you should install the required packages:
+
+```bash
+cd /vllm-workspace/vllm-ascend/
+
+# Prepare
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Install required packages
+pip install -r requirements-dev.txt
+```
+
+::::
+
+::::{tab-item} Multi cards
+:sync: multi
+
+```{code-block} bash
+   :substitutions:
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:main
+docker run --rm \
+    --name vllm-ascend \
+    --device /dev/davinci0 \
+    --device /dev/davinci1 \
+    --device /dev/davinci2 \
+    --device /dev/davinci3 \
+    --device /dev/davinci_manager \
+    --device /dev/devmm_svm \
+    --device /dev/hisi_hdc \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /root/.cache:/root/.cache \
+    -p 8000:8000 \
+    -it $IMAGE bash
+```
+
+After starting the container, you should install the required packages:
+
+```bash
+cd /vllm-workspace/vllm-ascend/
+
+# Prepare
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Install required packages
+pip install -r requirements-dev.txt
+```
+
+::::
+
+:::::
+
+## Running tests
+
+### Unit test
+
+There are several principles to follow when writing unit tests:
+
+- The test file path should mirror the source file path, and the file name should start with the `test_` prefix, for example: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
+- vLLM Ascend tests use the unittest framework; see [the unittest documentation](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
+- All unit tests must be runnable on CPU, so you must mock device-related functions so that they run on the host.
+- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
+- You can run the unit tests using `pytest`:
+
+:::::{tab-set}
+:sync-group: e2e
+
+::::{tab-item} Local (CPU)
+:selected:
+:sync: cpu
+
+```bash
+# Run unit tests
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
+TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
+```
+
+::::
+
+::::{tab-item} Single card
+:sync: single
+
+```bash
+cd /vllm-workspace/vllm-ascend/
+# Run all unit tests
+pytest -sv tests/ut
+
+# Run a single test file
+pytest -sv tests/ut/test_ascend_config.py
+```
+::::
+
+::::{tab-item} Multi cards test
+:sync: multi
+
+```bash
+cd /vllm-workspace/vllm-ascend/
+# Run all unit tests
+pytest -sv tests/ut
+
+# Run a single test file
+pytest -sv tests/ut/test_ascend_config.py
+```
+::::
+
+:::::
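+
+The mocking principle above can be sketched as follows. This is a minimal, self-contained illustration rather than code from the repository: `DeviceInfo`, `current_device_name` and `describe_device` are hypothetical names standing in for a device-related helper and the code under test.
+
+```python
+import unittest
+from unittest import mock
+
+
+class DeviceInfo:
+    """Hypothetical device-related helper (not part of vllm-ascend)."""
+
+    @staticmethod
+    def current_device_name() -> str:
+        # On real hardware this would query the NPU; on a CPU-only host it fails.
+        raise RuntimeError("no NPU available on this host")
+
+
+def describe_device() -> str:
+    """Hypothetical code under test: formats whatever the helper reports."""
+    return f"running on {DeviceInfo.current_device_name()}"
+
+
+class TestDescribeDevice(unittest.TestCase):
+
+    def test_describe_device_on_cpu(self):
+        # Replace the device query with a host-side stub so no NPU is needed.
+        with mock.patch.object(DeviceInfo, "current_device_name",
+                               return_value="npu-stub"):
+            self.assertEqual(describe_device(), "running on npu-stub")
+
+    def test_unmocked_call_fails_without_device(self):
+        # Without the mock, the device query is expected to fail on CPU.
+        with self.assertRaises(RuntimeError):
+            describe_device()
+```
+
+Saved as a `test_*.py` file, a sketch like this can be collected by `pytest` in the same way as the commands above.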
+
+### E2E test
+
+Although vllm-ascend CI provides the [e2e test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can also run it locally.
+
+:::::{tab-set}
+:sync-group: e2e
+
+::::{tab-item} Local (CPU)
+:sync: cpu
+
+You can't run e2e tests on CPU.
+::::
+
+::::{tab-item} Single card
+:selected:
+:sync: single
+
+```bash
+cd /vllm-workspace/vllm-ascend/
+# Run all single card tests
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
+
+# Run a certain test script
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py
+
+# Run a certain case in test script
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
+```
+::::
+
+::::{tab-item} Multi cards test
+:sync: multi
+```bash
+cd /vllm-workspace/vllm-ascend/
+# Run all multi card tests
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
+
+# Run a certain test script
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_batchsize.py
+
+# Run a certain case in test script
+VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
+```
+::::
+
+:::::
+
+This reproduces the e2e tests defined in [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
+
+#### E2E test example:
+
+- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
+- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
+- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
+- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
+
+    CI resources are limited, so you might need to reduce the number of layers of the model. Below is an example of how to generate a reduced layer model:
+    1. Fork the original model repo on ModelScope. All files in the repo are needed except the weights.
+    2. Set `num_hidden_layers` to the expected number of layers in the model's `config.json`, e.g., `"num_hidden_layers": 2`.
+    3. Save the following Python script as `generate_random_weight.py`. Set the parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
+
+        ```python
+        import os
+
+        import torch
+        from transformers import AutoConfig
+        # modeling_deepseek.py is expected to come from the forked model repo;
+        # run this script from a directory where it is importable.
+        from modeling_deepseek import DeepseekV3ForCausalLM
+
+        # Expand "~" explicitly so the path resolves to a local directory.
+        MODEL_LOCAL_PATH = os.path.expanduser(
+            "~/.cache/modelscope/models/vllm-ascend/DeepSeek-V3-Pruning")
+        DIST_DTYPE = torch.bfloat16
+        DIST_MODEL_PATH = "./random_deepseek_v3_with_2_hidden_layer"
+
+        # Build the model from the config only, so the weights are randomly initialized.
+        config = AutoConfig.from_pretrained(MODEL_LOCAL_PATH, trust_remote_code=True)
+        model = DeepseekV3ForCausalLM(config)
+        model = model.to(DIST_DTYPE)
+        model.save_pretrained(DIST_MODEL_PATH)
+        ```
+
+### Run doctest
+
+vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` script to run all doctests in the doc files.
+Doctests are a good way to make sure the docs are up to date and the examples are executable. You can run them locally as follows:
+
+```bash
+# Run doctest
+/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
+```
+
+This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).
--- a/docs/source/developer_guide/evaluation/accuracy_report/index.md
+++ b/docs/source/developer_guide/evaluation/accuracy_report/index.md
@ -0,0 +1,6 @@
+# Accuracy Report
+
+:::{toctree}
+:caption: Accuracy Report
+:maxdepth: 1
+:::
--- a/docs/source/developer_guide/evaluation/index.md
+++ b/docs/source/developer_guide/evaluation/index.md
@ -0,0 +1,10 @@
+# Accuracy
+
+:::{toctree}
+:caption: Accuracy
+:maxdepth: 1
+using_evalscope
+using_lm_eval
+using_opencompass
+accuracy_report/index
+:::
--- a/docs/source/developer_guide/evaluation/using_evalscope.md
+++ b/docs/source/developer_guide/evaluation/using_evalscope.md
@ -0,0 +1,173 @@
+# Using EvalScope
+
+This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
+
+## 1. Online serving
+
+You can run a docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+```
+
+If your service starts successfully, you will see the following output:
+
+```
+INFO:     Started server process [6873]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+Once your server is started, you can query the model with input prompts in a new terminal:
+
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "The future of AI is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+## 2. Install EvalScope using pip
+
+You can install EvalScope by using:
+
+```bash
+python3 -m venv .venv-evalscope
+source .venv-evalscope/bin/activate
+pip install gradio plotly evalscope
+```
+
+## 3. Run gsm8k accuracy test using EvalScope
+
+You can use `evalscope eval` to run the gsm8k accuracy test:
+```
+evalscope eval \
+ --model Qwen/Qwen2.5-7B-Instruct \
+ --api-url http://localhost:8000/v1 \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+```
+
+After 1-2 mins, the output is as shown below:
+
+```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
+| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
+| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
+```
+
+See more detail in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
+
+## 4. Run model inference stress testing using EvalScope
+
+### Install EvalScope[perf] using pip
+
+```shell
+pip install evalscope[perf] -U
+```
+
+### Basic usage
+
+You can use `evalscope perf` to run a perf test:
+```
+evalscope perf \
+    --url "http://localhost:8000/v1/chat/completions" \
+    --parallel 5 \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --number 20 \
+    --api openai \
+    --dataset openqa \
+    --stream
+```
+
+### Output results
+
+After 1-2 mins, the output is as shown below: 
+
+```shell
+Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
+| Key                               | Value                                                         |
+===================================+===============================================================+
+| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
+| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
+| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
+| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
+| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
+| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
+| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
+| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
+| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
+| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
+| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
+| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
+| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
+| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
+| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
+| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
+| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+
+
+Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
+| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
+|    10%     |  0.0962  |  0.031  |   4.4571    |      42      |      135      |       29.9767        |
+|    25%     |  0.0971  | 0.0318  |   6.3509    |      47      |      193      |       30.2157        |
+|    50%     |  0.0987  | 0.0321  |   9.3387    |      49      |      285      |       30.3969        |
+|    66%     |  0.1017  | 0.0324  |   9.8519    |      52      |      302      |       30.5182        |
+|    75%     |  0.107   | 0.0328  |   10.2391   |      55      |      313      |       30.6124        |
+|    80%     |  0.1221  | 0.0329  |   10.8257   |      58      |      330      |       30.6759        |
+|    90%     |  0.1245  | 0.0333  |   13.0472   |      62      |      404      |       30.9644        |
+|    95%     |  0.1247  | 0.0336  |   14.2936   |      66      |      432      |       31.6691        |
+|    98%     |  0.1247  | 0.0353  |   14.2936   |      66      |      432      |       31.6691        |
+|    99%     |  0.1247  | 0.0627  |   14.2936   |      66      |      432      |       31.6691        |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
+```
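+
+The throughput figures in the sample summary appear to follow directly from the request count, the average token counts and the total time. The sketch below recomputes them from the values shown above. The formulas are an assumption inferred from this one sample, not taken from the EvalScope source, and small differences in the last digit can come from the rounded averages.
+
+```python
+# Values copied from the sample benchmarking summary above.
+time_taken_s = 38.3744
+total_requests = 20
+avg_input_tokens = 50.25
+avg_output_tokens = 254.6
+
+# Requests completed per second of wall-clock time.
+request_throughput = total_requests / time_taken_s
+# Generated tokens per second across all requests.
+output_token_throughput = total_requests * avg_output_tokens / time_taken_s
+# Prompt plus generated tokens per second across all requests.
+total_token_throughput = (total_requests *
+                          (avg_input_tokens + avg_output_tokens) / time_taken_s)
+
+print(f"request throughput: {request_throughput:.4f} (reported 0.5212)")
+print(f"output token throughput: {output_token_throughput:.4f} (reported 132.6926)")
+print(f"total token throughput: {total_token_throughput:.4f} (reported 158.8819)")
+```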
+
+See more detail in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
--- a/docs/source/developer_guide/evaluation/using_lm_eval.md
+++ b/docs/source/developer_guide/evaluation/using_lm_eval.md
@ -0,0 +1,62 @@
+# Using lm-eval
+This document guides you through accuracy testing using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness).
+
+## 1. Run docker container
+
+You can run a docker container on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+```
+
+## 2. Run ceval accuracy test using lm-eval
+Install lm-eval in the container.
+
+```bash
+pip install lm-eval
+```
+Run the following command:
+
+```
+# Only test the ceval-valid_computer_network dataset in this demo
+lm_eval \
+  --model vllm \
+  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4,tensor_parallel_size=1 \
+  --tasks ceval-valid_computer_network \
+  --batch_size 8
+```
+
+After 1-2 mins, the output is as shown below:
+
+```
+The markdown format results is as below:
+
+|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
+|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
+|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.6842|±  |0.1096|
+|                            |       |none  |     0|acc_norm|↑  |0.6842|±  |0.1096|
+
+```
+
+You can find more usage details in the [lm-eval docs](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md).
--- a/docs/source/developer_guide/evaluation/using_opencompass.md
+++ b/docs/source/developer_guide/evaluation/using_opencompass.md
@ -0,0 +1,120 @@
+# Using OpenCompass 
+This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
+
+## 1. Online Serving
+
+You can run a docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+```
+If your service starts successfully, you will see the following output:
+```
+INFO:     Started server process [6873]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+Once your server is started, you can query the model with input prompts in a new terminal:
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "The future of AI is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+## 2. Run ceval accuracy test using OpenCompass
+Install OpenCompass and configure the environment variables in the container.
+
+```bash
+# Pin Python 3.10 due to:
+# https://github.com/open-compass/opencompass/issues/1976
+conda create -n opencompass python=3.10
+conda activate opencompass
+pip install opencompass modelscope[framework]
+export DATASET_SOURCE=ModelScope
+git clone https://github.com/open-compass/opencompass.git
+```
+
+Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
+
+```python
+from mmengine.config import read_base
+from opencompass.models import OpenAISDK
+
+with read_base():
+    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
+
+# Only test ceval-computer_network dataset in this demo
+datasets = ceval_datasets[:1]
+
+api_meta_template = dict(
+    round=[
+        dict(role='HUMAN', api_role='HUMAN'),
+        dict(role='BOT', api_role='BOT', generate=True),
+    ],
+    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
+)
+
+models = [
+    dict(
+        abbr='Qwen2.5-7B-Instruct-vLLM-API',
+        type=OpenAISDK,
+        key='EMPTY', # API key
+        openai_api_base='http://127.0.0.1:8000/v1', 
+        path='Qwen/Qwen2.5-7B-Instruct', 
+        tokenizer_path='Qwen/Qwen2.5-7B-Instruct', 
+        rpm_verbose=True, 
+        meta_template=api_meta_template,
+        query_per_second=1, 
+        max_out_len=1024, 
+        max_seq_len=4096, 
+        temperature=0.01, 
+        batch_size=8,
+        retry=3,
+    )
+]
+```
+
+Run the following command from the root of the cloned `opencompass` repo, where `run.py` is located:
+
+```
+python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
+```
+
+After 1-2 mins, the output is as shown below:
+
+```
+The markdown format results is as below:
+
+| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
+|----- | ----- | ----- | ----- | -----|
+| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
+```
+
+You can find more usage details in the [OpenCompass docs](https://opencompass.readthedocs.io/en/latest/index.html).
--- a/docs/source/developer_guide/feature_guide/index.md
+++ b/docs/source/developer_guide/feature_guide/index.md
@ -0,0 +1,9 @@
+# Feature Guide
+
+This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
+
+:::{toctree}
+:caption: Feature Guide
+:maxdepth: 1
+patch
+:::
--- a/docs/source/developer_guide/feature_guide/patch.md
+++ b/docs/source/developer_guide/feature_guide/patch.md
@ -0,0 +1,82 @@
+# Patch in vLLM Ascend
+
+vLLM Ascend is a platform plugin for vLLM. Because the release cycles of vLLM and vLLM Ascend differ, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
+
+In the vLLM Ascend code, we provide a patch module `vllm_ascend/patch` that holds these changes to vLLM.
+
+## Principle
+
+Keep in mind that patching is not the best way to make vLLM Ascend compatible. It is only a temporary solution. The best way is to contribute the change to vLLM itself, so that vLLM is compatible with vLLM Ascend without any patch. In vLLM Ascend, the patch strategy follows these basic principles:
+
+1. Less is more. Please do not patch unless it is currently the only way.
+2. Once a patch is added, you must describe the future plan for removing it.
+3. Cleaning up patch code is welcome at any time.
+
+## How it works
+
+In `vllm_ascend/patch`, you can see the code structure as follows:
+
+```
+vllm_ascend
+├── patch
+│   ├── platform
+│   │   ├── patch_0_9_2
+│   │   ├── patch_common
+│   │   ├── patch_main
+│   ├── worker
+│   │   ├── patch_0_9_2
+│   │   ├── patch_common
+│   │   ├── patch_main
+└───────────
+```
+
+- **platform**: The patch code in this directory patches code that runs in the vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early, when vLLM is initialized.
+  - In online mode, the vLLM process triggers the platform patch from `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the CLI args.
+  - In offline mode, the vLLM process triggers the platform patch from `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
+- **worker**: The patch code in this directory patches code that runs in the vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
+  - In both online and offline mode, the vLLM engine core process triggers the worker patch from `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
+
+Both the **platform** and **worker** folders contain several patch modules. They are used for patching different versions of vLLM.
+
+- `patch_0_9_2`: This module is used for patching the most recent vLLM release; for example, `patch_0_9_2` patches vLLM 0.9.2. Once a new version of vLLM is released, we will drop this patch module and bump to the new version.
+- `patch_main`: This module is used for patching the code in vLLM main branch.
+- `patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM main branch.
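+
+The relationship between the three modules can be sketched as a small selection function. This is an illustration only: `select_patch_modules` is a hypothetical helper, the version string `0.9.2.dev1+gabc` is made up, and treating every version other than the pinned release as the main branch is an assumption about how the dispatch behaves.
+
+```python
+import os
+from typing import List, Optional
+
+
+def select_patch_modules(installed_version: str,
+                         override: Optional[str] = None) -> List[str]:
+    """Hypothetical helper: pick the patch modules for a given vLLM version."""
+    # An explicit override (e.g. taken from VLLM_VERSION) wins over detection.
+    version = override or installed_version
+    if version == "0.9.2":
+        versioned = "patch_0_9_2"
+    else:
+        # Assumption: anything that is not the pinned release counts as main.
+        versioned = "patch_main"
+    # patch_common applies to both the pinned release and the main branch.
+    return [versioned, "patch_common"]
+
+
+print(select_patch_modules("0.9.2"))            # ['patch_0_9_2', 'patch_common']
+print(select_patch_modules("0.9.2.dev1+gabc"))  # ['patch_main', 'patch_common']
+# With an override read from the environment, the result depends on VLLM_VERSION.
+print(select_patch_modules("0.9.2.dev1+gabc",
+                           override=os.environ.get("VLLM_VERSION")))
+```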
+
+## How to write a patch
+
+Before writing a patch, follow the principles above and patch as little code as possible. If a patch is necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
+
+1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.2 and main of vLLM.
+2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
+3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
+4. Write your patch code in the new file. Here is an example:
+    ```python
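+    # Import the submodule explicitly so it is loaded before it is patched.
+    import vllm.distributed.parallel_state
+
+    def patch_destroy_model_parallel():
+        # your patch code
+        ...
+
+    # Rebind the upstream function to the patched implementation.
+    vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel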
+    ```
+5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
+6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
+    ```
+    # ** File: <The patch file name> **
+    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    #   1. `<The target patch module in vLLM>`
+    #    Why:
+    #       <Describe the reason why we need to patch>
+    #    How:
+    #       <Describe the way to patch>
+    #    Related PR (if no, explain why):
+    #       <Add a link to the related PR in vLLM. If there is no related PR, explain why>
+    #    Future Plan:
+    #       <Describe the future plan to remove the patch>
+    ```
+7. Add unit tests and E2E tests. Any newly added code in vLLM Ascend should come with unit tests and E2E tests as well. You can find more details in the [test guide](../contribution/testing.md).
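+
+The rebinding in step 4 can be tried without vLLM installed. The sketch below uses a stand-in module (`upstream` is a hypothetical name, not a vLLM module) to show what the assignment does and why it helps to apply patches early: a caller that already bound the function by name before the patch keeps the original.
+
+```python
+import types
+
+# Stand-in for an upstream module such as vllm.distributed.parallel_state.
+upstream = types.ModuleType("upstream_parallel_state")
+
+
+def _original_destroy_model_parallel():
+    return "original"
+
+
+upstream.destroy_model_parallel = _original_destroy_model_parallel
+
+# A caller that bound the function by name before the patch was applied,
+# as `from upstream import destroy_model_parallel` would do.
+early_binding = upstream.destroy_model_parallel
+
+
+def patch_destroy_model_parallel():
+    return "patched"
+
+
+# The patch itself: rebind the attribute on the module object.
+upstream.destroy_model_parallel = patch_destroy_model_parallel
+
+print(upstream.destroy_model_parallel())  # patched
+print(early_binding())                    # original
+```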
+
+
+## Limitation
+1. In V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process and the Worker process. Currently vLLM Ascend only supports patching code in the Main process and the Worker process by default. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup. The entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
+2. If you are running an edited version of vLLM, the vLLM version may change automatically. For example, if you run an edited vLLM based on v0.9.n, the version of vLLM may change to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't determine the version of vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using; the patch for v0.9.n should then work.
--- a/docs/source/developer_guide/modeling/adding_a_new_model.md
+++ b/docs/source/developer_guide/modeling/adding_a_new_model.md
@ -0,0 +1,258 @@
+# Adding a New Model
+
+This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to
+[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.
+
+## Step 1: Implementing Models with `torch` and `torch_npu`
+
+This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
+
+**Before starting:**
+
+- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
+- Use existing models' implementation as templates to accelerate your development.
+
+### Method 1: Implementing New Models from Scratch
+
+Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
+
+**Key implementation requirements:**
+
+1. Place model files in `vllm_ascend/models/` directory.
+
+2. Standard module structure for decoder-only LLMs (please check out vllm's implementations for other kinds of models):
+
+- `*ModelForCausalLM` (top-level wrapper)
+- `*Model` (main architecture)
+- `*DecoderLayer` (transformer block)
+- `*Attention` and `*MLP` (specific computation unit)
+
+:::{note}
+`*` denotes your model's unique identifier.
+:::
+
+3. Critical Implementation Details:
+
+All modules must include a `prefix` argument in `__init__()`.
+
+**Required interfaces:**
+
+| Module Type          | Required Methods                          |
+| :------------------- | :---------------------------------------- |
+| `*ModelForCausalLM`  | `get_input_embeddings`, `compute_logits`, `load_weights` |
+| `*Model`             | `get_input_embeddings`, `load_weights`    |
+
+4. Attention Backend Integration:
+
+Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
+
+5. Tensor Parallelism:
+
+Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
+
+**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
+
+```python
+from collections.abc import Iterable
+from typing import Optional, Union
+
+import torch
+from torch import nn
+from vllm.attention import Attention
+from vllm.config import VllmConfig
+from vllm.sequence import IntermediateTensors
+from vllm.model_executor.sampling_metadata import SamplingMetadata
+
+class CustomAttention(nn.Module):
+    def __init__(self, vllm_config: VllmConfig, prefix: str):
+        super().__init__()
+        self.attn = Attention(prefix=f"{prefix}.attn")
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        # Implement attention logic
+        ...
+
+class CustomDecoderLayer(nn.Module):
+    def __init__(self, vllm_config: VllmConfig, prefix: str):
+        super().__init__()
+        self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        # Implement decoder layer
+        ...
+
+class CustomModel(nn.Module):
+    def __init__(self, vllm_config: VllmConfig, prefix: str):
+        super().__init__()
+        self.layers = nn.ModuleList([
+            CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") 
+            for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
+        ])
+
+    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
+        ...
+
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        intermediate_tensors: Optional[IntermediateTensors] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+    ) -> Union[torch.Tensor, IntermediateTensors]:
+        ...
+
+    def load_weights(self, 
+                    weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        ...
+
+class CustomModelForCausalLM(nn.Module):
+    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
+        super().__init__()
+        self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")
+
+    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
+        ...
+
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        intermediate_tensors: Optional[IntermediateTensors] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+    ) -> Union[torch.Tensor, IntermediateTensors]:
+        ...
+
+    def compute_logits(self,
+                      hidden_states: torch.Tensor,
+                      sampling_metadata: SamplingMetadata) -> torch.Tensor:
+        ...
+
+    def load_weights(self, 
+                    weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        ...
+```
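+
+Before wiring a new class into vllm, it can help to confirm that it exposes the methods listed in the "Required interfaces" table. The snippet below is a minimal sketch of such a check; `REQUIRED_METHODS`, `missing_methods`, and `IncompleteModel` are hypothetical names used only for illustration and are not part of vllm or vllm-ascend.
+
+```python
+# Hypothetical helper (not a vllm / vllm-ascend API): mirrors the
+# "Required interfaces" table above.
+REQUIRED_METHODS = {
+    "causal_lm": ("get_input_embeddings", "compute_logits", "load_weights"),
+    "model": ("get_input_embeddings", "load_weights"),
+}
+
+
+def missing_methods(cls: type, kind: str) -> list[str]:
+    """Return the required method names that `cls` does not define."""
+    return [
+        name
+        for name in REQUIRED_METHODS[kind]
+        if not callable(getattr(cls, name, None))
+    ]
+
+
+class IncompleteModel:
+    """Stand-in for a model class that is still under development."""
+
+    def get_input_embeddings(self, input_ids):
+        ...
+
+
+print(missing_methods(IncompleteModel, "model"))  # ['load_weights']
+```
+
+The check only looks at method names, so it cannot tell whether the signatures or return types match what vllm expects.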
+
+### Method 2: Customizing Existing vLLM Models
+
+For most use cases, extending existing implementations is preferable. We demonstrate an example to inherit from base classes and implement a custom deepseek model below (assumed path: `vllm_ascend/models/deepseek_v2.py`).
+
+```python
+from typing import List, Optional, Union
+import torch
+from vllm.attention import AttentionMetadata
+from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
+from vllm.sequence import IntermediateTensors
+
+class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
+    # Define merged weights for quantization/efficiency
+    packed_modules_mapping = {
+        "gate_up_proj": ["gate_proj", "up_proj"],
+        "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
+    }
+
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        kv_caches: Optional[List[torch.Tensor]] = None,
+        attn_metadata: Optional[AttentionMetadata] = None,
+        intermediate_tensors: Optional[IntermediateTensors] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+    ) -> Union[torch.Tensor, IntermediateTensors]:
+        # Custom forward logic
+        hidden_states = self.model(
+            input_ids, 
+            positions, 
+            kv_caches,
+            attn_metadata, 
+            intermediate_tensors,
+            inputs_embeds
+        )
+        return hidden_states
+```
+
+:::{note}
+For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
+:::
+
+## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
+
+vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
+
+To integrate your implemented model from `vllm_ascend/models/` directory:
+
+1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
+2. Register the model wrapper class via `vllm.ModelRegistry.register_model()` function.
+
+**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):
+
+```python
+from vllm import ModelRegistry
+
+def register_model():
+    from .custom_model import CustomModelForCausalLM        # New custom model
+    from .deepseek_v2 import CustomDeepseekV2ForCausalLM    # Customized Deepseek
+
+    # For NEW architectures: Register with unique name
+    ModelRegistry.register_model(
+        "CustomModelForCausalLM",  # Must match config.json's 'architectures'
+        "vllm_ascend.models.custom_model:CustomModelForCausalLM"
+    )
+
+    # For MODIFIED architectures: Use original name
+    ModelRegistry.register_model(
+        "DeepseekV2ForCausalLM",   # Original architecture identifier in vLLM
+        "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM"
+    )
+```
+
+:::{note}
+The first argument of `vllm.ModelRegistry.register_model()` indicates the unique architecture identifier which must match `architectures` in `config.json` of the model.
+
+```json
+{
+  "architectures": [
+    "CustomModelForCausalLM"
+  ]
+}
+```
+:::
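+
+A mismatch between the registered name and `architectures` is easy to miss, so a quick check can be useful. The following is a minimal sketch, assuming you keep your own set of the names passed to `register_model()`; `unregistered_architectures` is a hypothetical helper, not a vllm API.
+
+```python
+import json
+
+
+def unregistered_architectures(config_text: str, registered: set[str]) -> list[str]:
+    """List architectures named in config.json content that are not in `registered`."""
+    config = json.loads(config_text)
+    return [arch for arch in config.get("architectures", []) if arch not in registered]
+
+
+# Content of a model's config.json (only the relevant key is shown).
+config_text = '{"architectures": ["CustomModelForCausalLM"]}'
+
+# Assumption: the names you passed as the first argument of register_model().
+registered = {"CustomModelForCausalLM", "DeepseekV2ForCausalLM"}
+
+print(unregistered_architectures(config_text, registered))  # []
+```
+
+An empty list means every architecture named in `config.json` has a matching registration.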
+
+## Step 3: Verification
+
+### Case 1: Overriding Existing vLLM Model Architecture
+
+If you register a customized model architecture based on vllm's existing implementation (overriding vllm's original class), you'll observe warning logs similar to the following, emitted from `vllm/model_executor/models/registry.py`, when running vllm offline/online inference (using any model).
+
+```bash
+Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend/models/deepseek_v2:CustomDeepseekV2ForCausalLM.
+```
+
+### Case 2: Registering New Model Architecture
+
+If you register a novel model architecture not present in vllm (creating a completely new class), the current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
+
+```python
+logger.info(f"model_arch: {model_arch} has been registered here!")
+```
+
+After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
+
+```bash
+model_arch: CustomModelForCausalLM has been registered here!
+```
+
+This log output confirms your novel model architecture has been successfully registered in vllm.
+
+## Step 4: Testing
+
+After adding a new model, you should run basic functional tests (offline/online inference), an accuracy test, and a performance benchmark for the model.
+
+Find more details at:
+
+- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
+- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)
+
+## Step 5: Updating Supported Models Doc
+
+Finally, once all the steps above are completed, add the new model to our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.
--- a/docs/source/developer_guide/modeling/adding_a_new_multimodal_model.md
+++ b/docs/source/developer_guide/modeling/adding_a_new_multimodal_model.md
@ -0,0 +1,3 @@
+# Adding a New Multi-Modal Model
+
+**_Coming soon ..._**
--- a/docs/source/developer_guide/modeling/index.md
+++ b/docs/source/developer_guide/modeling/index.md
@ -0,0 +1,10 @@
+# Modeling
+
+This section provides tutorials on how to implement and register a new model in vllm-ascend.
+
+:::{toctree}
+:caption: Modeling
+:maxdepth: 1
+adding_a_new_model
+adding_a_new_multimodal_model
+:::
--- a/docs/source/developer_guide/performance/index.md
+++ b/docs/source/developer_guide/performance/index.md
@ -0,0 +1,8 @@
+# Performance
+
+:::{toctree}
+:caption: Performance
+:maxdepth: 1
+performance_benchmark
+profile_execute_duration
+:::
--- a/docs/source/developer_guide/performance/performance_benchmark.md
+++ b/docs/source/developer_guide/performance/performance_benchmark.md
@ -0,0 +1,187 @@
+# Performance Benchmark
+This document details the benchmark methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
+
+**Benchmark Coverage**: We measure offline e2e latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
+
+## 1. Run docker container
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+```
+
+## 2. Install dependencies
+```bash
+cd /workspace/vllm-ascend
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+pip install -r benchmarks/requirements-bench.txt
+```
+
+## 3. (Optional) Prepare model weights
+For faster runs, we recommend downloading the model in advance:
+```bash
+modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
+```
+
+You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
+```json
+[
+  {
+    "test_name": "latency_llama8B_tp1",
+    "parameters": {
+      "model": "your local model path",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
+```
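+
+Editing many test entries by hand is error-prone, so the replacement can also be scripted. This is a minimal sketch that assumes each entry keeps the model under `parameters.model`, as in the example above; other test files may use different keys, and `set_model_path` and the local path are hypothetical.
+
+```python
+import json
+
+
+def set_model_path(tests: list[dict], model_path: str) -> list[dict]:
+    """Return a copy of the test list with every parameters["model"] replaced."""
+    updated = []
+    for test in tests:
+        test = dict(test)  # shallow copy so the input list is left untouched
+        params = test.get("parameters")
+        if isinstance(params, dict) and "model" in params:
+            test["parameters"] = {**params, "model": model_path}
+        updated.append(test)
+    return updated
+
+
+tests = json.loads("""
+[
+  {
+    "test_name": "latency_llama8B_tp1",
+    "parameters": {"model": "your local model path", "tensor_parallel_size": 1}
+  }
+]
+""")
+
+# Placeholder path: replace it with the directory that holds your weights.
+patched = set_model_path(tests, "/path/to/Meta-Llama-3.1-8B-Instruct")
+print(json.dumps(patched, indent=2))
+```
+
+In practice you would read each json file, apply the function, and write the result back with `json.dump`.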
+
+## 4. Run benchmark script
+Run benchmark script:
+```bash
+bash benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+After about 10 mins, the output is as shown below:
+```bash
+online serving:
+qps 1:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  212.77    
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              0.94      
+Output token throughput (tok/s):         204.66    
+Total Token throughput (tok/s):          405.16    
+---------------Time to First Token----------------
+Mean TTFT (ms):                          104.14    
+Median TTFT (ms):                        102.22    
+P99 TTFT (ms):                           153.82    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          38.78     
+Median TPOT (ms):                        38.70     
+P99 TPOT (ms):                           48.03     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           38.46     
+Median ITL (ms):                         36.96     
+P99 ITL (ms):                            75.03     
+==================================================
+
+qps 4:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  72.55     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              2.76      
+Output token throughput (tok/s):         600.24    
+Total Token throughput (tok/s):          1188.27   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          115.62    
+Median TTFT (ms):                        109.39    
+P99 TTFT (ms):                           169.03    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          51.48     
+Median TPOT (ms):                        52.40     
+P99 TPOT (ms):                           69.41     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           50.47     
+Median ITL (ms):                         43.95     
+P99 ITL (ms):                            130.29    
+==================================================
+
+qps 16:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  47.82     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              4.18      
+Output token throughput (tok/s):         910.62    
+Total Token throughput (tok/s):          1802.70   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          128.50    
+Median TTFT (ms):                        128.36    
+P99 TTFT (ms):                           187.87    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          83.60     
+Median TPOT (ms):                        77.85     
+P99 TPOT (ms):                           165.90    
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           65.72     
+Median ITL (ms):                         54.84     
+P99 ITL (ms):                            289.63    
+==================================================
+
+qps inf:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  41.26     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              4.85      
+Output token throughput (tok/s):         1055.44   
+Total Token throughput (tok/s):          2089.40   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          3394.37   
+Median TTFT (ms):                        3359.93   
+P99 TTFT (ms):                           3540.93   
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          66.28     
+Median TPOT (ms):                        64.19     
+P99 TPOT (ms):                           97.66     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           56.62     
+Median ITL (ms):                         55.69     
+P99 ITL (ms):                            82.90     
+==================================================
+
+offline:
+latency:
+Avg latency: 4.944929537673791 seconds
+10% percentile latency: 4.894104263186454 seconds
+25% percentile latency: 4.909652255475521 seconds
+50% percentile latency: 4.932477846741676 seconds
+75% percentile latency: 4.9608619548380375 seconds
+90% percentile latency: 5.035418218374252 seconds
+99% percentile latency: 5.052476694583893 seconds
+
+throughput:
+Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
+Total num prompt tokens:  42659
+Total num output tokens:  43545
+```
+The result JSON files are written to `benchmark/results`.
+These files contain detailed benchmarking results for further analysis.
+
+```bash
+.
+|-- latency_llama8B_tp1.json
+|-- serving_llama8B_tp1_qps_1.json
+|-- serving_llama8B_tp1_qps_16.json
+|-- serving_llama8B_tp1_qps_4.json
+|-- serving_llama8B_tp1_qps_inf.json
+`-- throughput_llama8B_tp1.json
+```
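+
+As a sanity check when reading the serving output, the throughput figures appear to be the request and token counts divided by the benchmark duration. This relationship is inferred from the numbers printed above, not taken from the benchmark source, so treat the sketch below as an assumption; `serving_throughput` is a hypothetical helper.
+
+```python
+def serving_throughput(
+    num_requests: int, input_tokens: int, output_tokens: int, duration_s: float
+) -> dict[str, float]:
+    """Derive throughput figures from raw counts and the benchmark duration."""
+    return {
+        "request_throughput": num_requests / duration_s,
+        "output_token_throughput": output_tokens / duration_s,
+        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
+    }
+
+
+# Values copied from the "qps 1" block of the sample output.
+qps1 = serving_throughput(
+    num_requests=200, input_tokens=42659, output_tokens=43545, duration_s=212.77
+)
+for name, value in qps1.items():
+    print(f"{name}: {value:.2f}")
+```
+
+The results agree with the reported values up to small differences, which is expected because the printed duration is itself rounded to two decimals.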
--- a/docs/source/developer_guide/performance/profile_execute_duration.md
+++ b/docs/source/developer_guide/performance/profile_execute_duration.md
@ -0,0 +1,39 @@
+# Profile Execute Duration
+
+The execution duration of each stage (including pre/post-processing, model forward, etc.) usually needs to be captured during a complete inference process. Typically, this is done by calling `torch.npu.synchronize()` and reading CPU timestamps, which adds host/device synchronization overhead.
+
+**To reduce the performance overhead, we add this feature, using the NPU event timestamp mechanism to observe the device execution time asynchronously.**
+
+## Usage
+* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
+* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
+* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
+
+**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
+```
+VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
+```
+
+## Example Output
+
+```
+5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
+5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
+5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
+5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
+5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
+5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
+5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
+5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
+5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
+5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
+5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
+5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
+5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
+5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
+5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
+5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
+5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
+5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
+
+```
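+
+To summarize many such lines, the per-stage durations can be averaged with a small script. This is a minimal sketch that assumes the log format shown above (`[stage]:<value>ms`); `STAGE_PATTERN` and `mean_stage_durations` are hypothetical names, not part of vllm-ascend.
+
+```python
+import re
+from collections import defaultdict
+
+# Matches "[stage name]:12.34ms". The "[Decode]" tag is skipped because it is
+# followed by ": " rather than directly by a number.
+STAGE_PATTERN = re.compile(r"\[([^\[\]]+)\]:(\d+(?:\.\d+)?)ms")
+
+
+def mean_stage_durations(log_lines: list[str]) -> dict[str, float]:
+    """Average the duration (ms) of each profiled stage over all log lines."""
+    samples = defaultdict(list)
+    for line in log_lines:
+        if "Profile execute duration" not in line:
+            continue
+        for stage, ms in STAGE_PATTERN.findall(line):
+            samples[stage].append(float(ms))
+    return {stage: sum(values) / len(values) for stage, values in samples.items()}
+
+
+sample = [
+    "(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
+    "[post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms",
+    "(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
+    "[post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms",
+]
+
+for stage, mean_ms in mean_stage_durations(sample).items():
+    print(f"{stage}: {mean_ms:.2f} ms")
+```
+
+The function averages across all workers; group the lines by `pid` first if you need per-worker figures.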
--- a/docs/source/developer_guide/versioning_policy.md
+++ b/docs/source/developer_guide/versioning_policy.md
@ -1,65 +0,0 @@
-# Versioning policy
-
-Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows the [PEP 440](https://peps.python.org/pep-0440/) to publish matching with vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).
-
-## vLLM Ascend Plugin versions
-
-Each vllm-ascend release will be versioned: `v[major].[minor].[micro][rcN][.postN]` (such as
-`v0.7.1rc1`, `v0.7.1`, `v0.7.1.post1`)
-
- **Final releases**: will typically be released every **3 months**, will take the vLLM upstream release plan and Ascend software product release plan into comprehensive consideration.
- **Pre releases**: will typically be released **on demand**, ending with rcN, represents the Nth release candidate version, to support early testing by our users prior to a final release.
- **Post releases**: will typically be released **on demand** to support to address minor errors in a final release. It's different from [PEP-440 post release note](https://peps.python.org/pep-0440/#post-releases) suggestion, it will contain actual bug fixes considering that the final release version should be matched strictly with the vLLM final release version (`v[major].[minor].[micro]`). The post version has to be published as a patch version of the final release.
-
-For example:
- `v0.7.x`: it's the first final release to match the vLLM `v0.7.x` version.
- `v0.7.1rc1`: will be the first pre version of vllm-ascend.
- `v0.7.1.post1`: will be the post release if the `v0.7.1` release has some minor errors.
-
-## Branch policy
-
-vllm-ascend has main branch and dev branch.
-
- **main**: main branch，corresponds to the vLLM main branch, and is continuously monitored for quality through Ascend CI.
- **vX.Y.Z-dev**: development branch, created with part of new releases of vLLM. For example, `v0.7.1-dev` is the dev branch for vLLM `v0.7.1` version.
-
-Usually, a commit should be ONLY first merged in the main branch, and then backported to the dev branch to reduce maintenance costs as much as possible.
-
-### Maintenance branch and EOL:
-The branch status will be in one of the following states:
-
-| Branch            | Time frame                       | Summary                                                              |
-|-------------------|----------------------------------|----------------------------------------------------------------------|
-| Maintained        | Approximately 2-3 minor versions | All bugfixes are appropriate. Releases produced, CI commitment.      |
-| Unmaintained      | Community interest driven        | All bugfixes are appropriate. No Releases produced, No CI commitment |
-| End of Life (EOL) | N/A                              | Branch no longer accepting changes                                   |
-
-### Branch state
-
-Note that vllm-ascend will only be released for a certain vLLM release version rather than all versions. Hence, You might see only part of versions have dev branches (such as only `0.7.1-dev` / `0.7.3-dev` but no `0.7.2-dev`), this is as expected.
-
-Usually, each minor version of vLLM (such as 0.7) will correspond to a vllm-ascend version branch and support its latest version (for example, we plan to support version 0.7.3) as following shown:
-
-| Branch     | Status       | Note                                 |
-|------------|--------------|--------------------------------------|
-| main       | Maintained   | CI commitment for vLLM main branch   |
-| v0.7.3-dev | Maintained   | CI commitment for vLLM 0.7.3 version |
-| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev               |
-
-
-## Release Compatibility Matrix
-
-Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
-
-| vllm-ascend  | vLLM         | Python | Stable CANN | PyTorch/torch_npu |
-|--------------|--------------| --- | --- | --- |
-| v0.7.1rc1 | v0.7.1 | 3.9 - 3.12 | 8.0.0   |  2.5.1 / 2.5.1.dev20250218 |
-
-## Release cadence
-
-### Next final release (`v0.7.x`) window
-
-| Date       | Event                                                            |
-|------------|------------------------------------------------------------------|
-| March 2025 | Release candidates, v0.7.3rc1                                    |
-| March 2025 | Final release passes, match vLLM v0.7.x latest: v0.7.1 or v0.7.3 |
--- a/docs/source/developer_guide/versioning_policy.zh.md
+++ b/docs/source/developer_guide/versioning_policy.zh.md
@ -1,64 +0,0 @@
-# 版本策略
-
-从vLLM的0.7.x版本开始，vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) 整体遵循[PEP 440](https://peps.python.org/pep-0440/)的版本策略，与vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)) 配套发布。
-
-## vLLM Ascend Plugin版本
-
-vllm-ascend的版本号为：`v[major].[minor].[micro][rcN][.postN]`（比如`v0.7.1rc1`, `v0.7.1`, `v0.7.1.post1`）
-
- **Final releases （正式版）**: 通常3个月发布一次正式版，将会综合考虑vLLM上游发布及昇腾产品软件发布策略。
- **Pre releases （尝鲜版）**: 通常为按需发布，以rcN结尾，代表第N个Release Candidate版本，提供在final release之前的尝鲜版（早期试用版）。
- **Post releases （补丁版）**: 通常在final release发布后按需发布，主要是修复最终版本的错误。这个策略与[PEP-440提到的策略](https://peps.python.org/pep-0440/#post-releases)有所不同，它会包含实际的bug修复，考虑到正式版与vLLM的版本（`v[major].[minor].[micro]`）配套发布。因此，Post releases通常是Final release的补丁版本。
-
-例如：
- `v0.7.x`: 是配套 vLLM `v0.7.x` 版本的正式版。
- `v0.7.1rc1`: 是vllm-ascend第一个尝鲜版（早期试用版）。
- `v0.7.1.post1`: 是`v0.7.1`版本的post release。
-
-## 分支管理策略
-
-vllm-ascend有主干和开发两种分支。
-
- **main**: 主干分支，与vLLM的主干分支对应，并通过昇腾CI持续进行质量看护。
- **vX.Y.Z-dev**: 开发分支，随vLLM部分新版本发布而创建，比如`v0.7.1-dev`是vllm-ascend针对vLLM `v0.7.1`版本的开发分支。
-
-
-通常，一个commit需要先合入到主干分支，然后再反合到开发分支，从而尽可能地减少版本维护成本。
-
-
-### 分支维护和EOL
-某个分支的状态将会以下三种之一：
-| 分支            | 维护时间                 | 备注                                                              |
-|-------------------|----------------------------|----------------------------------------------------------------------|
-| Maintained （维护中）        | 大概2-3个minor版本 | 合入所有已解决的问题，发布版本，CI保证 |
-| Unmaintained （无维护）     | 社区诉求/兴趣驱动  | 合入所有已解决的问题，无版本发布，无CI承诺 |
-| End of Life (EOL, 生命周期终止) | 无                        | 分支不再接受任何代码                                   |
-
-### 分支状态
-
-注意：对于`*-dev`分支，vllm-ascend将仅针对 vLLM 某个特定版本创建开发分支，而非全量版本。 因此，您可能看到部分vLLM版本没有对应的开发分支（比如只能看到`0.7.1-dev` / `0.7.3-dev`分支，而没有`0.7.2-dev`分支），这是符合预期的。
-
-通常来说，vLLM每个minor版本（比如0.7）均会对应一个vllm-ascend版本分支，并支持其最新的版本（例如我们计划支持0.7.3版本）。如下所示：
-
-| 分支         | 状态           | 备注                  |
-|------------|--------------|---------------------|
-| main       | Maintained   | 基于vLLM main分支CI看护   |
-| v0.7.3-dev | Maintained   | 基于vLLM v0.7.3版本CI看护 |
-| v0.7.1-dev | Unmaintained | 被v0.7.3-dev分支代替     |
-
-## 版本配套
-
-vLLM Ascend Plugin (`vllm-ascend`) 的关键配套关系如下:
-
-| vllm-ascend  | vLLM    | Python | Stable CANN | PyTorch/torch_npu |
-|--------------|---------| --- | --- | --- |
-| v0.7.1rc1 | v0.7.1 | 3.9 - 3.12 | 8.0.0 |  2.5.1 / 2.5.1.dev20250218 |
-
-## 发布节奏
-
-### 下一个正式版(`v0.7.x`)发布窗口
-
-| 时间       | 事件                            |
-|----------|-------------------------------|
-| 2025年03月 | RC版本, v0.7.3rc1               |
-| 2025年03月 | 正式版, 匹配0.7.3最新的vLLM版本: v0.7.3 |
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@ -0,0 +1,169 @@
+# FAQs
+
+## Version Specific FAQs
+
+- [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
+- [[v0.9.2rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1742)
+
+## General FAQs
+
+### 1. What devices are currently supported?
+
+Currently, **ONLY** the Atlas A2 series (Ascend-cann-kernels-910b) and the Atlas 300I series (Ascend-cann-kernels-310p) are supported:
+
+- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
+- Atlas 800I A2 Inference series (Atlas 800I A2)
+- Atlas 300I Inference series (Atlas 300I Duo)
+
+The following series are NOT supported yet:
+- Atlas 200I A2 (Ascend-cann-kernels-310b): not planned yet
+- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): not planned yet
+
+From a technical view, vllm-ascend support is possible if torch-npu supports the device. Otherwise, it has to be implemented with custom ops. You are welcome to join us and improve it together.
+
+### 2. How to get our docker containers?
+
+You can get our containers at `Quay.io`, e.g., [<u>vllm-ascend</u>](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [<u>cann</u>](https://quay.io/repository/ascend/cann?tab=tags).
+
+If you are in China, you can use `daocloud` to accelerate your downloading:
+
+```bash
+# Replace with tag you want to pull
+TAG=v0.7.3rc2
+docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
+```
+
+### 3. What models does vllm-ascend support?
+
+Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
+
+### 4. How to get in touch with our community?
+
+There are many channels through which you can communicate with our community developers / users:
+
+- Submit a GitHub [<u>issue</u>](https://github.com/vllm-project/vllm-ascend/issues?page=1).
+- Join our [<u>weekly meeting</u>](https://docs.google.com/document/d/1hCSzRTMZhIB8vRq1_qOOjx4c9uYUxvdQvDsMV2JcSrw/edit?tab=t.0#heading=h.911qu8j8h35z) and share your ideas.
+- Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your questions.
+- Join our ascend channel in [<u>vLLM forums</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and publish your topics.
+
+### 5. What features does vllm-ascend V1 support?
+
+Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
+
+### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
+
+Basically, the reason is that the NPU environment is not configured correctly. You can:
+1. try `source /usr/local/Ascend/nnal/atb/set_env.sh` to enable NNAL package.
+2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable CANN package.
+3. try `npu-smi info` to check whether the NPU is working.
+
+If none of the steps above helps, run the following Python code to check whether any import raises an error:
+
+```
+import torch
+import torch_npu
+import vllm
+```
+
+If the problem persists, feel free to submit a GitHub issue.
+
+### 7. How does vllm-ascend perform?
+
+Currently, performance has been optimized only for some models, such as `Qwen2.5 VL`, `Qwen3`, and `Deepseek V3`; other models are not yet well optimized. Since 0.9.0rc2, Qwen and Deepseek models work with graph mode for better performance. In addition, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up inference.
+
+### 8. How does vllm-ascend work with vllm?
+vllm-ascend is a plugin for vllm. In general, the vllm-ascend version matches the vllm version. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible at each commit.
+
+### 9. Does vllm-ascend support the Prefill Disaggregation feature?
+
+Currently, only 1P1D is supported, and only on the V0 Engine. Support for the V1 Engine and for NPND will be stabilized and added to vllm-ascend in the future.
+
+### 10. Does vllm-ascend support quantization?
+
+Currently, w8a8 quantization is supported natively by vllm-ascend v0.8.4rc2 or higher. If you're using vllm 0.7.3, w8a8 quantization is supported through the integration of vllm-ascend and mindie-turbo; please install it with `pip install vllm-ascend[mindie-turbo]`.
+
+### 11. How to run w8a8 DeepSeek model?
+
+Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
+
+### 12. There is no output in the log when loading models using vllm-ascend. How to solve it?
+
+If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally. Otherwise, please file an issue.
+
+### 13. How is vllm-ascend tested?
+
+vllm-ascend is tested by functional, performance, and accuracy tests.
+
+- **Functional test**: our CI runs a portion of vllm's native unit tests together with vllm-ascend's own unit tests. vllm-ascend's tests cover basic functionality, availability of popular models, and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e tests.
+
+- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarking that can easily be reproduced locally. We'll publish a perf website to show the performance test results for each pull request.
+
+- **Accuracy test**: we're working on adding accuracy tests to CI as well.
+
+Finally, for each release, we'll publish the performance test and accuracy test reports in the future.
+
+### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+This usually happens because you have installed a dev/editable version of the vLLM package. In this case, set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
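+
+As a sketch, assuming the installed vLLM is a dev build based on 0.9.2 (the version number here is only an illustration; use the one that matches your install):
+
+```bash
+# Tell vllm-ascend which released vLLM version the dev/editable install corresponds to
+export VLLM_VERSION=0.9.2
+```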
+
+### 15. How to handle Out Of Memory?
+OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
+
+In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
+
+- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value of `0.9` is used. You can decrease this value to reserve more memory and reduce the risk of fragmentation. See [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig) for more details.
+
+- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` enables the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. See [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html) for more details.
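+
+The two settings can be combined when starting the server. This is a minimal sketch in which the model name and the `0.8` value are illustrative assumptions to tune for your workload:
+
+```bash
+# Enable expandable segments to reduce fragmentation on the NPU
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+# Reserve more headroom than the default 0.9
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization 0.8
+```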
+
+### 16. Failed to enable NPU graph mode when running DeepSeek?
+You may encounter the following error when running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the number of queries per KV head must be one of {32, 64, 128}. **DeepSeek-V2-Lite is therefore not supported**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
+
+If you're using DeepSeek-V3 or DeepSeek-R1, make sure that after the tensor parallel split, `num_heads / num_kv_heads` is in {32, 64, 128}. A ratio of 8, as in the log below, is rejected.
+
+```bash
+[rank0]: RuntimeError: EZ9999: Inner Error!
+[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
+```
+
+### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+You may encounter a C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, try installing with `python setup.py install`, or run `python setup.py clean` to clear the cache.
+
+### 18. How to generate deterministic results when using vllm-ascend?
+Several factors affect output determinism:
+
+1. Sampling method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0)
+# Create an LLM.
+llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
+
+# Generate texts from the prompts.
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+2. Set the following environment variables:
+
+```bash
+export LCCL_DETERMINISTIC=1
+export HCCL_DETERMINISTIC=1
+export ATB_MATMUL_SHUFFLE_K_ENABLE=0
+export ATB_LLM_LCOC_ENABLE=0
+```
+
+### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?
+The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package with `pip install qwen-omni-utils` to ensure all dependencies are met.
+This package installs `librosa` and its related dependencies, which resolves the `ImportError: No module named 'librosa'` issue and ensures that audio processing works correctly.
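+
+A quick way to confirm the fix, sketched as shell commands (the `python3 -c` check is an illustrative addition and assumes `python3` points at the environment where vLLM is installed):
+
+```bash
+# Install the helper package, which pulls in librosa
+pip install qwen-omni-utils
+# Verify that librosa can now be imported
+python3 -c "import librosa; print(librosa.__version__)"
+```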
--- a/docs/source/index.md
+++ b/docs/source/index.md
@ -35,22 +35,37 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
 :maxdepth: 1
 quick_start
 installation
-tutorials
+tutorials/index.md
+faqs
 :::

 % What does vLLM Ascend Plugin support?
 :::{toctree}
 :caption: User Guide
 :maxdepth: 1
-user_guide/suppoted_features
-user_guide/supported_models
+user_guide/support_matrix/index
+user_guide/configuration/index
+user_guide/feature_guide/index
 user_guide/release_notes
 :::

-% How to contribute to the vLLM project
+% How to contribute to the vLLM Ascend project
 :::{toctree}
 :caption: Developer Guide
 :maxdepth: 1
-developer_guide/contributing
-developer_guide/versioning_policy
-:::
+developer_guide/contribution/index
+developer_guide/feature_guide/index
+developer_guide/evaluation/index
+developer_guide/performance/index
+developer_guide/modeling/index
+:::
+
+% How to get involved in vLLM Ascend
+:::{toctree}
+:caption: Community
+:maxdepth: 1
+community/governance
+community/contributors
+community/versioning_policy
+community/user_stories/index
+:::
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@ -5,15 +5,15 @@ This document describes how to install vllm-ascend manually.
 ## Requirements

 - OS: Linux
- Python: 3.9 or higher
+- Python: >= 3.9, < 3.12
 - A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
 - Software:

-    | Software     | Supported version | Note |
-    | ------------ | ----------------- | ---- | 
-    | CANN         | >= 8.0.0          | Required for vllm-ascend and torch-npu |
-    | torch-npu    | >= 2.5.1rc1       | Required for vllm-ascend |
-    | torch        | >= 2.5.1          | Required for torch-npu and vllm |
+    | Software      | Supported version                | Note                                      |
+    |---------------|----------------------------------|-------------------------------------------|
+    | CANN          | >= 8.1.RC1                       | Required for vllm-ascend and torch-npu    |
+    | torch-npu     | >= 2.5.1.post1.dev20250619       | Required for vllm-ascend. No need to install manually; it is installed automatically in the steps below |
+    | torch         | >= 2.5.1                         | Required for torch-npu and vllm           |

 You have 2 way to install:
 - **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
@ -44,10 +44,12 @@ Refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/

 The easiest way to prepare your software environment is using CANN image directly:

-```bash
+```{code-block} bash
+   :substitutions:
 # Update DEVICE according to your device (/dev/davinci[0-7])
 export DEVICE=/dev/davinci7
-
+# Update the CANN image
+export IMAGE=quay.io/ascend/cann:|cann_image_tag|
 docker run --rm \
    --name vllm-ascend-env \
    --device $DEVICE \
@ -59,40 +61,42 @@ docker run --rm \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
-    -it quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10 bash
+    -v /root/.cache:/root/.cache \
+    -it $IMAGE bash
 ```

+:::{dropdown} Click here to see "Install CANN manually"
+:animate: fade-in-slide-down
 You can also install CANN manually:

-:::{note}
-This guide takes aarch64 as an example. If you run on x86, you need to replace `aarch64` with `x86_64` for the package name shown below.
-:::
-
 ```bash
 # Create a virtual environment
 python -m venv vllm-ascend-env
 source vllm-ascend-env/bin/activate

 # Install required python packages.
-pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs numpy<2.0.0 decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
+pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions

 # Download and install the CANN package.
-wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-toolkit_8.0.0_linux-aarch64.run
-chmod +x ./Ascend-cann-toolkit_8.0.0_linux-aarch64.run
-./Ascend-cann-toolkit_8.0.0_linux-aarch64.run --full
-
-wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run
-chmod +x ./Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run
-./Ascend-cann-kernels-910b_8.0.0_linux-aarch64.run --install
-
-wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-nnal_8.0.0_linux-aarch64.run
-chmod +x. /Ascend-cann-nnal_8.0.0_linux-aarch64.run
-./Ascend-cann-nnal_8.0.0_linux-aarch64.run --install
+wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
+chmod +x ./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
+./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run --full

 source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
+chmod +x ./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
+./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run --install
+
+wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
+chmod +x ./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
+./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run --install
+
 source /usr/local/Ascend/nnal/atb/set_env.sh
 ```

+:::
+
 ::::

 ::::{tab-item} Before using docker
@ -112,50 +116,63 @@ Once it's done, you can start to set up `vllm` and `vllm-ascend`.
 :selected:
 :sync: pip

-You can install `vllm` and `vllm-ascend` from **pre-built wheel**:
+First install the system dependencies and configure the pip mirror:
+
+```bash
+# Using apt-get with mirror
+sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
+apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
+# Or using yum
+# yum update -y && yum install -y gcc gcc-c++ cmake numactl-devel wget git curl jq
+# Configure pip mirror
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+```
+
+**[Optional]** Then configure the extra index of `pip` if you are working on an x86 machine or using a torch-npu dev version:
+
+```bash
+# For torch-npu dev version or x86 machine
+pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
+```
+
+Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:

 ```{code-block} bash
   :substitutions:

-# Install vllm from source, since `pip install vllm` doesn't work on CPU currently.
-# It'll be fixed in the next vllm release, e.g. v0.7.3.
-git clone --branch |pip_vllm_version| https://github.com/vllm-project/vllm
+# Install vllm-project/vllm from pypi
+pip install vllm==|pip_vllm_version|

-cd vllm
-VLLM_TARGET_DEVICE=empty pip install . --extra-index https://download.pytorch.org/whl/cpu/
-
-# Install vllm-ascend from pypi.
-pip install vllm-ascend==|pip_vllm_ascend_version| --extra-index https://download.pytorch.org/whl/cpu/
-
-# Once the packages are installed, you need to install `torch-npu` manually,
-# because that vllm-ascend relies on an unreleased version of torch-npu.
-# This step will be removed in the next vllm-ascend release.
-# 
-# Here we take python 3.10 on aarch64 as an example. Feel free to install the correct version for your environment. See:
-#
-# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py39.tar.gz
-# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
-# https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py311.tar.gz
-#
-mkdir pta
-cd pta
-wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py310.tar.gz
-tar -xvf pytorch_v2.5.1_py310.tar.gz
-pip install ./torch_npu-2.5.1.dev20250218-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
+# Install vllm-project/vllm-ascend from pypi.
+pip install vllm-ascend==|pip_vllm_ascend_version|
 ```

+:::{dropdown} Click here to see "Build from source code"
 or build from **source code**:

 ```{code-block} bash
   :substitutions:

+# Install vLLM
 git clone --depth 1 --branch |vllm_version| https://github.com/vllm-project/vllm
 cd vllm
-VLLM_TARGET_DEVICE=empty pip install . --extra-index https://download.pytorch.org/whl/cpu/
+VLLM_TARGET_DEVICE=empty pip install -v -e .
+cd ..

+# Install vLLM Ascend
 git clone  --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
 cd vllm-ascend
-pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
+pip install -v -e .
+cd ..
+```
+
+vllm-ascend builds custom ops by default. If you don't want to build them, set the environment variable `COMPILE_CUSTOM_KERNELS=0` to disable the build.
+:::
+
+```{note}
+If you are building from v0.7.3-dev and intend to use the sleep mode feature, you should set `COMPILE_CUSTOM_KERNELS=1` manually.
+Building custom ops requires gcc/g++ higher than 8 with support for C++17 or higher. If you're using `pip install -e .` and encounter a torch-npu version conflict, install with `pip install --no-build-isolation -e .` to build against the system environment.
+If you encounter other problems during compilation, an unexpected compiler is probably being used; you can export `CXX_COMPILER` and `C_COMPILER` in your environment to specify the g++ and gcc locations before compiling.
 ```

 ::::
@ -165,14 +182,23 @@ pip install -e . --extra-index https://download.pytorch.org/whl/cpu/

 You can just pull the **prebuilt image** and run it with bash.

+:::{dropdown} Click here to see "Build from Dockerfile"
+or build IMAGE from **source code**:
+
+```bash
+git clone https://github.com/vllm-project/vllm-ascend.git
+cd vllm-ascend
+docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
+```
+:::
+
 ```{code-block} bash
   :substitutions:

 # Update DEVICE according to your device (/dev/davinci[0-7])
-DEVICE=/dev/davinci7
+export DEVICE=/dev/davinci7
 # Update the vllm-ascend image
-IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
-docker pull $IMAGE
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 docker run --rm \
    --name vllm-ascend-env \
    --device $DEVICE \
@ -184,17 +210,11 @@ docker run --rm \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
 ```

-or build IMAGE from **source code**:
-
-```bash
-git clone https://github.com/vllm-project/vllm-ascend.git
-cd vllm-ascend
-docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
-```
-
+The default workdir is `/workspace`. The vLLM and vLLM Ascend code is placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`), so that developers' changes take effect immediately without requiring a new installation.
 ::::

 :::::
@ -231,7 +251,8 @@ for output in outputs:
 Then run:

 ```bash
-# export VLLM_USE_MODELSCOPE=true to speed up download if huggingface is not reachable.
+# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
+# to speed up download if huggingface is not reachable.
 python example.py
 ```

--- a/docs/source/quick_start.md
+++ b/docs/source/quick_start.md
@ -8,15 +8,19 @@

 ## Setup environment using container

+:::::{tab-set}
+::::{tab-item} Ubuntu
+
 ```{code-block} bash
   :substitutions:

-# You can change version a suitable one base on your requirement, e.g. main
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci0
+# Update the vllm-ascend image
 export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
-
-docker run \
+docker run --rm \
 --name vllm-ascend \
--device /dev/davinci0 \
+--device $DEVICE \
 --device /dev/davinci_manager \
 --device /dev/devmm_svm \
 --device /dev/hisi_hdc \
@ -28,23 +32,61 @@ docker run \
 -v /root/.cache:/root/.cache \
 -p 8000:8000 \
 -it $IMAGE bash
+# Install curl
+apt-get update -y && apt-get install -y curl
 ```
+::::
+
+::::{tab-item} openEuler
+
+```{code-block} bash
+   :substitutions:
+
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci0
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-openeuler
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-it $IMAGE bash
+# Install curl
+yum update -y && yum install -y curl
+```
+::::
+:::::
+
+The default workdir is `/workspace`. The vLLM and vLLM Ascend code is placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`), so that developers' changes take effect immediately without requiring a new installation.

 ## Usage

-There are two ways to start vLLM on Ascend NPU:
-
-### Offline Batched Inference with vLLM
-
-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing).
+You can use the ModelScope mirror to speed up downloads:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```bash
-# Use Modelscope mirror to speed up download
 export VLLM_USE_MODELSCOPE=true
 ```

+There are two ways to start vLLM on Ascend NPU:
+
+:::::{tab-set}
+::::{tab-item} Offline Batched Inference
+
+With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing).
+
 Try to run below Python script directly or use `python3` shell to generate texts:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```python
 from vllm import LLM, SamplingParams

@ -64,15 +106,16 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```

-### OpenAI Completions API with vLLM
+::::
+
+::::{tab-item} OpenAI Completions API

 vLLM can also be deployed as a server that implements the OpenAI API protocol. Run
 the following command to start the vLLM server with the
 [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```bash
-# Use Modelscope mirror to speed up download
-export VLLM_USE_MODELSCOPE=true
 # Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
 vllm serve Qwen/Qwen2.5-0.5B-Instruct &
 ```
@ -89,12 +132,14 @@ Congratulations, you have successfully started the vLLM server!

 You can query the list the models:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```bash
 curl http://localhost:8000/v1/models | python3 -m json.tool
 ```

 You can also query the model with input prompts:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```bash
 curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
@ -109,10 +154,10 @@ curl http://localhost:8000/v1/completions \
 vLLM is serving as background process, you can use `kill -2 $VLLM_PID` to stop the background process gracefully,
 it's equal to `Ctrl-C` to stop foreground vLLM process:

+<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
 ```bash
-ps -ef | grep "/.venv/bin/vllm serve" | grep -v grep
-VLLM_PID=`ps -ef | grep "/.venv/bin/vllm serve" | grep -v grep | awk '{print $2}'`
-kill -2 $VLLM_PID
+  VLLM_PID=$(pgrep -f "vllm serve")
+  kill -2 "$VLLM_PID"
 ```

 You will see output as below:
@ -124,3 +169,5 @@ INFO:     Application shutdown complete.
 ```

 Finally, you can exit container by using `ctrl-D`.
+::::
+:::::
--- a/docs/source/tutorials.md
+++ b/docs/source/tutorials.md
@ -1,311 +0,0 @@
-# Tutorials
-
-## Run vllm-ascend on Single NPU
-
-### Offline Inference on Single NPU
-
-Run docker container:
-
-```{code-block} bash
-   :substitutions:
-docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| bash
-```
-
-Setup environment variables:
-
-```bash
-# Use Modelscope mirror to speed up model download
-export VLLM_USE_MODELSCOPE=True
-
-# To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
-export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
-```
-
-:::{note}
-`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
-:::
-
-Run the following script to execute offline inference on a single NPU:
-
-```python
-from vllm import LLM, SamplingParams
-
-prompts = [
-    "Hello, my name is",
-    "The future of AI is",
-]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=26240)
-
-outputs = llm.generate(prompts, sampling_params)
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-
-If you run this script successfully, you can see the info shown below:
-
-```bash
-Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
-Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
-```
-
-### Online Serving on Single NPU
-
-Run docker container to start the vLLM server on a single NPU:
-
-```{code-block} bash
-   :substitutions:
-
-docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| \
-vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
-```
-
-:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240). This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
-:::
-
-If your service start successfully, you can see the info shown below:
-
-```bash
-INFO:     Started server process [6873]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-```
-
-Once your server is started, you can query the model with input prompts:
-
-```bash
-curl http://localhost:8000/v1/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": "Qwen/Qwen2.5-7B-Instruct",
-        "prompt": "The future of AI is",
-        "max_tokens": 7,
-        "temperature": 0
-    }'
-```
-
-If you query the server successfully, you can see the info shown below (client):
-
-```bash
-{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"text":" here. It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
-```
-
-Logs of the vllm server:
-
-```bash
-INFO:     172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
-INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
-```
-
-## Run vllm-ascend on Multi-NPU
-
-### Distributed Inference on Multi-NPU
-
-Run docker container:
-
-```{code-block} bash
-   :substitutions:
-
-docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:|vllm_ascend_version| bash
-```
-
-Setup environment variables:
-
-```bash
-# Use Modelscope mirror to speed up model download
-export VLLM_USE_MODELSCOPE=True
-
-# To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
-export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
-```
-
-Run the following script to execute offline inference on multi-NPU:
-
-```python
-import gc
-
-import torch
-
-from vllm import LLM, SamplingParams
-from vllm.distributed.parallel_state import (destroy_distributed_environment,
-                                             destroy_model_parallel)
-
-def clean_up():
-    destroy_model_parallel()
-    destroy_distributed_environment()
-    gc.collect()
-    torch.npu.empty_cache()
-
-prompts = [
-    "Hello, my name is",
-    "The future of AI is",
-]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
-          tensor_parallel_size=2,
-          distributed_executor_backend="mp",
-          max_model_len=26240)
-
-outputs = llm.generate(prompts, sampling_params)
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-del llm
-clean_up()
-```
-
-If you run this script successfully, you can see the info shown below:
-
-```bash
-Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
-Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
-```
-
-## Online Serving on Multi Machine
-
-Run docker container on each machine:
-
-```shell
-docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2\
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:v0.7.1rc1 bash
-```
-
-Choose one machine as head node, the other are worker nodes, then start ray on each machine:
-:::{note} Check out your `nic_name` by command `ip addr`  :::
-
-```shell
-# Head node
-export HCCL_IF_IP={local_ip}
-export GLOO_SOCKET_IFNAME={nic_name}
-export TP_SOCKET_IFNAME={nic_name}
-export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --head --num-gpus=8
-
-# Worker node
-export HCCL_IF_IP={local_ip}
-export ASCEND_PROCESS_LOG_PATH={plog_save_path}
-export GLOO_SOCKET_IFNAME={nic_name}
-export TP_SOCKET_IFNAME={nic_name}
-export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 
-export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
-```
-
-Start the vLLM server on head node:
-
-```shell
-export VLLM_HOST_IP={head_node_ip}
-export HCCL_CONNECT_TIMEOUT=120
-export ASCEND_PROCESS_LOG_PATH={plog_save_path}
-export HCCL_IF_IP={head_node_ip}
-
-if [ -d "{plog_save_path}" ]; then
-    rm -rf {plog_save_path}
-    echo ">>> remove {plog_save_path}"
-fi
-
-LOG_FILE="multinode_$(date +%Y%m%d_%H%M).log"
-VLLM_TORCH_PROFILER_DIR=./vllm_profile
-python -m vllm.entrypoints.openai.api_server  \
-       --model="Deepseek/DeepSeek-V2-Lite-Chat" \
-       --trust-remote-code \
-       --enforce-eager \
-       --max-model-len {max_model_len} \
-       --distributed_executor_backend "ray" \
-       --tensor-parallel-size 16 \
-       --disable-log-requests \
-       --disable-log-stats \
-       --disable-frontend-multiprocessing \
-       --port {port_num} \
-```
-
-Once your server is started, you can query the model with input prompts:
-
-```shell
-curl -X POST http://127.0.0.1:{prot_num}/v1/completions  \
-     -H "Content-Type: application/json" \
-     -d '{
-         "model": "Deepseek/DeepSeek-V2-Lite-Chat",
-         "prompt": "The future of AI is",
-         "max_tokens": 24
-     }'
-```
-
-If you query the server successfully, you can see the info shown below (client):
-
-```
-{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
-```
-
-Logs of the vllm server:
-
-```
-INFO:     127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
-INFO 02-19 17:37:35 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
-```
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@ -0,0 +1,16 @@
+# Tutorials
+
+:::{toctree}
+:caption: Deployment
+:maxdepth: 1
+single_npu
+single_npu_multimodal
+single_npu_audio
+single_npu_qwen3_embedding
+multi_npu
+multi_npu_moge
+multi_npu_qwen3_moe
+multi_npu_quantization
+single_node_300i
+multi_node
+:::
--- a/docs/source/tutorials/multi_node.md
+++ b/docs/source/tutorials/multi_node.md
@ -0,0 +1,198 @@
+# Multi-Node-DP (DeepSeek)
+
+## Getting Started
+vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
+
+Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
+
+For Mixture-of-Experts (MoE) models, especially advanced architectures like DeepSeek that use Multi-head Latent Attention (MLA), a hybrid parallelism approach is recommended:
+- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
+- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
+
+With this division, attention layers are replicated across Data Parallel (DP) ranks, so each replica processes different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism in a group of size DP x TP, maximizing hardware utilization and efficiency.
+
+In these cases the data parallel ranks are not completely independent: forward passes must be aligned, and expert layers across all ranks must synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
+
+For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don’t currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
+
+## Verify Multi-Node Communication Environment
+
+### Physical Layer Requirements:
+
+- The physical machines must be located on the same LAN and have network connectivity to each other.
+- All NPUs are connected with optical modules, and the connection status must be normal.
+
+### Verification Process:
+
+Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
+
+```bash
+ # Check the remote switch ports
+ for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done 
+ # Get the link status of the Ethernet ports (UP or DOWN)
+ for i in {0..7}; do hccn_tool -i $i -link -g ; done
+ # Check the network health status
+ for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
+ # View the network detected IP configuration
+ for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
+ # View gateway configuration
+ for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
+ # View NPU network configuration
+ cat /etc/hccn.conf
+```
+
+### NPU Interconnect Verification:
+#### 1. Get NPU IP Addresses
+```bash
+for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
+```
+
+#### 2. Cross-Node PING Test
+```bash
+# Execute on the target node (replace with actual IP)
+hccn_tool -i 0 -ping -g address 10.20.0.20
+```
+
+## Run with docker
+Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across both nodes.
+
+```{code-block} bash
+   :substitutions:
+# Update the vllm-ascend image
+export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+export NAME=vllm-ascend
+
+# Run the container using the defined variables
+# Note: if Docker is running with a bridge network, expose the ports needed for multi-node communication in advance
+docker run --rm \
+--name $NAME \
+--net=host \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci4 \
+--device /dev/davinci5 \
+--device /dev/davinci6 \
+--device /dev/davinci7 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /mnt/sfs_turbo/.cache:/root/.cache \
+-it $IMAGE bash
+```
+
+:::{note}
+Before launching the inference server, ensure that the environment variables required for multi-node communication are set.
+:::
+
+Run the following scripts on the two nodes, respectively:
+
+**node0**
+```shell
+#!/bin/sh
+
+# local_ip can be obtained through ifconfig
+# nic_name is the network interface name corresponding to local_ip
+nic_name="xxxx"
+local_ip="xxxx"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=100
+export HCCL_BUFFSIZE=1024
+
+# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
+# If you want to run the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
+vllm serve /root/.cache/ds_v3 \
+--host 0.0.0.0 \
+--port 8004 \
+--data-parallel-size 4 \
+--data-parallel-size-local 2 \
+--data-parallel-address $local_ip \
+--data-parallel-rpc-port 13389 \
+--tensor-parallel-size 4 \
+--seed 1024 \
+--served-model-name deepseek_v3 \
+--enable-expert-parallel \
+--max-num-seqs 16 \
+--max-model-len 32768 \
+--quantization ascend \
+--max-num-batched-tokens 4096 \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.9 \
+--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+```
+
+**node1**
+```shell
+#!/bin/sh
+
+nic_name="xxx"
+local_ip="xxx"
+
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=100
+export VLLM_USE_V1=1
+export HCCL_BUFFSIZE=1024
+
+vllm serve /root/.cache/ds_v3 \
+--host 0.0.0.0 \
+--port 8004 \
+--headless \
+--data-parallel-size 4 \
+--data-parallel-size-local 2 \
+--data-parallel-start-rank 2 \
+--data-parallel-address { node0 ip } \
+--data-parallel-rpc-port 13389 \
+--tensor-parallel-size 4 \
+--seed 1024 \
+--quantization ascend \
+--served-model-name deepseek_v3 \
+--max-num-seqs 16 \
+--max-model-len 32768 \
+--max-num-batched-tokens 4096 \
+--enable-expert-parallel \
+--trust-remote-code \
+--no-enable-prefix-caching \
+--gpu-memory-utilization 0.92 \
+--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+```
+
+The deployment view looks like:
+![alt text](../assets/multi_node_dp.png)
+
+Once your server is started, you can query the model with input prompts:
+
+```shell
+curl http://{ node0 ip }:8004/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "deepseek_v3",
+        "prompt": "The future of AI is",
+        "max_tokens": 50,
+        "temperature": 0
+    }'
+```
+
+## Run benchmarks
+For details, please refer to the [benchmarks](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) directory.
+```shell
+vllm bench serve --model /root/.cache/ds_v3  --served-model-name deepseek_v3 \
+--dataset-name random --random-input-len 128 --random-output-len 128 \
+--num-prompts 200  --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
+```
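The parallelism flags in the `node0` and `node1` scripts of `multi_node.md` can be cross-checked with a little arithmetic. The following is a minimal sketch, not vLLM code: the `ParallelLayout` class and its field names are hypothetical, and it assumes identical nodes where every TP group fits inside a single node (8 NPUs per node, as in the two-node example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper that derives per-node DP settings from DP/TP sizes."""

    dp_size: int        # value passed to --data-parallel-size
    tp_size: int        # value passed to --tensor-parallel-size
    npus_per_node: int  # assumption: every node has the same NPU count

    @property
    def world_size(self) -> int:
        # Total NPUs in use; also the size of the EP/TP group for expert layers (DP x TP).
        return self.dp_size * self.tp_size

    @property
    def dp_size_local(self) -> int:
        # Number of DP ranks hosted by one node (--data-parallel-size-local).
        if self.npus_per_node % self.tp_size != 0:
            raise ValueError("tp_size must divide npus_per_node")
        return self.npus_per_node // self.tp_size

    @property
    def num_nodes(self) -> int:
        if self.dp_size % self.dp_size_local != 0:
            raise ValueError("dp_size must be a multiple of dp_size_local")
        return self.dp_size // self.dp_size_local

    def start_rank(self, node_index: int) -> int:
        # First DP rank on a node (--data-parallel-start-rank).
        if not 0 <= node_index < self.num_nodes:
            raise IndexError("node_index out of range")
        return node_index * self.dp_size_local


layout = ParallelLayout(dp_size=4, tp_size=4, npus_per_node=8)
for node in range(layout.num_nodes):
    print(f"node{node}: --data-parallel-size-local {layout.dp_size_local} "
          f"--data-parallel-start-rank {layout.start_rank(node)}")
```

With DP=4 and TP=4 this gives 16 NPUs across two nodes, two DP ranks per node, and a start rank of 2 for the second node, which agrees with the flags used in the scripts.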
--- a/docs/source/tutorials/multi_npu.md
+++ b/docs/source/tutorials/multi_npu.md
@ -0,0 +1,107 @@
+# Multi-NPU (QwQ 32B)
+
+## Run vllm-ascend on Multi-NPU
+
+Run docker container:
+
+```{code-block} bash
+   :substitutions:
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-it $IMAGE bash
+```
+
+Set up environment variables:
+
+```bash
+# Load model from ModelScope to speed up download
+export VLLM_USE_MODELSCOPE=True
+
+# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
+export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
+```
+
+### Online Inference on Multi-NPU
+
+Run the following script to start the vLLM server on Multi-NPU:
+
+```bash
+vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/QwQ-32B",
+        "prompt": "QwQ-32B是什么？",
+        "max_tokens": 128,
+        "top_p": 0.95,
+        "top_k": 40,
+        "temperature": 0.6
+    }'
+```
+
+### Offline Inference on Multi-NPU
+
+Run the following script to execute offline inference on multi-NPU:
+
+```python
+import gc
+
+import torch
+
+from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (destroy_distributed_environment,
+                                             destroy_model_parallel)
+
+def clean_up():
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    gc.collect()
+    torch.npu.empty_cache()
+
+prompts = [
+    "Hello, my name is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
+llm = LLM(model="Qwen/QwQ-32B",
+          tensor_parallel_size=4,
+          distributed_executor_backend="mp",
+          max_model_len=4096)
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+del llm
+clean_up()
+```
+
+If you run this script successfully, you can see the info shown below:
+
+```bash
+Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
+Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
+```
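The `curl` request in `multi_npu.md` can also be issued from Python. This is a minimal sketch using only the standard library. It assumes the server was started with `vllm serve Qwen/QwQ-32B --port 8000` and is reachable at `localhost`. The helper names `build_completion_request` and `complete` are hypothetical, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, **sampling):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, **sampling}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60.0):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Sampling parameters are sent as JSON numbers, not strings.
request = build_completion_request(
    "http://localhost:8000",
    "Qwen/QwQ-32B",
    "QwQ-32B是什么？",
    max_tokens=128,
    top_p=0.95,
    top_k=40,
    temperature=0.6,
)

# With the server running, send the request like this:
# print(complete(request))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.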