Compare commits

..

197 Commits

Author SHA1 Message Date
JohnJan 54f2b31184
[Doc] Add a doc for qwen omni (#1867)
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>

### What this PR does / why we need it?
Add FAQ note for qwen omni
Fixes issue 1 of https://github.com/vllm-project/vllm-ascend/issues/1760



- vLLM version: v0.9.2
- vLLM main:
b9a21e9173
2025-07-20 09:05:41 +08:00
wangxiyuan 2b726d8f90
[CI] Fix broken CI (#1889)
1. vLLM commit
45badd05d0
changed the pooling check logic, which broke vLLM Ascend.
2. vLLM commit
3e04107d97
requires a higher version of transformers. The transformers version bug
has been fixed by
e936e401de,
so it is now safe to remove the version limit.
3. vLLM commit
217937221b
added a new input `enable_eplb` for the FusedMoE ops.

This PR fixes the broken CI.


- vLLM version: v0.9.2
- vLLM main:
6a971ed692

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-20 02:11:57 +08:00
leo-pony 2ee90461d0
Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix e2e data parallel test: add resource release code and give engines more
time to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
xleoken b824525be3
Move deepseek_v3 from deepseek_v2.py (#1793)
### What this PR does / why we need it?
Before this patch, DeepSeek V3 was registered as
`vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM`, which is not a
friendly format.

```
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
```


### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.


- vLLM version: v0.9.2
- vLLM main:
bcdfb2a330

Signed-off-by: xleoken <xleoken@163.com>
2025-07-19 11:37:03 +08:00
Shanshan Shen ab68d31a24
[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
5895afd780

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-19 09:42:32 +08:00
lianyibo 53d2ea3789
[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811)
### What this PR does / why we need it?

maybe fixes
[#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test Qwen3-32B tp=4 with: 

```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests
```

Requests: batch_size=128, input/output tokens=1024

**In 0.9.2rc1**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 785.1395
Total TPS without prefill                   : 846.6809
Mean TPS with    prefill                    : 6.1339
Mean TPS without prefill                    : 6.6147
=====================================================
Mean TTFT(ms)                               : 10307.8123
Max  TTFT(ms)                               : 21423.0733
Min  TTFT(ms)                               : 362.3602
=====================================================
Mean TPOT(ms)                               : 151.3051
Max  TPOT(ms)                               : 159.4649
Min  TPOT(ms)                               : 140.899
=====================================================
Total Time(s)                               : 175.6032
Request Throughput(requests/s)              : 0.7289
=====================================================
```

**Apply this PR**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 811.0014
Total TPS without prefill                   : 876.4423
Mean TPS with    prefill                    : 6.3359
Mean TPS without prefill                    : 6.8472
=====================================================
Mean TTFT(ms)                               : 10263.8382
Max  TTFT(ms)                               : 21151.2547
Min  TTFT(ms)                               : 375.9136
=====================================================
Mean TPOT(ms)                               : 146.1686
Max  TPOT(ms)                               : 154.0957
Min  TPOT(ms)                               : 136.8879
=====================================================
Total Time(s)                               : 169.8579
Request Throughput(requests/s)              : 0.7536
=====================================================
```

The TPOT performance gap between these two sets of data is about 3%.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
2025-07-18 23:09:54 +08:00
Mengqing Cao 574fe407eb
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841)
### What this PR does / why we need it?
We'll refactor `CustomOp` in vllm-ascend starting from this PR.

Use the `CustomOp.register_oot` function to register the custom op, taking
`AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```

This is a quick adaptation to the `CustomOp.register_oot` mechanism from vLLM
0.9.2. As a further step, we can drop the inheritance from `QuickGELU` and
write our own `QuickGELU` entirely, as sketched below.
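
As an illustration only (assumed shape, not code from this PR), a fully
standalone `AscendQuickGELU` could look like the following once the
inheritance from vLLM's `QuickGELU` is dropped:

```python
import torch
from vllm.model_executor.custom_op import CustomOp


class AscendQuickGELU(CustomOp):
    """Standalone QuickGELU: x * sigmoid(1.702 * x)."""

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference implementation used when no out-of-tree backend is active.
        return x * torch.sigmoid(1.702 * x)

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # NPU path; a dedicated Ascend kernel could be called here instead.
        return x * torch.sigmoid(1.702 * x)
```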

Part of https://github.com/vllm-project/vllm-ascend/pull/1647



- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-18 23:07:14 +08:00
Shanshan Shen 8a91e6e59c
[Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
li chaoran 3e39d7234c
[CI] Switching to infra cache server to reduce network pressure (#1792)
### What this PR does / why we need it?
This PR introduce the infra cache server to speed up apt/pip package
installation

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Tested locally. With this config, the network bandwidth usage dropped from 100%
to 5% when a new PR was submitted.
<img width="807" height="334" alt="image"
src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c"
/>


- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-07-18 18:39:25 +08:00
Shanshan Shen d08ff304cd
[Misc][V0 Deprecation] Remove V0 Attention (#1835)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 14:10:13 +08:00
xudongLi-cmss 33ef5dc813
add unit test for func wrapper (#1863)
### What this PR does / why we need it?
Add unit tests for the func wrapper file.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
2025-07-18 11:05:17 +08:00
Li Wang f9dfde02fd
[Bugfix] Fix broken CI (#1848)
### What this PR does / why we need it?
- Fix broken commit by
[#20927](https://github.com/vllm-project/vllm/pull/20927)
- Fix broken commit by
[#20466](https://github.com/vllm-project/vllm/pull/20466)
- TODO: adapt more fully to the upstream refactoring; for now, let's just
make CI happy

- vLLM version: v0.9.2
- vLLM main:
11dfdf21bf

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-17 20:10:12 +08:00
Zhu Yi Lin 538dd357e6
Add graph mode and improve on multi_npu_moge.md (#1849)
### What this PR does / why we need it?
Add graph mode and improve on multi_npu_moge.md

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI passed with existing tests.


- vLLM version: v0.9.2
- vLLM main:
5a7fb3ab9e

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-07-17 17:53:37 +08:00
Shanshan Shen aeb5aa8b88
[Misc][V0 Deprecation] Add `__main__` guard to all offline examples (#1837)
### What this PR does / why we need it?
Add `__main__` guard to all offline examples.
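
For reference, the guard pattern added to the examples looks like this (a
minimal sketch; model name and prompt are placeholders, not taken from the
actual examples):

```python
from vllm import LLM, SamplingParams


def main():
    llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    # The guard prevents worker processes spawned by the V1 engine from
    # re-running the example body when they import this module.
    main()
```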

- vLLM version: v0.9.2
- vLLM main:
76b494444f

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 14:13:30 +08:00
Shanshan Shen 19e37cd379
[Misc] Add `fusion_result.json` to `.gitignore` (#1836)
### What this PR does / why we need it?
Add `fusion_result.json` to `.gitignore`.



- vLLM version: v0.9.2
- vLLM main:
72ad273582

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 11:54:49 +08:00
Icey 875a920d4a
[Platform] Add support for Atlas A3 series (#1794)
### What this PR does / why we need it?
Add support for Ascend A3 and remove latest tag

### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas A3 series

### How was this patch tested?
CI passed with:

- remove latest tag test:
https://github.com/wxsIcey/wxs-vllm-ascend/actions/runs/16267635040/job/45926924765
- E2E image build for A3
- CI test on A3 with e2e test and longterm test
- Unit tests are missing because real A3 hardware is needed for testing

Closes: https://github.com/vllm-project/vllm-ascend/issues/1696


- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-17 11:13:02 +08:00
wangxiyuan ef99fe1c54
[Test] Clean up duplicate test for ascend scheduler (#1819)
There are some duplicate tests for the Ascend scheduler. This PR removes them
to make the tests clearer.

After this PR, the single-card e2e time is reduced from 47 min to 46 min.

- vLLM version: v0.9.2
- vLLM main:
1eb2b9c102

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-16 17:57:48 +08:00
Shanshan Shen c66b0827a7
[Misc][V0 Deprecation] Remove Pooling Model Runner (#1824)
### What this PR does / why we need it?
Remove pooling model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
d31a647124

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 17:48:21 +08:00
Yikun Jiang ba7e934b21
Remove redundant empty lines in commit msg (#1814)
### What this PR does / why we need it?
Remove redundant empty lines in commit msg

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
test locally: https://github.com/Yikun/vllm-ascend/pull/48

- vLLM version: v0.9.2
- vLLM main:
d0dc4cfca4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-16 16:50:44 +08:00
Shanshan Shen 06655002c5
[Misc][V0 Deprecation] Remove V0 Worker (#1821)
### What this PR does / why we need it?
Remove V0 worker.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
6cbc4d4bea

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:07:17 +08:00
Shanshan Shen b005def0a5
[Misc][V0 Deprecation] Remove Multi-Step Model Runner (#1820)
### What this PR does / why we need it?
Remove multi-step model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.



- vLLM version: v0.9.2
- vLLM main:
34cda778a0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:06:49 +08:00
Shanshan Shen f9e2e9bb31
[Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen f96100fad5
[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
Shanshan Shen a929699e98
[Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
wangxiyuan bf2549856f
[CI] Fix changes CI to recover codecov (#1799)
Add a `checkout` action before `dorny/paths-filter` to make it work with the
`push` case.
This is a known issue: `dorny/paths-filter` works without `checkout` in the
`pull_request` case but fails in the `push` case. More detail is here:
https://github.com/dorny/paths-filter/issues/60#issuecomment-1464281021

The push CI works after this PR. The test result is here:

https://github.com/wangxiyuan/vllm-ascend/actions/runs/16285606468/job/45983607539
- vLLM version: v0.9.2
- vLLM main:
d4d309409f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 15:01:13 +08:00
wangxiyuan 787010a637
[Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, so there is no need to set it by hand now. This PR
removes the useless setting from examples and tests.

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
wangxiyuan eb921d2b6f
[Doc] Fix 404 error (#1797)
Fix URL 404 errors in the docs
- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:38 +08:00
wangxiyuan 7bdada58eb
[Misc] Remove VLLM_USE_V1 usage in code (#1764)
We plan to remove V0 code in this version. The first step is to delete
V0 usage.

Related: https://github.com/vllm-project/vllm-ascend/issues/1620

- vLLM version: v0.9.2
- vLLM main:
61e20828da

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:16 +08:00
wangxiyuan 494b0f474f
[CI]Fix broken CI (#1773)
This PR fixes the broken CI. It requires
https://github.com/vllm-project/vllm/pull/20900 to be merged first.

- vLLM version: v0.9.2
- vLLM main:
e8cc53af5e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 00:54:20 +08:00
Li Wang afcfe91dfa
[Doc] Fix multi node doc (#1783)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Pin docker image to latest release
### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
1e9438e0b0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-14 17:56:57 +08:00
zhangxinyuehfad cabfb2bc31
[Test] Resolve vllm-ascend version accuracy test (#1769)
### What this PR does / why we need it?
Resolve vllm-ascend version for accuracy test

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-14 15:43:37 +08:00
Shanshan Shen d3c6dd985a
[Misc] Add `include` dir to `.gitignore` (#1771)
### What this PR does / why we need it?
Add `include` dir to `.gitignore`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
66f6fbd393

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-14 12:05:29 +08:00
Li Wang 9cd4ac76a1
[CI] Remove benchmark patch and increase the scheduler frequency (#1762)
### What this PR does / why we need it?
This PR does the following:
1. Remove the `benchmark_datasets.py` patch
2. Increase the scheduler frequency to 2 times per day; due to the
recent large number of daily submissions, we need to increase the
default test time (6h)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
247102f07f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-13 20:00:35 +08:00
Yikun Jiang d118bf8a26
Update README.zh.md to fix typo (#1758)
### What this PR does / why we need it?


Update README.zh.md to fix typo

### Does this PR introduce _any_ user-facing change?


No

### How was this patch tested?


CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 14:01:34 +08:00
Yikun Jiang eff4b5791c
Recover offline_inference_npu.py to make doctest passed (#1756)
### What this PR does / why we need it?
Rename offline_inference_npu_v1.py to offline_inference_npu.py to
recover doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
a8593237c0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:36:35 +08:00
Yikun Jiang 8b3a483269
Add recommend version and refresh readme / contribution.md (#1757)
### What this PR does / why we need it?
Add recommended version and refresh contribution.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
890323dc1b

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:35:40 +08:00
wangxiyuan 3c404de1b1
[Release]Update release note (#1753)
There are still issues with PP in some cases, such as aclgraph and ray. Remove
the related doc from the release note.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:58:26 +08:00
wangxiyuan b5b7e0ecc7
[Doc] Add qwen3 embedding 8b guide (#1734)
1. Add the tutorial for qwen3-embedding-8b
2. Remove VLLM_USE_V1=1 from the docs; it's no longer needed since 0.9.2


- vLLM version: v0.9.2
- vLLM main:
5923ab9524

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:40:17 +08:00
wangxiyuan 9c560b009a
[Release] Add 0.9.2rc1 release note (#1725)
Add release note for 0.9.2rc1, we'll release soon

- vLLM version: v0.9.2
- vLLM main:
7bd4c37ae7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:36:05 +08:00
zhangxinyuehfad 1b4a2f3817
[CI] Add accuracy ci for DP and EP and TP and ETP (#1140)
### What this PR does / why we need it?

Add accuracy CI for DP, EP and TP

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
35514b682a

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 17:25:17 +08:00
Pr0Wh1teGivee d13fb0766e
[Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Set `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` to enable this feature; a usage sketch follows below.
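
A minimal usage sketch (the env var name comes from this PR; the model and
sampling values are illustrative assumptions):

```python
import os

# Enable the optimized top-k/top-p path before vLLM is initialized.
os.environ["VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model
params = SamplingParams(temperature=0.8, top_k=50, top_p=0.95)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```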
### How was this patch tested?
e2e & ut

- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
weiguihua2 aa4240c67f
Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1
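
A minimal usage sketch (assumed invocation, not taken from this PR; the model
and parallel sizes are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",      # placeholder model
    tensor_parallel_size=2,      # split each layer across 2 NPUs
    pipeline_parallel_size=2,    # split the layer stack across 2 more NPUs
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```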

### How was this patch tested?
Manually tested.

- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
zhangxinyuehfad 1cd27da5fb
[Test] Remove VLLM_USE_V1 in accuracy test (#1739)
### What this PR does / why we need it?
Remove VLLM_USE_V1 in accuracy test

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 15:29:11 +08:00
ttanzhiqiang ee40d3d850
use npu_moe_gating_top_k_softmax (#1355)
### What this PR does / why we need it?
The optimization for non-DeepSeek select_experts is to replace the
softmax+topk+to sequence with the fused npu_moe_gating_top_k_softmax op,
which is optimized from 37us to 14us on bf16/fp16 of Qwen3-235B; see the
sketch below.
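
A sketch of the unfused path that is being replaced (plain PyTorch, shown for
clarity); the fused call is only indicated in a comment because its exact
signature is an assumption here:

```python
import torch


def select_experts_unfused(router_logits: torch.Tensor, top_k: int):
    # Three separate kernels: softmax over experts, top-k, then dtype cast.
    scores = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights.to(router_logits.dtype), topk_ids


# Fused replacement (assumed call shape, single kernel on NPU):
# topk_weights, topk_ids, row_idx = torch_npu.npu_moe_gating_top_k_softmax(
#     router_logits, None, k=top_k)
```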

- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:55:06 +08:00
ttanzhiqiang 9d16c9982e
rm router logits Improve TTOP 3ms (#1407)
### What this PR does / why we need it?

The previous code was:
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)
This PR changes the two all_gathers into one, saving one all_gather
communication:
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

gsm8k accuracy verification
<img width="1809" alt="截屏2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>



- vLLM version: v0.9.2
- vLLM main:
77f77a951e

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:53:17 +08:00
ApsarasX 0fc9b56d40
[Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="截屏2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="截屏2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
Mengqing Cao cc210f46e6
[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718)
### What this PR does / why we need it?

Now there is no need to calculate `num_draft_tokens` when allocating
slots.

This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test

- vLLM version: v0.9.2
- vLLM main:
cc876d0f29

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 18:47:45 +08:00
wangxiyuan 011fd73a48
[CI] Make CI tracker more clear (#1720)
1. enable lint check for all changes
2. only run ut and e2e if the change touches code
3. only run ut and disable e2e if the change is ut-only
4. disable wheel build for the push case
5. run unit tests when a PR is merged
6. remove useless pytest.ini




- vLLM version: v0.9.2
- vLLM main:
fdfd409f8f

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 16:03:23 +08:00
wangxiyuan 3d1e6a5929
[Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer.
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 14:26:59 +08:00
Li Wang c7446438a9
[1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid some low-level errors as much as possible.

This PR is one step of #1241. The purpose is to make the linting system
clearer and more convenient. This step mainly adds the following hooks:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO:
- clang-format (check csrc with Google style): needs code cleanup, disabled for now
- pymarkdown: needs code cleanup, disabled for now
- shellcheck: needs code cleanup, disabled for now

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
ApsarasX 643e6f5486
[Bugfix] Fix accuracy problem caused by mask pollution (#1678)
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.

The root cause of this problem is the use of in-place multiplication.
Switching to out-of-place multiplication resolves the accuracy problem, as
illustrated in the sketch below.
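
A minimal illustration of the failure mode (names are illustrative, not the
exact vllm-ascend code):

```python
import torch

attn_mask_cache = torch.ones(128, 128)  # shared across requests


def build_mask_inplace(scale: float) -> torch.Tensor:
    # In-place multiply mutates the shared cache and pollutes later requests.
    return attn_mask_cache.mul_(scale)


def build_mask_out_of_place(scale: float) -> torch.Tensor:
    # Out-of-place multiply returns a new tensor; the cache stays intact.
    return attn_mask_cache * scale
```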


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 14:06:49 +08:00
ttanzhiqiang 60519c71bd
shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
With all_reduce_merge enabled, shared_experts no longer performs an all_reduce
inside the MLP; instead, the all_reduce is deferred until both shared_experts
and router_experts have finished.
In both prefill and decode, merging the shared_experts and router_experts
all_reduce into one brings benefits; see the sketch below.
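
A sketch of the idea (illustrative only, assuming a tensor-parallel process
group; this is not the vllm-ascend implementation):

```python
import torch
import torch.distributed as dist


def moe_forward(hidden_states, shared_experts, routed_experts, tp_group):
    shared_out = shared_experts(hidden_states)   # no all_reduce inside the MLP
    routed_out = routed_experts(hidden_states)
    out = shared_out + routed_out
    dist.all_reduce(out, group=tp_group)         # one merged all_reduce at the end
    return out
```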
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
Yikun Jiang 997f156a51
Use ci_vllm_version when recording vLLM commit (#1689)
### What this PR does / why we need it?
Use ci_vllm_version when recording vllm commit

Followup on https://github.com/vllm-project/vllm-ascend/pull/1623

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test manually.
$ python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"'
v0.9.2
- Test on my local repo: https://github.com/Yikun/vllm-ascend/pull/35

- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-10 11:07:27 +08:00
ApsarasX 89c1a0f006
[Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380)
### What this PR does / why we need it?
In some cases, `dist._functional_collectives.reduce_scatter_tensor` can
cause its input tensor not to be released immediately after the current
layer ends. Instead, it will only be released when the GPU memory usage
of the current process reaches a certain threshold (approximately once every
15 layers).

**Before Fix**

<img width="1441" alt="截屏2025-06-24 01 26 13"
src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0"
/>

**After Fix**
<img width="1475" alt="截屏2025-06-24 01 23 43"
src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.9.1
- vLLM main:
9ff2af6d2b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 10:57:24 +08:00
Mengqing Cao b1c66b211f
[CI] Fix lint in CI (#1712)
### What this PR does / why we need it?
Fix lint in CI
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 10:47:18 +08:00
Li Wang 0c4aa2b4f1
[Doc] Add multi node data parallel doc (#1685)
### What this PR does / why we need it?
Add a multi-node data parallel doc.
### Does this PR introduce _any_ user-facing change?
Add a multi-node data parallel doc.
### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
805d62ca88

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 09:36:37 +08:00
leo-pony b4b19ea588
[Doc] Add multi-npu qwen3-MoE-32B Tutorials (#1419)
Signed-off-by: leo-pony <nengjunma@outlook.com>

### What this PR does / why we need it?
Add multi-npu qwen3-MoE-32B Tutorials
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-10 09:06:51 +08:00
xleoken 3ef45d0cc2
feat: Improve the offline_inference npu v0/v1 scripts (#1669)
### What this PR does / why we need it?

Improve
- Keep the same file name format as v1, `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 clearly in py scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` does not exist on ModelScope or HF.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: xleoken <xleoken@163.com>
2025-07-09 17:03:53 +08:00
Shanshan Shen 6af35f60cc
[Bugfix][CI] Remove V0 Spec Decode CI (#1656)
### What this PR does / why we need it?

To solve the error in the CI of long term test:

```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```

Replace the hf model with modelscope model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-09 15:53:58 +08:00
wangxiyuan b979ee353d
[Misc] Code clean up (#1679)
Make model_runner_v1 more readable

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 14:33:40 +08:00
wangxiyuan 392fd7239b
[Misc] Add attention mask (#1673)
Move the attention mask from V0 to a common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
wangxiyuan cc1588be50
[Misc] Code clean up (#1674)
Remove useless function
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:54:12 +08:00
wangxiyuan 830332ebfc
Clean up v0.9.1 code (#1672)
vLLM has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Icey 0d4bc03946
Fix wheel glibc version incompatibility (#1582)
### What this PR does / why we need it?
- Fixes https://github.com/vllm-project/vllm-ascend/issues/1533

### How was this patch tested?
1. Run the image
```
docker run \
    --name cann_container \
    --device /dev/davinci6 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -it  quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.11 bash
```

2. Install package
torch=2.5.1
torch-npu=2.5.1.post1.dev20250619 
vllm=0.9.1

vllm-ascend=vllm_ascend-0.1.dev1+g02ac443-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Artifact download URL:
https://github.com/vllm-project/vllm-ascend/actions/runs/16039661265/artifacts/3454481370

3. Test offline script

```
from vllm import LLM, SamplingParams

import os
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
]

llm = LLM(model="Qwen3/Qwen3-1.7B")

outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

4. Results

![result](https://github.com/user-attachments/assets/20f9d923-00ce-4a2d-8598-9b216045705d)

- vLLM version: v0.9.2
- vLLM main:
b942c094e3

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-08 18:46:02 +08:00
Yikun Jiang e4e9ea02ab
Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades the vLLM version to v0.9.2; it doesn't remove the
v0.9.1-compatible code, to ease review.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
NeverRaR 71de52d3a9
feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1. Sometimes loading the torchair cache fails because of fluctuations in NPU
memory, so this PR adds a new cache that saves the old KV cache bytes to
avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled and the cache does not exist yet, the first
compilation introduces the overhead of Dynamo guards. In this case, we
compile directly twice to skip them (this brings 3-4 ms of TPOT
optimization).

### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control the KV cache floating tolerance; a usage sketch follows below.
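
A hypothetical usage sketch (the value is an arbitrary example, not a
recommendation):

```python
import os

# Tolerate up to ~512 MB of KV-cache size drift when reusing the torchair cache.
os.environ["VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE"] = "512"
```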

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
NeverRaR df84cceca8
perf: use multicast to avoid padding decode request to prefill size (#1555)
### What this PR does / why we need it?
perf: use multicast to avoid padding decode request to prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
wm901115nwpu f08c4f15a2
fix spell error (#1654)
Fix the spell error in code

- vLLM version: v0.9.1
- vLLM main:
923147b5e8

Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
2025-07-07 20:24:42 +08:00
Mengqing Cao f2a20393a2
[CI] Fix mypy check in CI (#1655)
### What this PR does / why we need it?
Fix mypy check in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654

Mypy failed due to a newer numpy version. We need to pin
`numpy==1.26.4` in vllm-ascend.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 20:19:16 +08:00
Angazenn 18495f44b2
[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug caused by the max_num_tokens_across_dp calculation. In the
earlier version, we computed it as graph_pad_size plus the actual
max_num_tokens, which results in different max_num_tokens_across_dp values
across DP ranks. If padding is required, this can cause wrong padding; see the
sketch below.
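
For illustration only (not necessarily the PR's exact fix), a value that agrees
on every DP rank can be obtained by reducing with MAX over the DP process group:

```python
import torch
import torch.distributed as dist


def max_num_tokens_across_dp(local_num_tokens: int, dp_group) -> int:
    t = torch.tensor([local_num_tokens], dtype=torch.int64)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=dp_group)
    return int(t.item())
```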

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
Zheng Wengang 9c886d0a1f
[EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides DeepSeek-provided mechanism to balance expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
wangyanhui-cmss 4e29c5a808
Add ut for test_pooling_model_runner.py (#1640)
### What this PR does / why we need it?
 Add ut for test_pooling_model_runner.py

### Does this PR introduce _any_ user-facing change? N/A

### How was this patch tested?
 python -m unittest  test_pooling_model_runner.py


- vLLM version: v0.9.1
- vLLM main:
2e610deb72

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-07-07 17:12:11 +08:00
Yikun Jiang 493768eb30
Record vLLM commit in PR description (#1623)
### What this PR does / why we need it?
This patch enables recording the vLLM commit and also cleans up the unused
commit msg note in the PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- Test on https://github.com/Yikun/vllm-ascend/pull/33 and vllm commit
refreshed as expected.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-07 10:20:38 +08:00
Mengqing Cao 7efa4e92fe
[CI] Fix oom in chunk prefill (#1622)
### What this PR does / why we need it?
Add resource cleanup logic to fix the OOM issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 10:14:40 +08:00
ApsarasX c58accc15e
[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
zhangxinyuehfad 14373f65d7
[Test] Remove V0 accuracy test and enable MoE and VL test on V1 (#1574)
### What this PR does / why we need it?
Update accuracy test
1. remove accuracy report on V0
2. add parallel and execution mode
3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-06 11:10:19 +08:00
Yikun Jiang 0c1d239df4
Add unit test local cpu guide and enable base testcase (#1566)
### What this PR does / why we need it?
Use BaseTest and clean up all manual patch code:
- Cleanup EPLB config to avoid a tmp test file
- Use BaseTest with global cache
- Add license
- Add a doc for setting up unit tests in a local env

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 10:42:27 +08:00
Vincent Yuan eb390545ec
[Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333,
this PR disables the JIT compiler for the 310P and changes the data
format to NZ for the weights in the vocabulary embedding and QKV
projection layers, which helps improve performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao dd22ac38b2
[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly disabled because it does not work now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
wangxiyuan 343955c7ac
[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig

This PR does the same to fix the CI error.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00
zhangxinyuehfad 4e910186de
[CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
Angazenn a5f33590d3
[CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM Ascend only allowed DeepSeek to use torchair. Now we
can also use it with PanGu. Besides, we add a supported-model list to
control which types of models can use torchair.

### How was this patch tested?
We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn 9fbd8017c0
[Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This PR supports w8a8 on the 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
Yikun Jiang 6d7cb14a24
Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken e511ddd67d
[Bug] Fix wrong ModelScope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` assignment should be placed
before the module imports.

Otherwise:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
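
A minimal sketch of the corrected ordering (model name and task taken from the
traceback above):

```python
import os

# Must be set before any vllm import so the ModelScope backend is picked up.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM  # imported only after the env var is set

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```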

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
wangxiyuan a45dfde283
[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed the MoE config init way.
2.
48fb076cbc
changed the input batch logic.

This PR adapts vllm-ascend to these changes.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
yupeng d96da1f00c
[DOC] Fix word spelling (#1595)
### What this PR does / why we need it?
Fix word spelling in DOC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 21:42:39 +08:00
zhanghw0354 9fb3d558e5
[Test]Add unit test for platform.py (#1476)
### What this PR does / why we need it?
According to issue #1298, this pull request adds unit test code for
platform.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhuyilin <809721801@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>
2025-07-02 17:46:06 +08:00
Li Wang 30bf7014d0
[Bugfix] Add func `swap_states` to fix MLA attention (#1580)
### What this PR does / why we need it?
MLA attention was still using the gpu_input_batch attribute `swap_states`, which
led to the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`.

This PR fixes the MLA input batch error; a hypothetical sketch of such a helper
follows below.
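
A hypothetical sketch of what such a `swap_states` helper can look like (field
names are illustrative assumptions, not the actual InputBatch attributes):

```python
import numpy as np


class InputBatch:
    def __init__(self, max_reqs: int, max_len: int):
        self.token_ids_cpu = np.zeros((max_reqs, max_len), dtype=np.int64)
        self.num_tokens = np.zeros(max_reqs, dtype=np.int64)

    def swap_states(self, i: int, j: int) -> None:
        # Swap the per-request state stored in slots i and j.
        self.token_ids_cpu[[i, j]] = self.token_ids_cpu[[j, i]]
        self.num_tokens[[i, j]] = self.num_tokens[[j, i]]
```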
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Mengqing Cao 59237ea788
[CI/UT] Add test for chunk prefill and prefix cache on v1/AscendScheduler (#1505)
### What this PR does / why we need it?
Add test for chunked prefill and prefix cache on v1/AscendScheduler

Covered scenarios:
- `Qwen/Qwen3-0.6B-Base` and `deepseek-ai/DeepSeek-V2-Lite-Chat` ---
multicard CI time increased by 19 min
- `V1 + default scheduler` vs `V1 + default scheduler + enable prefix
cache`
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable prefix
cache` vs `V1 + Ascend scheduler + enable prefix cache + enable chunked
prefill`
- `Qwen/Qwen3-0.6B-Base` --- singlecard CI time increased by 8 min
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable chunked
prefill`

should rebase after #1498 and #1446
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:57:03 +08:00
Zhu Yi Lin 6b80c5acba
Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. drop some useless code for w8a8 fusedmoe
2. Add int8 kv cache check
3. Add more ut.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
Agonixiaoxiao 7fc1a98489
add ut for kv tansfer module (#1531)
### What this PR does / why we need it?
test kv data transfer contains connect,pipe,buffer

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: lixudong <lixudong@cmss.chinamobile.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:14:52 +08:00
Yikun Jiang aa5fa07478
Only enable single version for wheel pr build (#1571)
### What this PR does / why we need it?
Only enable a single version for the wheel PR build to speed up PR-triggered CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 14:50:34 +08:00
yupeng c3c8c9317c
[DOC] add LoRA user guide (#1265)
### What this PR does / why we need it?
Add a LoRA user guide to the docs. The content refers to [LoRA
Adapters](https://docs.vllm.ai/en/latest/features/lora.html).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 14:41:31 +08:00
Li Wang f39365d2ea
[Benchmark] Fix error msg upload in performance benchmark (#1559)
### What this PR does / why we need it?

Make sure that None parameters are not passed in for `--error`
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed locally

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 14:06:08 +08:00
wangxiyuan 641a4e6092
[CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vllm change
7f280d69c9
broke vllm-ascend.

This PR fixes the broken CI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone 0e43813120
[ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

Found CI broken by vLLM's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`.
Modified the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing CI will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Li Wang 6db7dc2c85
[Benchmark] Refactor perf script to use benchmark cli (#1524)
### What this PR does / why we need it?

Since the `vllm bench` CLI is now optimized enough for production use
(supports more datasets), we no longer need to copy vLLM code. With
vLLM installed, we can easily use the benchmark CLI.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-30 23:42:04 +08:00
leo-pony 53ec583bbb
[Docs] Update Atlas 300I series doc and fix CI lint (#1537)
### What this PR does / why we need it?
- Update Atlas 300I series doc: clean up unused parameters and enable
optimized ops
- Fix code spell CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 23:34:00 +08:00
wangxiyuan a054f0f4ca
[CI] change to new ds model (#1513)
Previously, the DeepSeek V3 pruning weights were not correct, so the MoE layer
was not tested. We updated to a new pruning model to enable MoE layer
compute.

This PR fixes the CI to use the new weights.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-30 19:02:29 +08:00
Shanshan Shen 8013634e9c
[Structured Output] Remove redundant check for `grammar_bitmask` (#1459)
### What this PR does / why we need it?
Remove redundant check since we have already checked this at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
Shanshan Shen ba577dfc52
[Doc] Add Structured Output guide (#1499)
### What this PR does / why we need it?
Add Structured Output guide.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:21:44 +08:00
whx f286265791
[BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498)
When using AscendScheduler with prefix cache enabled and chunked prefill
disabled, there is an accuracy problem because there is no branch in
mla_v1 to handle this scenario. This PR fixes it.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-30 16:51:20 +08:00
Li Wang 5f8241c25c
[V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add support for v1 pooling
tasks. Note that I moved `vllm.v1.worker.gpu_input_batch` down into
vllm-ascend: considering the frequent changes in upstream interfaces, it
was moved here to decouple from them.
### How was this patch tested?
CI passed with newly added/existing tests. A simple test was also
conducted locally, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown
below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
dependabot[bot] 790c810bf7
Bump actions/github-script from 6 to 7 (#1519)
Bumps [actions/github-script](https://github.com/actions/github-script)
from 6 to 7.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-30 16:04:41 +08:00
Yikun Jiang e4df0a4395
Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 13:37:22 +08:00
Yikun Jiang cad4c693c6
Add Pangu MoE Pro docs (#1512)
### What this PR does / why we need it?
This PR add Pangu MoE Pro 72B docs

[1] https://gitcode.com/ascend-tribe/pangu-pro-moe-model

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 12:15:33 +08:00
yiz-liu 75d05ee200
[Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446)
### What this PR does / why we need it?

This fixes the shape of block_table, which was broken by the hybrid KV
groups change several weeks ago.

An error is raised when prefix cache (eager or not) and Ascend
Scheduler are enabled at the same time; sending two identical requests
reproduces it.

v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test manually

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-30 11:25:19 +08:00
Zhu Yi Lin b308a7a258
support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
support pangu moe w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Angazenn c59d69d9e6
[PERF]support MERRouter (#1421)
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model.
Different from the original grouped topk, it selects only the top experts
that are allocated more tokens. Therefore, we can load fewer experts when
calculating gmm.
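A rough sketch of the idea (a minimal sketch assuming a `num_voted_experts` knob as described above; the PR's actual implementation may differ):

```python
import torch

def keep_top_voted_experts(topk_ids: torch.Tensor,
                           num_experts: int,
                           num_voted_experts: int) -> torch.Tensor:
    """Sketch: keep only the experts that received the most tokens.

    topk_ids: [num_tokens, top_k] expert indices produced by grouped topk.
    Returns a boolean mask of experts that still need to be loaded for gmm.
    """
    votes = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    kept = torch.topk(votes, k=num_voted_experts).indices
    expert_mask = torch.zeros(num_experts, dtype=torch.bool)
    expert_mask[kept] = True
    return expert_mask

mask = keep_top_voted_experts(torch.randint(0, 16, (8, 4)), num_experts=16, num_voted_experts=5)
print(mask.sum())  # 5 experts kept
```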

We have tested this algorithm for PanguProMoE-72B on the 300I Duo and
800I A2 platforms. On the 300I Duo platform, we find that setting
`num_voted_experts` to 5 achieves both good performance and accuracy,
while on 800I A2 we still set it to 8 to use the original Pangu grouped topk.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:14:49 +08:00
Angazenn 8fa188111d
[PERF]support H2P communication optimization for PanguProMoe (#1463)
### What this PR does / why we need it?
In this PR, we support H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
to replace `all_reduce` to improve performance:

original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm
--> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce
now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter -->
post_attention_layernorm --> all_rank all_gather --> moe/mlp -->
all_rank reduce_scatter

Besides, because `reduce_scatter` requires num_tokens to be divisible
by the group size, we need to pad the sequences based on
`max_tokens_across_dp`, as sketched below.
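A minimal sketch of the padding step (variable names are illustrative, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def pad_for_reduce_scatter(hidden_states: torch.Tensor,
                           max_tokens_across_dp: int,
                           group_size: int) -> torch.Tensor:
    """Pad the token dim so reduce_scatter receives a length divisible by group_size."""
    target = ((max_tokens_across_dp + group_size - 1) // group_size) * group_size
    pad_len = target - hidden_states.shape[0]
    if pad_len > 0:
        # Pad rows of zeros at the end of the token dimension.
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad_len))
    return hidden_states

x = pad_for_reduce_scatter(torch.randn(10, 4), max_tokens_across_dp=10, group_size=8)
print(x.shape)  # torch.Size([16, 4])
```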

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:10:27 +08:00
Angazenn 5c53cbaf2a
[BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug when using broadcast with cpu_group while running dp.
The `broadcast310p` patch takes effect for both the cpu_group and the
device group, but we only need it for the device group. Hence a wrapper is
added to let cpu_group use the native torch broadcast, which solves the
bug.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao 2cf9c4c3a2
[CI/Build] Fix version conflict on transformers (#1490)
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with new existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 15:11:04 +08:00
Mengqing Cao 5f4391652f
[PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy
on V1.

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Using lm_eval, which relies on prompt logprobs as output to test accuracy:
```python
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct`
on V1 are:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```
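For reference, a minimal offline sketch of requesting prompt logprobs through vLLM's `SamplingParams` (not taken from the PR's test code):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096)
# prompt_logprobs=1 requests the top-1 logprob for every prompt token.
params = SamplingParams(max_tokens=8, prompt_logprobs=1)
out = llm.generate(["The capital of France is"], params)[0]
print(out.prompt_logprobs)  # per-token logprob dicts; the first entry is None
```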

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
Shanshan Shen 99e685532d
[Doc] Add Qwen2.5-VL eager mode doc (#1394)
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-28 09:08:51 +08:00
Mengqing Cao d59e7fa095
[CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
Shanshan Shen 3687676fa7
[Doc] Add guidance on how to implement and register new models (#1426)
### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-27 16:46:49 +08:00
wangxiyuan 5571fb7118
[Misc] Add release checklist issue template (#1447)
Add the release checklist issue template.

Every release manager should create and follow the checklist to do the
release step by step.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:15:36 +08:00
wangxiyuan 5968dff4e0
[Build] Add build info (#1386)
Add a static build_info.py file to show SoC and sleep mode info. It helps
keep the code clean, and the error info will be friendlier for
users.

This PR also added the unit test for vllm_ascend/utils.py.

This PR also added the base test class for all UTs in tests/ut/base.py.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
Li Wang c563a08f0a
[CI] Fix nightly benchmark (#1453)
### What this PR does / why we need it?
Sometimes the performance benchmark workflow may fail. This adds a
notice when a run fails and avoids uploading the dirty data of the
failed run.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-26 19:39:18 +08:00
Zesheng Zong 192dbbcc6e
Optimize Patch developer guide (#1452)
### What this PR does / why we need it?
Fix some terms in the user guide.


Signed-off-by: zeshengzong <zesheng.zong@outlook.com>
2025-06-26 19:10:16 +08:00
wangyanhui-cmss e5eea64b66
[CI/UT] Add ut for parallel_state.py (#1460)
### What this PR does / why we need it?
 Add ut for parallel_state.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
 python -m unittest  test_parallel_state.py

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-26 19:03:27 +08:00
Shanshan Shen 4e2daf5ab7
[Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony 1025344912
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode.
Relate RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

 Signed-off-by: leo-pony <nengjunma@outlook.com>

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
sdmyzlp 53c2d58ae1
Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed in non-compiled mode regardless of
`torchair_graph_config.enabled`, causing multistream MLA to fail, since it
assumes torchair compiled mode for decode when
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this, as sketched below.
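A tiny sketch of the augmented condition (names follow the description above and are illustrative):

```python
def use_torchair_decode(torchair_graph_enabled: bool,
                        with_prefill_across_dp: bool) -> bool:
    """Sketch: only assume the torchair-compiled decode path when graph mode is
    enabled and the current step is not a prefill across DP ranks."""
    return torchair_graph_enabled and not with_prefill_across_dp
```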

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
yiz-liu 2690697caa
[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
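A simplified sketch of the fix (tensor names mirror the description above; the real change lives in `NPUModelRunner`):

```python
import torch

def reset_unused_positions(positions: torch.Tensor,
                           total_num_scheduled_tokens: int) -> torch.Tensor:
    """Zero out padded/stale entries so every index stays inside the attention mask."""
    positions[total_num_scheduled_tokens:] = 0
    return positions

positions = torch.tensor([3, 7, 12, 512, 511])  # last two are stale entries from a previous batch
print(reset_unused_positions(positions, total_num_scheduled_tokens=3))  # tensor([ 3,  7, 12,  0,  0])
```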

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
zhangxinyuehfad 06ccce1ddf
[FOLLOWUP] fix name and format in accuracy test (#1288) (#1435)
### What this PR does / why we need it?
Fix accuracy test:
1. Fix the accuracy report, e.g.:
https://vllm-ascend--1429.org.readthedocs.build/en/1429/developer_guide/evaluation/accuracy_report/Qwen2.5-7B-Instruct-V0.html
2. Fix creating the PR for the report

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-26 00:26:54 +08:00
Pr0Wh1teGivee 2fda60464c
[Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use the fused op torch_npu.npu_top_k_top_p(logits, p, k) when p and k are
not None, otherwise fall back to the original implementation. The replacement
takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires
torch_npu>=2.5.1.post1.dev20250619.
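A hedged sketch of the dispatch logic; only `torch_npu.npu_top_k_top_p(logits, p, k)` comes from the description above, and the fallback helper is a placeholder:

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, p, k) -> torch.Tensor:
    """Use the fused NPU op when both p and k are given, otherwise fall back."""
    if p is not None and k is not None:
        import torch_npu  # requires torch_npu >= 2.5.1.post1.dev20250619
        return torch_npu.npu_top_k_top_p(logits, p, k)
    return _apply_top_k_top_p_reference(logits, p, k)

def _apply_top_k_top_p_reference(logits, p, k):
    # Placeholder for the original (non-fused) top-k/top-p masking path.
    return logits
```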

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested by DeepSeek R1 and UT passed

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW e7efc7e7e7
[BugFix] Remove not using patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
sharonyunyun 941269a6c5
adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in the MLA layer's TP group, to remove
StridedSlice and all_gather in the MoE layer.
When tp > 1, it is enabled during the decode phase of graph mode
when enable_multistream_moe, MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR brings a
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
wangxiyuan 205cb85a1e
[Doc] Fix doc typo (#1424)
1. Fix the typo
2. Fix 404 URL
3. Update graph mode and additional config user guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
wangxiyuan ca884ef86d
[Misc] Clean up useless code for LLM initialization (#1373)
This PR aims to clean up the useless code for LLM setup. It helps to
make the code clearer.
1. remove useless `self.xxx` property
2. change `set_random_seed` to `seed_everything`
3. remove `set_custom_all_reduce`, it's only used for cuda

This is just a code clean. no change for any code logic.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
zhangxinyuehfad 0060886a37
[CI]Update accuracy report test (#1288)
### What this PR does / why we need it?
Update accuracy report test
1. Record commit hashes and GitHub links for both vllm and
vllm-ascend in accuracy reports
2. Add accuracy result verification checks to ensure output correctness
3. Create PR via forked repository workflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dense-accuracy-test:
https://github.com/vllm-project/vllm-ascend/actions/runs/15745619485
create pr via forked repository workflow:
https://github.com/zhangxinyuehfad/vllm-ascend/actions/runs/15747013719/job/44385134080
accuracy report pr:
https://github.com/vllm-project/vllm-ascend/pull/1292

Currently, the accuracy report used is old; this PR needs to be merged,
the test rerun, the new report updated, and then #1292 closed.


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-25 14:10:34 +08:00
Li Wang 15df8be937
[Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan e4e0b7af05
[Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer
2. Add the patch doc for developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao 52317f92cb
[DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp.

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Mengqing Cao c1c5d56255
[Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Li Wang 5f5800ba42
[Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
liziyu 6ed3f00427
[Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
remove unused environment variable VLLM_ENABLE_MC2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
Mengqing Cao 20767a043c
[CI/UT] Fix disaggregated prefill ci (#1313)
### What this PR does / why we need it?
Use eager mode to run disaggregated prefill ci

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-24 17:11:00 +08:00
wangxiyuan 9cbce423ce
[MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vLLM works with non-CUDA platforms now.
Remove this useless patch.

Which was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2)
vLLM upstream merged:
3e472d882a
(v0.8.0)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
lyj-jjj 5177bef87a
support fused_moe_allgather_ep (#1335)
### What this PR does / why we need it?
support fused_moe_allgather_ep

### How was this patch tested?
It was tested by UT.

Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
2025-06-23 22:03:38 +08:00
Yikun Jiang 917c6b71af
[TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- add system package installation
- Add doc for run doctests
- Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change schedule job from 4 ---> 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey 08cfc7cb4b
Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md for adding pip extra index of torch-npu

### How was this patch tested?
No need

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2 e1123172d1
[Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users re-install vllm-ascend with pip, the `build`
folder should be removed first.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
linfeng-yuan 15592c0d48
[bugfix] fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions (#1331)
### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving, and got a score of 83.33 on the AIME2024 dataset with the DP4TP4EP16
setting.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-23 09:52:27 +08:00
zxdukki f04c6763d8
[Bugfix] fix env variable in dbo (#1284)
### What this PR does / why we need it?
Fix the env variable in DBO to enable DBO for the DeepSeek-V3 model. Besides, we
have fixed a known issue in deepseek-dbo.


### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
It can be verified with pytest.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-23 09:07:57 +08:00
Shanshan Shen 21fb68a03a
[CI] Update guided decoding ut (#1312)
### What this PR does / why we need it?
Update guided decoding ut.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-23 09:06:20 +08:00
wemaster 339d6894f6
[CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why wasn't the ngram
problem found when PR1109 was merged? A: The newly introduced
problem only appears when tp>1, and the use cases on CI are all tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including eagle speculative UTs, which left CI
unable to cover the eagle function. I added
it (`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it. It was because vLLM's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed them in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Pleaplusone 7e6efbf2a9
update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests will guarantee the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken 4447e53d7a
[Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang a95afc011e
[CI] Enable merge trigger unit test and accuracy test schedule job (#1345)
### What this PR does / why we need it?
- Enable merge trigger unit test and accuracy test schedule job
- Pin lm-eval==0.4.8 to resolve Qwen3 8B accuracy
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 17:21:57 +08:00
Yikun Jiang 2e5f312530
Cleanup unused doc (#1352)
### What this PR does / why we need it?
Clean up unused doc for the MoGE model; we will add this back when the MoGE
model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang c30ddb8331
Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
Yikun Jiang 097e7149f7
[Platform] Add initial experimental support for Atlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes
the PRs below into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
Users can run vLLM on Atlas 300I Duo series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit test missing because a real 310P image is needed for the test;
will add it in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
Yikun Jiang 2009fdb8da
[Test] Enable code cov for V1 and enable push trigger (#1164)
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-21 00:01:05 +08:00
Angazenn 2f1266d451
Support Pangu Pro MoE model (#1204)
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)

### Does this PR introduce _any_ user-facing change?
Yes, new model supported

### How was this patch tested?
Test locally

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-20 23:59:59 +08:00
yuancaoyaoHW 00ae250f3c
[V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Proposer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- we test accept rate on one Ascend 910B npu, The acceptance rate
results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
wangxiyuan 45be1aac0c
[CI] Add codespell check for doc (#1314)
Add a codespell check test for doc-only PRs

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions 761bd3d9d7
Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
yiz-liu 2c7dd85fd8
[Fix] Fix the token-wise padding mechanism (#1300)
### What this PR does / why we need it?
Fix the token-wise padding mechanism.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-20 14:46:17 +08:00
wangxiyuan b350edae9a
[UT] refactor test_expert_load_balancer and fix broken CI (#1293)
Refactor test_expert_load_balancer to keep the UT code style.

This PR also fixed the breaking change from
https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280

This is just a quick fix. We'll support embedding on V1 later.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1299

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 01:02:52 +08:00
songshanhu07 ebb2a70dbb
static EPLB fix bug, add unit test (#1186)
### What this PR does / why we need it?
1. Add static EPLB unit test
2. Fix bug: a Tensor cannot be directly evaluated in an if statement (see the sketch below)
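A small illustration of the fixed pattern (generic PyTorch behavior, not the PR's exact code):

```python
import torch

mask = torch.tensor([True, False, True])

# Buggy: a multi-element Tensor cannot be evaluated directly in an if statement;
# `if mask:` raises "Boolean value of Tensor with more than one element is ambiguous".

# Fixed: reduce the tensor to a single boolean first.
if mask.any():
    print("at least one entry is set")
```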
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Run the unit test.

---------

Signed-off-by: songshanhu07 <1763685535@qq.com>
2025-06-18 19:46:56 +08:00
Shanshan Shen 2cd8ecdc4f
[Bugfix][Spec Decode] Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode (#1258)
### What this PR does / why we need it?

Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.

Find more details at **mengwei805**'s comment in
https://github.com/vllm-project/vllm-ascend/pull/1123.

### Does this PR introduce _any_ user-facing change?

The user will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE`
(`ACL_OP_INIT_MODE`).

### How was this patch tested?

Test scripts:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Results:

```
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 76.70it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s]
Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James'
```

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-18 17:50:20 +08:00
zzzzwwjj db2f630aeb
[bugfix] fix deepseek with mc2 (#1268)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-18 00:58:38 +08:00
whx d7e19ed57a
[BugFix] fix length of sin/cos cache in rope (#1266)
This PR fixes the bug of constructing a sin/cos cache shorter than the model's
max position embedding length.
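A minimal sketch of sizing the cache to the model's maximum position (standard rotary-embedding math; names are illustrative):

```python
import torch

def build_sin_cos_cache(max_position_embeddings: int, rotary_dim: int, base: float = 10000.0):
    """Build cos/sin caches that cover every position up to max_position_embeddings."""
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(max_position_embeddings).float()
    freqs = torch.outer(positions, inv_freq)  # [max_pos, rotary_dim // 2]
    return freqs.cos(), freqs.sin()

cos, sin = build_sin_cos_cache(max_position_embeddings=4096, rotary_dim=64)
print(cos.shape)  # torch.Size([4096, 32])
```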

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-17 23:14:25 +08:00
Jade Zheng afc8edb046
[Bugfix]: Pass scaling args to mc2 (#1202)
Pass `expert_scale` and `expand_scale` args to the dispatch and combine
functions.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-06-17 22:16:44 +08:00
Li Wang f8029945c3
[Bugfix] Remove cuda related lines and add additional pip mirror (#1252)
### What this PR does / why we need it?
- For the NPU environment, we should use `PYTORCH_NPU_ALLOC_CONF` rather
than `PYTORCH_CUDA_ALLOC_CONF`
- Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy


---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-17 21:25:40 +08:00
zzzzwwjj 23ca68d0c8
[refactor] Refactoring AscendFusedMoE (#1229)
### What this PR does / why we need it?
This PR resolves [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147):
1. Move fused_moe code into one file, `fused_moe.py`.
2. Integrate branch conditions into the function `get_fused_moe_state` (see the sketch below).
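A hedged sketch of what such a selector might look like (the state names and conditions are assumptions, not the PR's actual branches):

```python
from enum import Enum

class FusedMoEState(Enum):
    ALL_GATHER = "all_gather"
    ALL2ALL = "all2all"
    MC2 = "mc2"

def get_fused_moe_state(ep_size: int, with_prefill: bool) -> FusedMoEState:
    """Centralize the branch conditions into a single MoE communication state."""
    if ep_size == 1:
        return FusedMoEState.ALL_GATHER
    if with_prefill:
        return FusedMoEState.ALL2ALL
    return FusedMoEState.MC2
```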

### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`. I think this env is useless:
we can make the judgment based on the current scenario without it, and it
only adds complexity.
2. This PR removes the env `USING_LCCL_COM`, because this env is already
obsolete.
3. `additional_config.expert_tensor_parallel_size` is also obsolete; we now
use the parameter `enable_expert_parallel`, consistent with vLLM (see the
sketch after this list).
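
A sketch of the replacement configuration, assuming the offline `LLM` entrypoint forwards `enable_expert_parallel` to vLLM's engine args; the model name and parallel sizes are illustrative.

```python
from vllm import LLM

# Expert parallelism is requested via the standard vLLM argument instead of the
# removed VLLM_ENABLE_MC2 / USING_LCCL_COM envs or expert_tensor_parallel_size.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model
    tensor_parallel_size=4,
    enable_expert_parallel=True,
)
```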

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-17 17:49:03 +08:00
Yikun Jiang 05dec7eda9
[Doc] Refactor and init user story page (#1224)
### What this PR does / why we need it?
This PR refactors the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview locally

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 09:36:35 +08:00
Yikun Jiang 9d3cbc0953
[Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Install doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Co-authored-by: wangli <wangli858794774@gmail.com>

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao 96fa7ff63b
[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix the rank set in the DP scenario. The new POC version of torch-npu
supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
zhuo97 f5404dc650
Fix the device error when using ray as vllm-ascend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
wangxiyuan 69b817ed65
[CI] Add unit test framework (#1201)
This PR added the unit test framework to enable unit tests for vLLM Ascend.
Unit tests run on CPU machines and will run once the lint check passes, the
same as the e2e tests.

For unit tests, this PR created a new folder called `ut` under the `tests`
module. The files in `ut` should mirror the code layout in `vllm-ascend`,
and file names should start with the `test_` prefix. For example, in this
PR, `test_ascend_config.py` is added to test `ascend_config.py`.

A new file `worker/test_worker_v1.py` is also added as a placeholder.
This file should hold the unit tests for `vllm-ascend/worker/worker_v1.py`.

Additionally, a new `fake_weight` folder is added; it contains the
config.json from `facebook/opt-125m`, so that the tests do not always
need to visit Hugging Face.

TODO:
We should add the remaining unit test files one by one in the future.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-16 18:32:28 +08:00
Yikun Jiang 966557a2a3
[Build] Speedup image build (#1216)
### What this PR does / why we need it?
1. Rename the workflow to show OS info
2. Speed up the image build:
- PR: only the arm64 build runs on openEuler arm64, and only the amd64
build runs on Ubuntu amd64
- Push/Tag: keep the original logic, using QEMU on amd64

This PR actually drops the per-PR e2e image build, but I think that's fine
considering it's stable enough; if we still hit problems we can revert
this PR.

43-44 mins ---> about 8-10 mins

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-16 09:02:53 +08:00
Yikun Jiang 4ce860a2be
[CI] Make e2e test to be preemptible and simple (#1217)
### What this PR does / why we need it?
This PR makes the e2e test simple. It brings some repeated code between the
single-card and multi-card jobs, but we no longer struggle with
max-parallel, matrix and concurrency:
1. Make the e2e test preemptible and simple:
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- Any new push to a PR cancels the previous job, whether that job
is lint / e2e / multi-card
2. Use ModelScope rather than hf-mirror
3. Resolve errors like `Canceling since a higher priority waiting
request for pr-XXXX-limit-npu-4 exists`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
- lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel)
- e2e test will be canceled when the patch is updated

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-15 22:07:43 +08:00
ttanzhiqiang 4270682383
Waiting for BMM NZ support (improve TPOT by 2ms) (#1131)
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because this position will be
fused into TransposeBatchMatMul, which does not support NZ. The weights
are therefore converted back to ND on each run.
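
For illustration only, a hedged sketch of an NZ/ND format round-trip with torch_npu; it assumes `torch_npu.npu_format_cast` with ACL format ids 2 (ND) and 29 (FRACTAL_NZ), and is not the fused TransposeBatchMatMul path described above.

```python
import torch
import torch_npu  # registers the NPU backend

ACL_FORMAT_ND = 2           # assumed ACL format id for ND
ACL_FORMAT_FRACTAL_NZ = 29  # assumed ACL format id for FRACTAL_NZ

w = torch.randn(128, 512, dtype=torch.float16).npu()
w_nz = torch_npu.npu_format_cast(w, ACL_FORMAT_FRACTAL_NZ)  # pre-convert weight to NZ
# If the consuming fused op only accepts ND, the weight gets cast back on every
# run, which is exactly the per-step overhead discussed above.
w_nd = torch_npu.npu_format_cast(w_nz, ACL_FORMAT_ND)
```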

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, i.e. TPOT improves by about 2ms

### How was this patch tested?
use #1101

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-15 19:57:02 +08:00
22dimensions 0d2074a1ec
[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226)
os.environ["VLLM_USE_V1"] must be assigned with str, not other type.


![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-15 15:41:11 +08:00
fems14 ab5d110fcc
vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend now supports chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00
Mengqing Cao a3b5af8307
[CI/UT][Graph] Add ut for torchair graph mode (#1103)
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with the newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-06-14 16:59:00 +08:00
Yikun Jiang 94a52cf577
Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203)
### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with about 30+ solid, high-quality
reviews in the P/D disaggregation and DeepSeek improvement areas, such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason why I nominated
him: helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the important
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience with AI infra. He has very good habits, explains all changes in great
detail (#issue-3023082580), and shares results openly
(#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community involvement: actively involved in community discussions, he is
collaborative, helps users solve problems, and has taken part in 30+ PRs and
issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-13 18:25:50 +08:00
whx 47b507b180
[CI] Recover ut for ascend scheduler only in ci of v1. (#1180)
The previous PR [#943](https://github.com/vllm-project/vllm-ascend/pull/943)
wrongly enabled the AscendScheduler unit tests in the V0 CI; this PR fixes
that and runs them only in the V1 CI.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-13 07:51:23 +08:00
sdmyzlp e72f94e38f
Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we instead keep them in
the main stream and only require that they be computed before `matmul W_UQ`
to avoid hindering later overlapping. The problem may be solved by a later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.
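
An illustrative two-stream sketch of the idea (not the torchair graph implementation), assuming torch_npu mirrors the CUDA stream API via `torch.npu.Stream` / `torch.npu.stream`; the shapes and the manual RMSNorm stand in for the real MLA ops.

```python
import torch
import torch_npu  # assumed to provide the torch.npu stream API

main = torch.npu.current_stream()
vector_stream = torch.npu.Stream()  # secondary stream for vector ops

hidden = torch.randn(16, 4096, dtype=torch.float16).npu()
w_dq = torch.randn(4096, 1536, dtype=torch.float16).npu()
w_uq = torch.randn(1536, 4096, dtype=torch.float16).npu()

q_lowrank = torch.matmul(hidden, w_dq)        # matmul W_DQ stays on the main stream

with torch.npu.stream(vector_stream):
    vector_stream.wait_stream(main)           # wait for the producer matmul
    # q_rmsnorm runs on the secondary stream, overlapping with main-stream matmuls
    q_norm = q_lowrank * torch.rsqrt(q_lowrank.pow(2).mean(-1, keepdim=True) + 1e-6)

other = torch.matmul(hidden, w_dq)            # main-stream work overlapping the norm

main.wait_stream(vector_stream)               # re-sync before consuming q_norm
q = torch.matmul(q_norm, w_uq)
```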

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
Wan_Danfeng 55c0e68883
[Doc] Add Referer header for CANN package download url. (#1192)
### What this PR does / why we need it?
Fix the CANN package download URL.

### Does this PR introduce _any_ user-facing change?
No, this does not have any user-facing change.

### How was this patch tested?
Ran the **wget** command; the CANN package was downloaded correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-06-12 21:22:23 +08:00
wangyanhui-cmss c6e2a5fb40
[fix] fix bug in 1p1d disaggregated_prefill example (#1184)
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running `python find_device_ips.py` and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-12 19:40:58 +08:00
Li Wang 37f4469a03
[CI][Benchmark] Add qwen2.5-7b test (#1104)
### What this PR does / why we need it?
- Add a qwen2.5-7b performance benchmark; this is a sub-PR of #1099. The
v1 test needs more verification.
- Fix getting the commit time after checkout

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:47:30 +08:00
Li Wang dd207cb261
[CI][Benchmark] Add new model and v1 test to perf benchmarks (#1099)
### What this PR does / why we need it?
- Add qwen2.5-7b-instruct test
- Add v1 test
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-12 10:46:41 +08:00
ttanzhiqiang 2498d297ae
add custom ascendc kernel vocabparallelembedding (#796)
This PR adds custom AscendC kernel vocabparallelembedding support in
vllm-ascend; the related CMakeLists and setuptools changes are also included in this PR.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-12 10:44:33 +08:00
whx 3393d53b36
[Scheduler][MTP] Add support for speculative decoding in AscendScheduler. (#943)
This PR adds support for speculative decoding in AscendScheduler.
It also includes part of the support for disaggregated prefill; full support
will be merged in a follow-up PR.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-11 20:55:44 +08:00
wangxiyuan 4f5964420e
[CI] Upgrade vllm to 0.9.1 (#1165)
1. Upgrade vLLM to 0.9.1; 0.9.0 is no longer supported on the main branch.
Keep the docs on 0.9.0 until we publish the first 0.9.1 release.
2. Disable V0 tests for PRs.
3. Move the actionlint check to the lint job.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
chenwaner e46dc142bf
Enable kvcache_nz for the decode process in torchair graph mode (#1098)
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True` (see the sketch below).
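
A hedged sketch of how the switch might be passed through; only the `torchair_graph_config.enable_kv_nz` key comes from this PR, while the entrypoint, the `additional_config` kwarg, the `enabled` prerequisite key and the model name are assumptions.

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,       # assumed prerequisite: torchair graph mode on
            "enable_kv_nz": True,  # NZ layout for the KV cache during decode
        },
    },
)
```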

How was this patch tested?
1. Tested on the DeepSeek model: with batch size 64 and seq_len 1k+3k, the
total FA time across 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang, and the curl result is normal:

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588

---------

Signed-off-by: chenwaner <861645847@qq.com>
2025-06-11 14:09:28 +08:00
yz 4153a5091b
[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159)
Fix the doc typo in graph_mode.md

Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
2025-06-11 11:03:37 +08:00
ttanzhiqiang 980cd81466
etp best a2 (#1101)
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1 with attention
tp8/dp2 and MoE using ETP.

Relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- https://github.com/vllm-project/vllm-ascend/pull/910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100
- [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
depeng1994 860a5ef7fd
provide an e2e guide for execute duration profiling (#1113)
### What this PR does / why we need it?
Provide an end-to-end guide for execution duration profiling.


Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-11 10:02:11 +08:00
sdmyzlp 7bdc606677
Support multistream of shared experts in FusedMoE (#997)
Contains #1111 for completeness.

### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where the computation of the shared experts is overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, the weights of the
shared experts are forced to be replicated across all cards, regardless
of any tensor parallelism configuration, to avoid AllReduce operations.

With the expected overlapping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```


### Does this PR introduce _any_ user-facing change?
No.


### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
Mengqing Cao 04abfd8721
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163)
[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make
longterm CI pass

Related: https://github.com/vllm-project/vllm-ascend/issues/1162

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-11 07:31:13 +08:00
22dimensions 8b48daaa44
[CI] rename Qwen2.5-0.5B-Instruct-W8A8 model (#1145)
1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to
vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-11 06:18:32 +08:00
274 changed files with 21139 additions and 14280 deletions

View File

@ -15,17 +15,16 @@
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.10
FROM quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py${PY_VERSION}
FROM quay.io/ascend/manylinux:8.0.0-910b-manylinux_2_28-py${PY_VERSION}
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
WORKDIR /workspace
@ -41,8 +40,6 @@ RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
cd vllm-ascend && \
python3 setup.py bdist_wheel && \
ls -l dist && \
for f in dist/*.whl; do mv "$f" "$(echo "$f" | sed -e 's/-linux_x86_64\.whl$/-manylinux1_x86_64.whl/' -e 's/-linux_aarch64\.whl$/-manylinux2014_aarch64.whl/')"; done && \
ls -l dist
ls -l dist
CMD ["/bin/bash"]

View File

@ -1,5 +1,5 @@
name: 📚 User Story
description: Apply for an user story to be displayed on https://vllm-ascend.readthedocs.org/user_stories/index.html
description: Apply for an user story to be displayed on https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html
title: "[User Story]: "
labels: ["user-story"]

View File

@ -0,0 +1,100 @@
name: Release Checklist
description: Generate a release checklist issue when prepare a new release.(Used for release team)
title: "[Release]: Release checklist for v"
body:
- type: textarea
attributes:
description: >
Brief info for the new release.
label: Release Checklist
value: >
**Release Version**:
**Release Branch**:
**Release Date**:
**Release Manager**:
- type: textarea
attributes:
description: >
Release notes.
label: Prepare Release Note
value: >
- [ ] Create a new issue for release feedback
- [ ] Write the release note PR.
- [ ] Update the feedback issue link in docs/source/faqs.md
- [ ] Add release note to docs/source/user_guide/release_notes.md
- [ ] Update version info in docs/source/community/versioning_policy.md
- [ ] Update contributor info in docs/source/community/contributors.md
- [ ] Update package version in docs/conf.py
- type: textarea
attributes:
description: >
Make sure the code is merged.
label: PR need Merge
value: >
- [ ] PR link1
- [ ] PR link2
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the new Feature/Function is tested
label: Functional Test
value: >
- [ ] Feature1
- [ ] Bug1
- [ ] ...
- type: textarea
attributes:
description: >
Make sure the doc is updated.
label: Doc Test
value: >
- [ ] Tutorial is updated.
- [ ] User Guide is updated.
- [ ] Developer Guide is updated.
- type: textarea
attributes:
description: >
Make sure the artifacts is ready
label: Prepare Artifacts
value: >
- [ ] Docker image is ready.
- [ ] Wheel package is ready.
- type: textarea
attributes:
description: >
Start to release.
label: Release Step
value: >
- [ ] Release note PR is merged.
- [ ] Post the release on GitHub release page.
- [ ] Generate official doc page on https://app.readthedocs.org/dashboard/
- [ ] Wait for the wheel package to be available on https://pypi.org/project/vllm-ascend
- [ ] Wait for the docker image to be available on https://quay.io/ascend/vllm-ascend
- [ ] Upload 310p wheel to Github release page
- [ ] Broadcast the release news (By message, blog , etc)
- [ ] Close this issue

59
.github/format_pr_body.sh vendored Executable file
View File

@ -0,0 +1,59 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh
#!/bin/bash
set -eux
# ensure 3 arguments are passed
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <pr_number> <vllm_version> <vllm_commit>"
exit 1
fi
PR_NUMBER=$1
VLLM_VERSION=$2
VLLM_COMMIT=$3
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
FINAL=/tmp/final_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove notes in pr description and add vLLM version and commit
sed -i '/<!--/,/-->/d' "${NEW}"
sed -i '/- vLLM .*$/d' "${NEW}"
{
echo ""
echo "- vLLM version: $VLLM_VERSION"
echo "- vLLM main: $VLLM_COMMIT"
} >> "${NEW}"
# Remove redundant empty lines
uniq "${NEW}" > "${FINAL}"
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${FINAL}"; then
echo
echo "Updating PR body:"
echo
cat "${NEW}"
gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
else
echo "No changes needed"
fi

View File

@ -1,202 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: Accuracy Report
on:
workflow_dispatch:
inputs:
vllm-ascend-branch:
description: 'vllm-ascend branch:'
required: true
type: choice
options:
- main
- v0.7.3-dev
models:
description: 'models:'
required: true
type: choice
options:
- all
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-8B-Base
default: 'all'
jobs:
download_reports:
runs-on: ubuntu-latest
strategy:
matrix:
model: ${{ fromJSON(
(github.event.inputs.models == 'all' &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-7B-Instruct' &&
'["Qwen/Qwen2.5-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
'["Qwen/Qwen3-8B-Base"]')
) }}
version: [0, 1]
exclude:
- model: 'Qwen/Qwen2.5-VL-7B-Instruct'
version: 1
fail-fast: false
name: Download ${{ matrix.model }} V${{ matrix.version }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.vllm-ascend-branch }}
- name: Get base model name
id: get_basename
run: |
model_base_name=$(basename "${{ matrix.model }}")
echo "model_base_name=$model_base_name" >> $GITHUB_OUTPUT
shell: bash
- name: Query artifact run id
id: get_run_id
run: |
ARTIFACT_PATTERN="${{ github.event.inputs.vllm-ascend-branch }}-${{ steps.get_basename.outputs.model_base_name }}-V${{ matrix.version }}-report"
echo "Querying artifacts with pattern: $ARTIFACT_PATTERN"
ARTIFACT_JSON=$(gh api --paginate /repos/${{ github.repository }}/actions/artifacts || echo "{}")
RUN_ID=$(echo "$ARTIFACT_JSON" | \
jq -s -r --arg pattern "$ARTIFACT_PATTERN" \
'[.[].artifacts[]] | map(select(.name | test($pattern))) | sort_by(.created_at) | last | .workflow_run.id // empty')
if [ -z "$RUN_ID" ]; then
echo "::warning::No artifact found matching pattern $ARTIFACT_PATTERN. Skipping download."
echo "runid=" >> $GITHUB_OUTPUT
else
echo "Found matching artifact with run ID: $RUN_ID"
echo "runid=$RUN_ID" >> $GITHUB_OUTPUT
fi
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Download Artifact
if: ${{ steps.get_run_id.outputs.runid != '' }}
uses: actions/download-artifact@v4
with:
name: ${{ github.event.inputs.vllm-ascend-branch }}-${{ steps.get_basename.outputs.model_base_name }}-V${{ matrix.version }}-report
path: ./docs/source/developer_guide/evaluation/accuracy_report_bak
github-token: ${{ secrets.GITHUB_TOKEN }}
repository: ${{ github.repository }}
run-id: ${{ steps.get_run_id.outputs.runid }}
- name: Upload reports artifact
if: ${{ steps.get_run_id.outputs.runid != '' }}
uses: actions/upload-artifact@v4
with:
name: report-${{ steps.get_basename.outputs.model_base_name }}-v${{ matrix.version }}
path: ./docs/source/developer_guide/evaluation/accuracy_report_bak/*.md
retention-days: 90
create_pr:
runs-on: ubuntu-latest
needs: download_reports
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.vllm-ascend-branch }}
- name: Setup workspace
run: mkdir -p ./accuracy/accuracy_report
- name: Download only current run reports
uses: actions/download-artifact@v4
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Generate step summary
if: ${{ always() }}
run: |
for report in ./docs/source/developer_guide/evaluation/accuracy_report/*.md; do
filename=$(basename "$report")
# skip index.md
if [ "$filename" = "index.md" ]; then
continue
fi
if [ -f "$report" ]; then
{
echo -e "\n\n---\n"
echo "## 📄 Report File: $(basename $report)"
cat "$report"
} >> "$GITHUB_STEP_SUMMARY"
fi
done
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo "::: {toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: Create Pull Request
uses: peter-evans/create-pull-request@v7
with:
token: ${{ secrets.PR_TOKEN }}
base: ${{ github.event.inputs.vllm-ascend-branch }}
branch: auto-pr/accuracy-report
commit-message: "Update accuracy reports for ${{ github.event.inputs.vllm-ascend-branch }}"
add-paths: ./docs/source/developer_guide/evaluation/accuracy_report/*.md
title: "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-branch }}"
body: |
The accuracy results running on NPU Altlas A2 have changed, updating reports for:
${{
github.event.inputs.models == 'all'
&& 'All models (Qwen2.5-7B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)'
|| github.event.inputs.models
}}
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}

View File

@ -22,6 +22,9 @@
name: Benchmarks / accuracy
on:
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
pull_request:
types: [ labeled ]
workflow_dispatch:
@ -34,8 +37,8 @@ on:
# Current supported vLLM versions
options:
- main
- v0.9.0.1
- v0.9.0
- v0.9.2
- v0.9.1
- v0.7.3
vllm-ascend-version:
description: 'vllm-ascend version:'
@ -43,6 +46,7 @@ on:
type: choice
options:
- main
- v0.9.1-dev
- v0.7.3-dev
models:
description: 'model:'
@ -50,9 +54,9 @@ on:
type: choice
options:
- all
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-8B-Base
- Qwen/Qwen3-30B-A3B
default: 'all'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
@ -74,56 +78,56 @@ jobs:
${{
(contains(github.event.pull_request.labels.*.name, 'accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test')) &&
contains(github.event.pull_request.labels.*.name, 'ready-for-test') ||
github.event_name == 'workflow_dispatch'
github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
}}
runs-on: >-
${{
(matrix.model_name == 'Qwen/Qwen2.5-VL-7B-Instruct' && 'linux-arm64-npu-4') ||
(matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-arm64-npu-4') ||
'linux-arm64-npu-2'
}}
strategy:
matrix:
vllm_use_version: [0, 1]
# the accuracy test will run:
# 1. workflow_dispatch with models input
# - all: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - specified but not all: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# - specified but not all: Qwen/Qwen3-30B-A3B, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-8B-Base
# 2. PR labeled with "*-accuracy-test"
# - accuracy-test: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct
# - dense-accuracy-test: Qwen/Qwen2.5-7B-Instruct
# - accuracy-test: Qwen/Qwen3-8B-Base, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen3-30B-A3B
# - dense-accuracy-test: Qwen/Qwen3-8B-Base
# - vl-accuracy-test: Qwen/Qwen2.5-VL-7B-Instruct
# - moe-accuracy-test: Qwen/Qwen3-30B-A3B
model_name: ${{ fromJSON(
(github.event_name == 'schedule' &&
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'all' &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-7B-Instruct' &&
'["Qwen/Qwen2.5-7B-Instruct"]') ||
'["Qwen/Qwen3-30B-A3B","Qwen/Qwen2.5-VL-7B-Instruct","Qwen/Qwen3-8B-Base"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-30B-A3B' &&
'["Qwen/Qwen3-30B-A3B"]') ||
(github.event.inputs.models == 'Qwen/Qwen2.5-VL-7B-Instruct' &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]') ||
(github.event.inputs.models == 'Qwen/Qwen3-8B-Base' &&
'["Qwen/Qwen3-8B-Base"]') ||
contains(github.event.pull_request.labels.*.name, 'accuracy-test') &&
'["Qwen/Qwen2.5-7B-Instruct","Qwen/Qwen2.5-VL-7B-Instruct"]' ||
'["Qwen/Qwen3-8B-Base","Qwen/Qwen2.5-VL-7B-Instruct", "Qwen/Qwen3-30B-A3B"]' ||
contains(github.event.pull_request.labels.*.name, 'dense-accuracy-test') &&
'["Qwen/Qwen2.5-7B-Instruct"]' ||
'["Qwen/Qwen3-8B-Base"]' ||
contains(github.event.pull_request.labels.*.name, 'vl-accuracy-test') &&
'["Qwen/Qwen2.5-VL-7B-Instruct"]'
'["Qwen/Qwen2.5-VL-7B-Instruct"]' ||
contains(github.event.pull_request.labels.*.name, 'moe-accuracy-test') &&
'["Qwen/Qwen3-30B-A3B"]'
) }}
# Remove exclude after https://github.com/vllm-project/vllm-ascend/issues/1044 resolved
exclude:
- model_name: Qwen/Qwen2.5-VL-7B-Instruct
vllm_use_version: 1
fail-fast: false
name: ${{ matrix.model_name }} accuracy V${{ matrix.vllm_use_version }}
name: ${{ matrix.model_name }} accuracy
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
DATASET_SOURCE: ModelScope
VLLM_USE_MODELSCOPE: True
USE_MODELSCOPE_HUB: 1
# 1. If version specified (work_dispatch), do specified branch accuracy test
# 2. If no version (labeled PR), do accuracy test by default ref:
# The branch, tag or SHA to checkout. When checking out the repository that
@ -142,11 +146,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Install system dependencies
run: |
@ -159,12 +163,29 @@ jobs:
repository: vllm-project/vllm
path: ./vllm-empty
# Please also update this when bump matched version
ref: ${{ github.event.inputs.vllm-version || 'v0.9.0' }}
ref: ${{ github.event.inputs.vllm-version || 'v0.9.2' }}
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: VLLM_TARGET_DEVICE=empty pip install -e .
- name: Resolve vllm-ascend version
run: |
VERSION_INPUT="${{ github.event.inputs.vllm-ascend-version }}"
if [[ "$VERSION_INPUT" == "main" ]]; then
TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
LATEST_TAG=$(echo "$TAGS" | head -n1)
if [[ -z "$LATEST_TAG" ]]; then
RESOLVED_VERSION="main"
else
RESOLVED_VERSION="$LATEST_TAG"
fi
else
RESOLVED_VERSION="$VERSION_INPUT"
fi
echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
with:
@ -174,13 +195,32 @@ jobs:
- name: Install vllm-project/vllm-ascend
working-directory: ./vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -e .
pip install -v -e .
- name: Get vLLM commit hash and URL
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
- name: Get vLLM-Ascend commit hash and URL
working-directory: ./vllm-ascend
run: |
VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
- name: Print resolved hashes
run: |
echo "vLLM : ${{ env.VLLM_COMMIT }}"
echo "vLLM-Ascend: ${{ env.VLLM_ASCEND_COMMIT }}"
- name: Install lm-eval, ray, and datasets
run: |
pip install lm-eval
pip install lm-eval==0.4.8
- name: Collect version info
run: |
@ -201,7 +241,6 @@ jobs:
pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
echo "GHA_VLLM_ASCEND_VERSION=${{ github.event.inputs.vllm-ascend-version || github.ref }}"
} >> "$GITHUB_ENV"
- name: Print versions
@ -212,15 +251,14 @@ jobs:
echo "vLLM: ${{ env.GHA_VLLM_VERSION }}"
echo "vLLM Ascend: ${{ env.GHA_VLLM_ASCEND_VERSION }}"
- name: Run Accuracy Test for V${{ matrix.vllm_use_version }}
- name: Run Accuracy Test
id: report
working-directory: ./benchmarks
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
VLLM_USE_V1: ${{ matrix.vllm_use_version }}
run: |
model_base_name=$(basename ${{ matrix.model_name }})
markdown_name="${model_base_name}-V${{ matrix.vllm_use_version }}"
markdown_name="${model_base_name}"
echo "markdown_name=$markdown_name"
echo "markdown_name=$markdown_name" >> $GITHUB_OUTPUT
mkdir -p ./accuracy
@ -232,7 +270,9 @@ jobs:
--cann_version "${{ env.GHA_CANN_VERSION }}" \
--torch_npu_version "${{ env.GHA_TORCH_NPU_VERSION }}" \
--torch_version "${{ env.GHA_TORCH_VERSION }}" \
--vllm_version "${{ env.GHA_VLLM_VERSION }}"
--vllm_version "${{ env.GHA_VLLM_VERSION }}" \
--vllm_commit "${{ env.VLLM_COMMIT }}" \
--vllm_ascend_commit "${{ env.VLLM_ASCEND_COMMIT }}" \
- name: Generate step summary
if: ${{ always() }}
@ -244,12 +284,122 @@ jobs:
SAFE_VLLM_ASCEND_VERSION="${GHA_VLLM_ASCEND_VERSION//\//-}"
echo "SAFE_VLLM_ASCEND_VERSION=$SAFE_VLLM_ASCEND_VERSION" >> "$GITHUB_ENV"
- name: Upload Report for V${{ matrix.vllm_use_version }}
if: ${{ github.event_name == 'workflow_dispatch' }}
- name: Check report first line for failure
id: check_report
run: |
REPORT_PATH="./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md"
echo "Scanning $REPORT_PATH for ❌ …"
if grep -q '❌' "$REPORT_PATH"; then
echo "contains_fail=true" >> $GITHUB_OUTPUT
else
echo "contains_fail=false" >> $GITHUB_OUTPUT
fi
- name: Upload Report
if: ${{ github.event_name == 'workflow_dispatch' && steps.check_report.outputs.contains_fail == 'false' }}
uses: actions/upload-artifact@v4
with:
name: "${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}-report"
name: "report-${{ env.SAFE_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}"
path: ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md
if-no-files-found: warn
retention-days: 90
overwrite: true
create_pr:
runs-on: ubuntu-latest
needs: accuracy_tests
if: ${{ github.event_name == 'workflow_dispatch' }}
env:
UPSTREAM_REPO: vllm-project/vllm-ascend
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: vllm-ascend-ci/vllm-ascend
token: ${{ secrets.PAT_TOKEN }}
ref: main
- name: Add upstream remote
run: |
git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
git fetch upstream
git remote -v
- name: Set Git user info dynamically
run: |
git config user.name "${{ github.actor }}"
git config user.email "${{ github.actor }}@users.noreply.github.com"
- name: Create or switch to branch
run: |
TIMESTAMP=$(date +%Y%m%d%H%M%S)
BRANCH_NAME="auto-pr/accuracy-report-${TIMESTAMP}"
echo "BRANCH_NAME=${BRANCH_NAME}" >> $GITHUB_ENV
git checkout -B "${BRANCH_NAME}" upstream/${{ github.event.inputs.vllm-ascend-version }}
- name: Download only current run reports
uses: actions/download-artifact@v4
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo ":::{toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: push accuracy report
env:
GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
run: |
git add ./docs/source/developer_guide/evaluation/accuracy_report/*.md
git commit -s -m "[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}"
git push -f origin "${{ env.BRANCH_NAME }}"
- name: Create PR in upstream via API
uses: actions/github-script@v7
with:
github-token: ${{ secrets.PAT_TOKEN }}
script: |
const pr = await github.rest.pulls.create({
owner: 'vllm-project',
repo: 'vllm-ascend',
head: `vllm-ascend-ci:${{ env.BRANCH_NAME }}`,
base: '${{ github.event.inputs.vllm-ascend-version }}',
title: `[Doc] Update accuracy reports for ${{ github.event.inputs.vllm-ascend-version }}`,
body: `The accuracy results running on NPU Altlas A2 have changed, updating reports for:
${{
github.event.inputs.models == 'all'
&& 'All models (Qwen/Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base)'
|| github.event.inputs.models
}}
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`
});
core.info(`Created PR #${pr.data.number}`);

View File

@ -1,53 +0,0 @@
#
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Adapted from vllm-project/vllm/blob/main/.github
#
name: Lint GitHub Actions workflows
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Run actionlint"
env:
SHELLCHECK_OPTS: --exclude=SC2046,SC2006,SC2086
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color

63
.github/workflows/format_pr_body.yaml vendored Normal file
View File

@ -0,0 +1,63 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: format / pr body
on:
# The PR updated when PR opened and push new commits
pull_request_target:
types: [opened, synchronize]
branches:
- 'main'
permissions:
pull-requests: write
jobs:
update-description:
name: update vLLM version
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Get vLLM version
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse HEAD)
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV
- name: Checkout repository
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python
uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
- name: Get vLLM release version
run: |
VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
echo "VLLM_VERSION=$VLLM_VERSION" >> $GITHUB_ENV
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
bash .github/format_pr_body.sh "${{ github.event.number }}" "${{ env.VLLM_VERSION }}" "${{ env.VLLM_COMMIT }}"

View File

@ -0,0 +1,117 @@
name: 'image / openEuler / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p-openeuler / vllm-ascend:*-dev-310p-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p-openeuler / vllm-ascend:v1.2.3rc1-310p-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-310p-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p-openeuler
# - pre/post/dev: v0.7.1rc1-310p-openeuler/v0.7.1rc1-310p-openeuler/v0.7.1rc1.dev1-310p-openeuler/v0.7.1.post1-310p-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p-openeuler
type=ref,event=pr,suffix=-310p-openeuler
type=pep440,pattern={{raw}},suffix=-310p-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.310p.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_310p_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p / vllm-ascend:*-dev-310p
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p / vllm-ascend:v1.2.3rc1-310p
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p
# - pre/post/dev: v0.7.1rc1-310p/v0.7.1rc1-310p/v0.7.1rc1.dev1-310p/v0.7.1.post1-310p, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p
type=ref,event=pr,suffix=-310p
type=pep440,pattern={{raw}},suffix=-310p
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.310p
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

117
.github/workflows/image_a3_openeuler.yml vendored Normal file
View File

@ -0,0 +1,117 @@
name: 'image / openEuler / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3-openeuler / vllm-ascend:v1.2.3rc1-a3-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3-openeuler
# - pre/post/dev: v0.7.1rc1-a3-openeuler/v0.7.1rc1-a3-openeuler/v0.7.1rc1.dev1-a3-openeuler/v0.7.1.post1-a3-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3-openeuler
type=ref,event=pr,suffix=-a3-openeuler
type=pep440,pattern={{raw}},suffix=-a3-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.a3.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

113
.github/workflows/image_a3_ubuntu.yml vendored Normal file
View File

@ -0,0 +1,113 @@
name: 'image / Ubuntu / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3|vllm-ascend:v1.2.3rc1-a3
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
jobs:
build:
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job pulish per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3 is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3
# - pre/post/dev: v0.7.1rc1-a3/v0.7.1rc1-a3/v0.7.1rc1.dev1-a3/v0.7.1.post1-a3, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3
type=ref,event=pr,suffix=-a3
type=pep440,pattern={{raw}},suffix=-a3
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.a3
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

.github/workflows/image_openeuler.yml

@ -1,4 +1,4 @@
name: 'image'
name: 'image / openEuler'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
@ -6,10 +6,9 @@ name: 'image'
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - commits are merge into main/*-dev ==> vllm-ascend:main-openeuler / vllm-ascend:*-dev-openeuler
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler|latest / vllm-ascend:v1.2.3rc1-openeuler
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler / vllm-ascend:v1.2.3rc1-openeuler
on:
pull_request:
branches:
@ -19,6 +18,12 @@ on:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
@ -33,9 +38,13 @@ on:
jobs:
build:
name: vllm-ascend openEuler image
runs-on: ubuntu-latest
name: vllm-ascend image build
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
steps:
- uses: actions/checkout@v4
@ -55,7 +64,7 @@ jobs:
# 1. branch job publishes per main/*-dev branch commit
# 2. main and dev pull_request is build only, so the tag pr-N-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-openeuler, latest
# - v0.7.1 --> v0.7.1-openeuler
# - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
@ -63,6 +72,8 @@ jobs:
type=ref,event=branch,suffix=-openeuler
type=ref,event=pr,suffix=-openeuler
type=pep440,pattern={{raw}},suffix=-openeuler
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -84,10 +95,15 @@ jobs:
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: linux/amd64,linux/arm64
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
@ -97,3 +113,4 @@ jobs:
file: Dockerfile.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

.github/workflows/image_ubuntu.yml

@ -1,4 +1,4 @@
name: 'image'
name: 'image / Ubuntu'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
@ -9,7 +9,7 @@ name: 'image'
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3|latest / vllm-ascend:v1.2.3rc1
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
on:
pull_request:
branches:
@ -19,6 +19,12 @@ on:
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
@ -33,7 +39,7 @@ on:
jobs:
build:
name: vllm-ascend Ubuntu image
name: vllm-ascend image build
runs-on: ubuntu-latest
steps:
@ -63,6 +69,8 @@ jobs:
type=ref,event=branch
type=ref,event=pr
type=pep440,pattern={{raw}}
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
@ -84,15 +92,22 @@ jobs:
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: linux/amd64,linux/arm64
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@ -20,9 +20,10 @@ name: 'Benchmarks / Performance'
on:
schedule:
# Run at 02:00 everyday
- cron: '00 18 * * *'
# Run benchmarks at 20:00 and 03:00 Beijing time (UTC+8)
- cron: "0 12 * * *"
- cron: "0 19 * * *"
workflow_dispatch:
# Allow manual triggering of the workflow
@ -45,13 +46,15 @@ jobs:
test:
if: ${{ contains(github.event.pull_request.labels.*.name, 'performance-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}, use_v1=${{ matrix.vllm_use_v1 }}
runs-on: 'linux-arm64-npu-static-8'
strategy:
matrix:
include:
- vllm_branch: v0.9.0
- vllm_branch: v0.9.2
vllm_ascend_branch: main
vllm_use_v1: 1
max-parallel: 1
container:
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
volumes:
@ -67,10 +70,10 @@ jobs:
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_USE_MODELSCOPE: True
ES_OM_DOMAIN: ${{ secrets.ES_OM_DOMAIN }}
ES_OM_AUTHORIZATION: ${{ secrets.ES_OM_AUTHORIZATION }}
VLLM_USE_V1: ${{ matrix.vllm_use_v1 }}
steps:
- name: Check npu and CANN info
run: |
@ -79,6 +82,8 @@ jobs:
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
- name: Install system dependencies
@ -109,7 +114,10 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install "transformers<=4.52.4"
pip install -e .
pip install -r benchmarks/requirements-bench.txt
@ -140,8 +148,8 @@ jobs:
- name: Install elastic_tool
if: github.event_name != 'pull_request'
run: |
pip install escli-tool==0.2.1
pip install escli-tool==0.2.3
- name: Collect pr info from vllm-project/vllm-ascend
if: github.event_name != 'pull_request'
run: |
@ -159,17 +167,19 @@ jobs:
cp -r benchmarks/* /github/home/benchmarks/
- name: Run benchmark iteration
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
if: github.event_name != 'pull_request'
run: |
while IFS= read -r line || [[ -n "$line" ]]; do
commit_id=${line%% *}
commit_title=${line#* }
commit_time=$(git show -s --format=%cd $commit_hash --date=iso-strict)
commit_time_no_tz=${commit_time::19}
git checkout $commit_id
commit_time=$(git show -s --format=%cd $commit_hash --date=iso-strict)
commit_time_no_tz=${commit_time::19}
pip install -e .
echo "------------------------"
echo "commit_id: $commit_id"
echo "commit_title: $commit_title"
@ -177,17 +187,21 @@ jobs:
echo "vllm branch: ${{ matrix.vllm_branch }}"
echo "vllm-ascend branch: ${{ matrix.vllm_ascend_branch }}"
echo "------------------------"
cd /github/home
bash benchmarks/scripts/run-performance-benchmarks.sh
# send the result to es
if [[ "${{ github.event_name }}" != "pull request" ]]; then
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results
rm -rf ./benchmarks/results
ERROR_MSG=""
if ! bash benchmarks/scripts/run-performance-benchmarks.sh; then
ERROR_MSG="Benchmark failed to run"
fi
# send the result to es
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results \
--error "$ERROR_MSG" \
--extra_feat '{"VLLM_USE_V1": "${{ matrix.vllm_use_v1 }}"}'
rm -rf ./benchmarks/results
cd -
done < commit_log.txt

.github/workflows/pre-commit.yml

@ -0,0 +1,37 @@
name: pre-commit
on:
workflow_call:
permissions:
contents: read
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: "3.10"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
- name: Install vllm
working-directory: vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
- name: Install vllm-ascend dev
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
env:
SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
with:
extra_args: --all-files --hook-stage manual

.github/workflows/release_code.yml

@ -32,20 +32,8 @@ on:
- 'CMakeLists.txt'
- 'csrc/**'
push:
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/release_code.yml'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
jobs:
build:

.github/workflows/release_whl.yml

@ -18,6 +18,9 @@
name: build / wheel
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
branches:
- 'main'
@ -33,21 +36,8 @@ on:
- 'CMakeLists.txt'
- 'csrc/**'
push:
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/release_whl.yml'
- '.github/Dockerfile.buildwheel'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
jobs:
build:
@ -55,7 +45,11 @@ jobs:
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
python-version: ['3.9', '3.10', '3.11']
# PR only trigger latest version
python-version: ${{ fromJSON(
(github.event_name == 'pull_request' && '["3.11"]') ||
'["3.9", "3.10", "3.11"]'
) }}
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -71,22 +65,51 @@ jobs:
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel:v1 .
docker run --rm \
-u $(id -u):$(id -g) \
-v $(pwd):/outpwd \
wheel:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Set up Python ${{ matrix.python-version }}
if: startsWith(github.ref, 'refs/tags/')
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')

.github/workflows/shellcheck.yml

@ -1,49 +0,0 @@
#
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Adapted from vllm-project/vllm/blob/main/.github
#
name: Lint shell scripts
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Check shell scripts"
run: |
tools/shellcheck.sh


@ -30,8 +30,8 @@ on:
- 'tests/e2e/common.sh'
- 'tests/e2e/run_doctests.sh'
schedule:
# Runs every 4 hours
- cron: '0 */4 * * *'
# Runs every 12 hours
- cron: '0 */12 * * *'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -46,7 +46,7 @@ jobs:
# Each version should be tested
fail-fast: false
matrix:
vllm_verison: [main, v0.7.3-dev, main-openeuler, v0.7.3-dev-openeuler]
vllm_verison: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
name: vLLM Ascend test
runs-on: linux-arm64-npu-1
container:
@ -65,34 +65,19 @@ jobs:
cd /vllm-workspace/vllm
git --no-pager log -1 || true
- name: Config OS mirrors - Ubuntu
if: ${{ !endsWith(matrix.vllm_verison, '-openeuler') }}
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y
apt install git curl -y
- name: Config OS mirrors - openEuler
if: ${{ endsWith(matrix.vllm_verison, '-openeuler') }}
run: |
yum update -y
yum install git curl -y
- name: Config pip mirrors
run: |
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Run vllm-ascend/tests/e2e/run_doctests.sh
run: |
# PWD: /__w/vllm-ascend/vllm-ascend
# Address old branch like v0.7.3:
if [ ! -d /vllm-workspace/vllm-ascend/tests/e2e ]; then
echo "Warning: the doctest path doesn't exists, copy now"
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
fi
# Make sure e2e tests are latest
echo "Replacing /vllm-workspace/vllm-ascend/tests/e2e ..."
rm -rf /vllm-workspace/vllm-ascend/tests/e2e
mkdir -p /vllm-workspace/vllm-ascend/tests
# Overwrite e2e and examples
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
cp -r examples /vllm-workspace/vllm-ascend/
# Simulate container to enter directory
cd /workspace

.github/workflows/vllm_ascend_test.yaml

@ -18,21 +18,13 @@
name: 'test'
on:
schedule:
- cron: '0 23 * * *'
push:
branches:
- 'main'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '*.txt'
- '**/*.py'
- '.github/workflows/vllm_ascend_test.yaml'
- '!docs/**'
- 'pytest.ini'
- '!benchmarks/**'
- 'tools/mypy.sh'
- 'mypy.ini'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@ -41,89 +33,119 @@ defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint:
uses: ./.github/workflows/pre-commit.yml
changes:
runs-on: ubuntu-latest
outputs:
e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
e2e_tracker:
- '.github/workflows/vllm_ascend_test.yaml'
- 'vllm_ascend/**'
- 'csrc/**'
- 'cmake/**'
- 'tests/e2e/**'
- 'CMakeLists.txt'
- 'setup.py'
- 'requirements.txt'
- 'requirements-dev.txt'
- 'requirements-lint.txt'
- 'packages.txt'
ut_tracker:
- 'tests/ut/**'
ut:
needs: [lint, changes]
name: unit test
# only trigger unit test after lint passed and the change is e2e and ut related.
if: ${{ needs.lint.result == 'success' && (needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.ut_tracker == 'true') }}
runs-on: ubuntu-latest
container:
image: quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
strategy:
matrix:
python-version: ["3.10"]
vllm_version: [main, v0.9.2]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
- name: Install packages
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Run codespell check
run: |
CODESPELL_EXCLUDES=('--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**')
CODESPELL_IGNORE_WORDS=('-L' 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn')
codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}" "${CODESPELL_IGNORE_WORDS[@]}"
- name: Analysing the code with ruff
run: |
echo "::add-matcher::.github/workflows/matchers/ruff.json"
ruff check --output-format github .
- name: Run isort
run: |
isort . --check-only
- name: Running yapf
run: |
python -m pip install --upgrade pip
pip install toml
pip install yapf==0.32.0
yapf --diff --recursive .
- name: Install dependencies
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: vllm-empty
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: vllm-empty
working-directory: ./vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
VLLM_TARGET_DEVICE=empty python3 -m pip install . --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip uninstall -y triton
- name: Mypy Check
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install vllm-project/vllm-ascend
run: |
echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1 ${{ matrix.python-version }}
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
python3 -m pip install -r requirements-dev.txt --extra-index https://download.pytorch.org/whl/cpu/
python3 -m pip install -v . --extra-index https://download.pytorch.org/whl/cpu/
- name: Run unit test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
TORCH_DEVICE_BACKEND_AUTOLOAD: 0
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut
- name: Upload coverage to Codecov
if: ${{ matrix.vllm_version == 'main' }}
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
flags: unittests
name: vllm-ascend
verbose: true
e2e:
needs: [lint]
if: ${{ needs.lint.result == 'success' }}
needs: [lint, changes]
# only trigger e2e test after lint passed and the change is e2e related with pull request.
if: ${{ github.event_name == 'pull_request' && needs.lint.result == 'success' && needs.changes.outputs.e2e_tracker == 'true' }}
strategy:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1, linux-arm64-npu-4]
vllm_version: [main, v0.9.0]
concurrency:
group: >
${{
matrix.os == 'linux-arm64-npu-4'
&& github.event.pull_request.number
&& format('pr-{0}-limit-npu-4', github.event.pull_request.number)
|| format('job-{0}-{1}-{2}', matrix.os, matrix.vllm_version, github.event.pull_request.number)
}}
cancel-in-progress: false
name: vLLM Ascend test
os: [linux-arm64-npu-1]
vllm_version: [main, v0.9.2]
name: singlecard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -132,11 +154,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
@ -159,64 +181,106 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test for V1 Engine
- name: Run e2e test
env:
VLLM_USE_V1: 1
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
VLLM_USE_MODELSCOPE=True pytest -sv tests/singlecard/test_offline_inference.py
pytest -sv tests/singlecard/test_scheduler.py
# guided decoding doesn't work, fix it later
# pytest -sv tests/singlecard/test_guided_decoding.py.py
# test_ascend_config.py should be ran separately because it will regenerate the global config many times.
pytest -sv tests/singlecard/test_ascend_config.py
pytest -sv tests/singlecard/test_camem.py
pytest -sv tests/singlecard/ \
--ignore=tests/singlecard/test_offline_inference.py \
--ignore=tests/singlecard/test_scheduler.py \
--ignore=tests/singlecard/test_guided_decoding.py \
--ignore=tests/singlecard/test_ascend_config.py \
--ignore=tests/singlecard/test_camem.py
else
pytest -sv tests/multicard/test_ilama_lora_tp2.py
# To avoid oom, we need to run the test in a single process.
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
fi
pytest -sv tests/e2e/singlecard/test_offline_inference.py
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
pytest -sv tests/e2e/singlecard/test_camem.py
pytest -sv tests/e2e/singlecard/test_embedding.py
pytest -sv tests/e2e/singlecard/ \
--ignore=tests/e2e/singlecard/test_offline_inference.py \
--ignore=tests/e2e/singlecard/test_ilama_lora.py \
--ignore=tests/e2e/singlecard/test_guided_decoding.py \
--ignore=tests/e2e/singlecard/test_camem.py \
--ignore=tests/e2e/singlecard/test_embedding.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py \
--ignore=tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
# ------------------------------------ v1 spec decode test ------------------------------------ #
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
# TODO: revert me when test_v1_spec_decode.py::test_ngram_correctness is fixed
VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
- name: Run vllm-project/vllm-ascend test on V0 engine
env:
VLLM_USE_V1: 0
e2e-4-cards:
needs: [e2e]
if: ${{ needs.e2e.result == 'success' }}
strategy:
max-parallel: 1
matrix:
os: [linux-arm64-npu-4]
vllm_version: [main, v0.9.2]
name: multicard e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
VLLM_USE_MODELSCOPE=True pytest -sv tests/singlecard/test_offline_inference.py
pytest -sv tests/singlecard/test_scheduler.py
# guided decoding doesn't work, fix it later
# pytest -sv tests/singlecard/test_guided_decoding.py.py
pytest -sv tests/singlecard/test_camem.py
# test_ascend_config.py should be ran separately because it will regenerate the global config many times.
pytest -sv tests/singlecard/test_ascend_config.py
pytest -sv tests/singlecard/test_prompt_embedding.py
pytest -sv tests/singlecard/ \
--ignore=tests/singlecard/test_offline_inference.py \
--ignore=tests/singlecard/test_scheduler.py \
--ignore=tests/singlecard/test_guided_decoding.py \
--ignore=tests/singlecard/test_camem.py \
--ignore=tests/singlecard/test_ascend_config.py \
--ignore=tests/singlecard/test_prompt_embedding.py
else
pytest -sv tests/multicard/test_ilama_lora_tp2.py
# Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py will raise error.
# To avoid oom, we need to run the test in a single process.
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
fi
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
# Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py will raise error.
# To avoid oom, we need to run the test in a single process.
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W8A8
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_dbo
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeekV3_dbo
pytest -sv tests/e2e/multicard/test_data_parallel.py
pytest -sv tests/e2e/multicard/ --ignore=tests/e2e/multicard/test_ilama_lora_tp2.py \
--ignore=tests/e2e/multicard/test_offline_inference_distributed.py \
--ignore=tests/e2e/multicard/test_data_parallel.py


@ -43,16 +43,15 @@ jobs:
max-parallel: 2
matrix:
os: [linux-arm64-npu-1, linux-arm64-npu-4]
vllm_version: [main, v0.9.0]
vllm_version: [main, v0.9.2]
name: vLLM Ascend long term test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -61,11 +60,11 @@ jobs:
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
@ -88,6 +87,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
@ -95,12 +96,8 @@ jobs:
- name: Run vllm-project/vllm-ascend long term test
run: |
if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
# spec decode test
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_v1_spec_decode.py
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/spec_decode/e2e/test_mtp_correctness.py # it needs a clean process
pytest -sv tests/long_term/spec_decode --ignore=tests/long_term/spec_decode/e2e/test_mtp_correctness.py --ignore=tests/long_term/spec_decode/e2e/test_v1_spec_decode.py --ignore=tests/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
pytest -sv tests/long_term/test_accuracy.py
pytest -sv tests/e2e/long_term/accuracy/accuracy_singlecard.py
else
VLLM_USE_MODELSCOPE=True pytest -sv tests/long_term/test_deepseek_v2_lite_tp2_accuracy.py
# accuracy test multi card
pytest -sv tests/e2e/long_term/accuracy/accuracy_multicard.py
fi


@ -41,7 +41,11 @@ jobs:
if: ${{ contains(github.event.pull_request.labels.*.name, 'pd-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
strategy:
matrix:
vllm_verison: [main, v0.9.0]
vllm_verison: [
# revert me when V1 disaggregation prefill is merged in main
# main,
v0.9.1
]
name: vLLM Ascend prefilling decoding disaggregation test
runs-on: linux-arm64-npu-static-8
@ -60,8 +64,7 @@ jobs:
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
HF_ENDPOINT: https://hf-mirror.com
HF_TOKEN: ${{ secrets.HF_TOKEN }}
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
@ -70,6 +73,7 @@ jobs:
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
@ -97,6 +101,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .

.gitignore

@ -196,3 +196,9 @@ kernel_meta/
# version file generated by setuptools-scm
/vllm_ascend/_version.py
# build info file generated by setup.py
/vllm_ascend/_build_info.py
/vllm_ascend/include/
# generated by CANN
fusion_result.json

.pre-commit-config.yaml

@ -0,0 +1,141 @@
default_install_hook_types:
- pre-commit
- commit-msg
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'examples/.*' # Exclude examples from all hooks by default
repos:
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
- id: codespell
args: [
--toml, pyproject.toml,
'--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
'-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn'
]
additional_dependencies:
- tomli
- repo: https://github.com/google/yapf
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
# Keep the same list from yapfignore here to avoid yapf failing without any inputs
exclude: '(.github|benchmarks|examples|docs)/.*'
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.7
hooks:
- id: ruff
args: [--output-format, github, --fix]
- id: ruff-format
files: ^(benchmarks|examples)/.*
- repo: https://github.com/crate-ci/typos
rev: v1.32.0
hooks:
- id: typos
- repo: https://github.com/PyCQA/isort
rev: 6.0.1
hooks:
- id: isort
# - repo: https://github.com/pre-commit/mirrors-clang-format
# rev: v20.1.3
# hooks:
# - id: clang-format
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
# - repo: https://github.com/jackdewinter/pymarkdown
# rev: v0.9.29
# hooks:
# - id: pymarkdown
# args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
- id: actionlint
- repo: local
hooks:
# For local development, you can run mypy using tools/mypy.sh script if needed.
# - id: mypy-local
# name: Run mypy for local Python installation
# entry: tools/mypy.sh 0 "local"
# language: system
# types: [python]
# stages: [pre-commit] # Don't run in CI
- id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.9
entry: tools/mypy.sh 1 "3.9"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.10
entry: tools/mypy.sh 1 "3.10"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.11
entry: tools/mypy.sh 1 "3.11"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
- id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.12
entry: tools/mypy.sh 1 "3.12"
# Use system python because vllm installation is required
language: system
types: [python]
stages: [manual] # Only run in CI
# FIXME: enable shellcheck
# - id: shellcheck
# name: Lint shell scripts
# entry: tools/shellcheck.sh
# language: script
# types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" "$(git rev-parse --git-path COMMIT_EDITMSG)"; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> "$(git rev-parse --git-path COMMIT_EDITMSG)"
fi
language: system
verbose: true
stages: [commit-msg]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
args:
- -c
- 'git ls-files | grep " " && echo "Filenames should not contain spaces!" && exit 1 || exit 0'
language: system
always_run: true
pass_filenames: false
- id: enforce-import-regex-instead-of-re
name: Enforce import regex as re
entry: python tools/enforce_regex_import.py
language: python
types: [python]
pass_filenames: false
additional_dependencies: [regex]
# Keep `suggestion` last
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
# Insert new entries above the `suggestion` entry

CMakeLists.txt

@ -96,5 +96,3 @@ target_link_libraries(
target_link_options(vllm_ascend_C PRIVATE "-Wl,-rpath,$ORIGIN:$ORIGIN/lib")
install(TARGETS vllm_ascend_C vllm_ascend_kernels DESTINATION ${VLLM_ASCEND_INSTALL_PATH})

CONTRIBUTING.md

@ -0,0 +1,3 @@
# Contributing to vLLM Ascend
You may find information about contributing to vLLM Ascend on [Developer Guide - Contributing](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html), including a step-by-step guide to help you set up a development environment, contribute your first PR, and test locally.

Dockerfile

@ -37,7 +37,7 @@ RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.0
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
@ -46,7 +46,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \

Dockerfile.310p

@ -0,0 +1,61 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.310p.openEuler

@ -0,0 +1,58 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-310p-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
export SOC_VERSION=ASCEND310P3 && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3

@ -0,0 +1,60 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-ubuntu22.04-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN apt-get update -y && \
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.a3.openEuler

@ -0,0 +1,57 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/cann:8.1.rc1-a3-openeuler22.03-py3.10
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG COMPILE_CUSTOM_KERNELS=1
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
RUN pip config set global.index-url ${PIP_INDEX_URL}
WORKDIR /workspace
COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
# Install vllm-ascend
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip cache purge
# Install modelscope (for fast download) and ray (for multinode)
RUN python3 -m pip install modelscope ray && \
python3 -m pip cache purge
CMD ["/bin/bash"]

Dockerfile.openEuler

@ -34,7 +34,7 @@ COPY . /vllm-workspace/vllm-ascend/
# Install vLLM
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
ARG VLLM_TAG=v0.9.0
ARG VLLM_TAG=v0.9.2
RUN git clone --depth 1 $VLLM_REPO --branch $VLLM_TAG /vllm-workspace/vllm
# In x86, triton will be installed by vllm. But in Ascend, triton doesn't work correctly. we need to uninstall it.
@ -43,7 +43,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ -
python3 -m pip cache purge
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \

README.md

@ -19,6 +19,10 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl//TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded, thanks for all contributors.
- [2025/05] We've released first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] vLLM community officially created [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
@ -38,15 +42,20 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
- Software:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (the same version as vllm-ascend)
## Getting Started
Please refer to [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
Please use the following recommended versions to get started quickly:
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1|Latest release candidate|[QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1|Latest stable version|[QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
## Contributing
See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/main/developer_guide/contributing.html) for more details, which is a step-by-step guide to help you set up development environment, build and test.
See [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) for more details, which is a step-by-step guide to help you set up development environment, build and test.
We welcome and value any contributions and collaborations:
- Please let us know if you encounter a bug by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues)
@ -65,9 +74,10 @@ Below is maintained branches:
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.x branch |
| v0.7.1-dev | Unmaintained | Only doc fixed is allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version; only bug fixes are allowed and no new release tags will be published. |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/main/developer_guide/versioning_policy.html) for more details.
Please refer to [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Weekly Meeting

README.zh.md

@ -20,6 +20,9 @@ vLLM Ascend Plugin
---
*Latest News* 🔥
- [2025/06] The [User stories](https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html) page is now live! It showcases cases such as LLaMA-Factory/verl/TRL/GPUStack and demonstrates how vLLM Ascend helps Ascend users improve their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] The [Contributors](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html) page is now live! Every contribution deserves to be recorded; thanks to all contributors.
- [2025/05] We released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community on a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/CGDuMoB301Uytnrkc2oyjg) with the vLLM team! The slides are available [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/02] The vLLM community officially created the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) repo so that vLLM runs seamlessly on Ascend NPU.
- [2024/12] We are working with the vLLM community to support [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162).
@ -39,15 +42,20 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
- Software:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
* vLLM (same version as vllm-ascend)
## Getting Started
See the [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and the [Installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details.
The following recommended versions will get you started quickly:
| Version | Release type | Doc |
|------------|--------------|--------------------------------------|
|v0.9.2rc1| Latest release candidate |See the [QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
|v0.7.3.post1| Latest stable version |See the [QuickStart](https://vllm-ascend.readthedocs.io/en/stable/quick_start.html) and [Installation guide](https://vllm-ascend.readthedocs.io/en/stable/installation.html) for more details|
## Contributing
See the [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/main/developer_guide/contributing.html) document for more information on setting up a development environment, testing features, and PR submission conventions.
See the [CONTRIBUTING](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/contribution/index.html) document for more information on setting up a development environment, testing features, and PR submission conventions.
We welcome and value any form of contribution and collaboration:
- Please let us know about any bugs you encounter by [filing an issue](https://github.com/vllm-project/vllm-ascend/issues).
@ -65,9 +73,10 @@ vllm-ascend has a main branch and dev branches.
|------------|------------|---------------------|
| main | Maintained | CI commitment for vLLM main branch |
| v0.7.1-dev | Unmaintained | Only doc fixes are allowed |
| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM v0.7.3 version; only bug fixes are allowed and no new release tags will be published |
| v0.9.1-dev | Maintained | CI commitment for vLLM v0.9.1 version |
Please refer to the [Versioning policy](https://vllm-ascend.readthedocs.io/en/main/developer_guide/versioning_policy.html) for more details.
Please refer to the [Versioning policy](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html) for more details.
## Weekly Meeting

View File

@ -1,5 +1,5 @@
# Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see the [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
@ -7,21 +7,21 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
- Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99). A short sketch of how these percentile metrics are computed follows this list.
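For illustration, the percentile metrics above can be derived from raw per-request measurements as in the following standalone sketch (the latency values are made up; this is not part of the benchmark scripts):
```python
import numpy as np

# Hypothetical per-request TTFT measurements in milliseconds.
ttft_ms = np.array([31.2, 28.7, 45.9, 30.1, 120.4, 29.8, 33.0, 27.5])

print("mean  :", ttft_ms.mean())
print("median:", np.median(ttft_ms))
print("p99   :", np.percentile(ttft_ms, 99))
```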
**Benchmarking Duration**: about 800 seconds per model.
@ -38,20 +38,129 @@ Before running the benchmarks, ensure the following:
pip install -r benchmarks/requirements-bench.txt
```
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it constructs random weights for the given model without downloading them from the internet, which can greatly reduce the benchmark time. Feel free to add your own models and parameters in the JSON to run your customized benchmarks.
- For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it constructs random weights for the given model without downloading them from the internet, which can greatly reduce the benchmark time.
- If you want to run customized benchmarks, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests). Let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
{
"test_name": "serving_qwen2_5vl_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"trust_remote_code": "",
"max_model_len": 16384
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
}
]
```
This JSON is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A minimal parsing sketch is shown after the list below.
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- disable_log_stats: disables logging of performance statistics.
- disable_log_requests: disables logging of individual requests.
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Dataset Source: Hugging Face (hf)
- Dataset Split: train
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Number of Prompts: 200 (the total number of prompts used during the test)
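For illustration only, here is a minimal sketch of how such a test entry could be rendered into server and client flags. The file name `serving-tests.json` and the flag-conversion rule are assumptions made for this example, not the actual implementation used by the benchmark script:
```python
import json


def json_to_args(params: dict) -> str:
    """Render a parameter dict as CLI-style flags; empty values become bare switches."""
    parts = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        parts.append(flag if value == "" else f"{flag} {value}")
    return " ".join(parts)


# Hypothetical path; point this at the JSON file you want to inspect.
with open("benchmarks/tests/serving-tests.json") as f:
    tests = json.load(f)

for test in tests:
    print(test["test_name"], "qps:", test["qps_list"])
    print("  server:", json_to_args(test["server_parameters"]))
    print("  client:", json_to_args(test["client_parameters"]))
```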
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:
```
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
```
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_sharegpt_qps_1.json
|-- serving_llama8B_tp1_sharegpt_qps_16.json
|-- serving_llama8B_tp1_sharegpt_qps_4.json
|-- serving_llama8B_tp1_sharegpt_qps_inf.json
|-- throughput_llama8B_tp1.json
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
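As a quick post-processing sketch (assuming the result layout above; the field names `request_throughput` and `mean_ttft_ms` are illustrative and should be checked against your own files), the serving results can be summarized like this:
```python
import json
from pathlib import Path

results_dir = Path("benchmarks/results")

for result_file in sorted(results_dir.glob("serving_*.json")):
    data = json.loads(result_file.read_text())
    # Field names are illustrative; inspect the JSON to confirm the exact keys.
    throughput = data.get("request_throughput")
    ttft = data.get("mean_ttft_ms")
    print(f"{result_file.name}: {throughput} req/s, mean TTFT {ttft} ms")
```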
### Use benchmark cli
For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:
```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```
2. Run the performance tests using the CLI:
```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
--hf-split train --endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 200 \
--request-rate 16
```
#### Offline
- **Throughput**
```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```
- **Latency**
```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
--load-format dummy --num-iters-warmup 5 --num-iters 15
```

View File

@ -0,0 +1,158 @@
from typing import Tuple
import numpy as np
import pytest
import torch
import torch_npu # noqa: F401
import vllm # noqa: F401
import vllm_ascend.platform # noqa: F401
def benchmark_npu(fn, num_iterations=100, num_warmup_iterations=50):
"""
Benchmark function for NPU operations
Args:
fn: Function to benchmark
num_iterations: Number of timing iterations
num_warmup_iterations: Number of warmup iterations
Returns:
float: Minimum elapsed time in seconds
"""
start = torch.npu.Event(enable_timing=True)
end = torch.npu.Event(enable_timing=True)
times = np.zeros(num_iterations + num_warmup_iterations)
# Run iterations
for i in range(num_warmup_iterations + num_iterations):
with torch.no_grad():
start.record()
fn() # Execute the function
end.record()
torch.npu.synchronize()
times[i] = start.elapsed_time(end)
# Remove warmup iterations and convert to seconds
times = times[num_warmup_iterations:]
elapsed_time = np.amin(times) / 1000
return elapsed_time
def get_masked_input_and_mask_ref(
input_: torch.Tensor,
org_vocab_start_index: int,
org_vocab_end_index: int,
num_org_vocab_padding: int,
added_vocab_start_index: int,
added_vocab_end_index: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Reference implementation for verification"""
org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
added_vocab_mask = (input_ >= added_vocab_start_index) & (
input_ < added_vocab_end_index
)
added_offset = (
added_vocab_start_index
- (org_vocab_end_index - org_vocab_start_index)
- num_org_vocab_padding
)
valid_offset = (org_vocab_start_index * org_vocab_mask) + (
added_offset * added_vocab_mask
)
vocab_mask = org_vocab_mask | added_vocab_mask
masked_input = vocab_mask * (input_ - valid_offset)
return masked_input, ~vocab_mask
DTYPES = [torch.int32]
SHAPES = [(3, 4, 5)]
DEVICES = [f"npu:{0}"]
SEEDS = [0]
@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("device", DEVICES)
@pytest.mark.parametrize("seed", SEEDS)
@torch.inference_mode()
def test_get_masked_input_and_mask(
shape: Tuple[int, ...],
dtype: torch.dtype,
device: str,
seed: int,
) -> None:
# Set random seed and device
torch.manual_seed(seed)
torch.set_default_device(device)
# Generate random input tensor
input_tensor = torch.randint(0, 1000, shape, dtype=dtype)
# Test parameters
test_case = {
"org_start": 100,
"org_end": 200,
"padding": 0,
"added_start": 300,
"added_end": 400,
}
# Define reference function
def ref_fn():
return get_masked_input_and_mask_ref(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Define custom function
def custom_fn():
return torch.ops._C.get_masked_input_and_mask(
input_tensor,
test_case["org_start"],
test_case["org_end"],
test_case["padding"],
test_case["added_start"],
test_case["added_end"],
)
# Get results for correctness testing
ref_masked_input, ref_mask = ref_fn()
custom_masked_input, custom_mask = custom_fn()
# Benchmark both implementations
ref_time = benchmark_npu(ref_fn)
custom_time = benchmark_npu(custom_fn)
# Print performance results
print("\nPerformance Results:")
print(f"Reference implementation: {ref_time * 1000:.3f} ms")
print(f"Custom implementation: {custom_time * 1000:.3f} ms")
print(f"Speedup: {ref_time / custom_time:.2f}x")
# Compare results for correctness
ref_masked_input = ref_masked_input.to(dtype)
print("\nResults comparison:")
print("custom_masked_input:", custom_masked_input)
print("ref_masked_input:", ref_masked_input)
print("custom_mask:", custom_mask)
print("ref_mask:", ref_mask)
torch.testing.assert_close(
custom_masked_input,
ref_masked_input,
rtol=1e-5,
atol=1e-5,
msg=f"Masked input mismatch for case: {test_case}",
)
torch.testing.assert_close(
custom_mask,
ref_mask,
rtol=1e-5,
atol=1e-5,
msg=f"Mask mismatch for case: {test_case}",
)

View File

@ -1,5 +1,4 @@
pandas
datasets
modelscope
libcst
tabulate

View File

@ -49,36 +49,43 @@ def read_markdown(file):
def results_to_json(latency, throughput, serving):
return json.dumps({
'latency': latency.to_dict(),
'throughput': throughput.to_dict(),
'serving': serving.to_dict()
})
return json.dumps(
{
"latency": latency.to_dict(),
"throughput": throughput.to_dict(),
"serving": serving.to_dict(),
}
)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process the results of the benchmark tests.")
description="Process the results of the benchmark tests."
)
parser.add_argument(
"--results_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.")
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--output_folder",
type=str,
default="../results/",
help="The folder where the benchmark results are stored.")
parser.add_argument("--markdown_template",
type=str,
default="./perf_result_template.md",
help="The template file for the markdown report.")
parser.add_argument("--tag",
default="main",
help="Tag to be used for release message.")
parser.add_argument("--commit_id",
default="",
help="Commit ID to be used for release message.")
help="The folder where the benchmark results are stored.",
)
parser.add_argument(
"--markdown_template",
type=str,
default="./perf_result_template.md",
help="The template file for the markdown report.",
)
parser.add_argument(
"--tag", default="main", help="Tag to be used for release message."
)
parser.add_argument(
"--commit_id", default="", help="Commit ID to be used for release message."
)
args = parser.parse_args()
results_folder = (CUR_PATH / args.results_folder).resolve()
@ -87,7 +94,6 @@ if __name__ == "__main__":
# collect results
for test_file in results_folder.glob("*.json"):
with open(test_file) as f:
raw_result = json.loads(f.read())
@ -111,7 +117,8 @@ if __name__ == "__main__":
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000
# add the result to raw_result
@ -129,55 +136,53 @@ if __name__ == "__main__":
continue
print(f"Skipping {test_file}")
serving_results.sort(key=lambda x: (len(x['test_name']), x['test_name']))
serving_results.sort(key=lambda x: (len(x["test_name"]), x["test_name"]))
latency_results = pd.DataFrame.from_dict(latency_results)
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)
raw_results_json = results_to_json(latency_results, throughput_results,
serving_results)
raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(
latency_column_mapping.keys())].rename(
columns=latency_column_mapping)
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
if not serving_results.empty:
serving_results = serving_results[list(
serving_column_mapping.keys())].rename(
columns=serving_column_mapping)
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
if not throughput_results.empty:
throughput_results = throughput_results[list(
throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping)
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)
processed_results_json = results_to_json(latency_results,
throughput_results,
serving_results)
processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
tablefmt='pipe',
showindex=False)
serving_md_table = tabulate(serving_results,
headers='keys',
tablefmt='pipe',
showindex=False)
throughput_md_table = tabulate(throughput_results,
headers='keys',
tablefmt='pipe',
showindex=False)
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)
# document the result
print(output_folder)
with open(output_folder / "benchmark_results.md", "w") as f:
results = read_markdown(markdown_template)
results = results.format(
latency_tests_markdown_table=latency_md_table,
throughput_tests_markdown_table=throughput_md_table,
serving_tests_markdown_table=serving_md_table,
benchmarking_results_in_json_string=processed_results_json)
benchmarking_results_in_json_string=processed_results_json,
)
f.write(results)

View File

@ -1,68 +0,0 @@
from argparse import ArgumentParser
import libcst as cst
import libcst.matchers as m
# Patch the benchmark_dataset.py file to set streaming=False in load_dataset calls
# TODO(Potabk): Remove this patch when the issue is fixed in the upstream
class StreamingFalseTransformer(cst.CSTTransformer):
def __init__(self):
self.in_target_class = False
self.in_target_func = False
def visit_ClassDef(self, node):
if node.name.value == "HuggingFaceDataset":
self.in_target_class = True
def leave_ClassDef(self, original_node, updated_node):
self.in_target_class = False
return updated_node
def visit_FunctionDef(self, node):
if self.in_target_class and node.name.value == "load_data":
self.in_target_func = True
def leave_FunctionDef(self, original_node, updated_node):
self.in_target_func = False
return updated_node
def leave_Call(self, original_node, updated_node):
if self.in_target_class and self.in_target_func:
if m.matches(updated_node.func, m.Name("load_dataset")):
new_args = []
for arg in updated_node.args:
if arg.keyword and arg.keyword.value == "streaming":
new_arg = arg.with_changes(value=cst.Name("False"))
new_args.append(new_arg)
else:
new_args.append(arg)
return updated_node.with_changes(args=new_args)
return updated_node
def patch_file(path):
with open(path, "r", encoding="utf-8") as f:
source = f.read()
module = cst.parse_module(source)
modified = module.visit(StreamingFalseTransformer())
with open(path, "w", encoding="utf-8") as f:
f.write(modified.code)
print(f"Patched: {path}")
if __name__ == '__main__':
parser = ArgumentParser(
description=
"Patch benchmark_dataset.py to set streaming=False in load_dataset calls"
)
parser.add_argument("--path",
type=str,
help="Path to the benchmark_dataset.py file")
args = parser.parse_args()
patch_file(args.path)

View File

@ -1,5 +1,5 @@
#!/bin/bash
set -e
check_npus() {
# shellcheck disable=SC2155
@ -54,13 +54,20 @@ json2args() {
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
timeout 1200 bash -c '
until curl -s -X GET localhost:8000/health; do
echo "Waiting for vllm server to start..."
sleep 1
done' && return 0 || return 1
local waited=0
local timeout_sec=1200
while (( waited < timeout_sec )); do
if curl -s -X GET localhost:8000/health > /dev/null; then
return 0
fi
echo "Waiting for vllm server to start..."
sleep 1
((waited++))
done
echo "Timeout waiting for server"
return 1
}
get_cur_npu_id() {
@ -114,7 +121,7 @@ run_latency_tests() {
latency_params=$(echo "$params" | jq -r '.parameters')
latency_args=$(json2args "$latency_params")
latency_command="python3 vllm_benchmarks/benchmark_latency.py \
latency_command="vllm bench latency \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"
@ -157,7 +164,7 @@ run_throughput_tests() {
throughput_params=$(echo "$params" | jq -r '.parameters')
throughput_args=$(json2args "$throughput_params")
throughput_command="python3 vllm_benchmarks/benchmark_throughput.py \
throughput_command="vllm bench throughput \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"
@ -243,7 +250,7 @@ run_serving_tests() {
new_test_name=$test_name"_qps_"$qps
client_command="python3 vllm_benchmarks/benchmark_serving.py \
client_command="vllm bench serve \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
@ -271,17 +278,10 @@ cleanup_on_error() {
rm -rf $RESULTS_FOLDER
}
get_benchmarks_scripts() {
git clone -b main --depth=1 https://github.com/vllm-project/vllm.git && \
mv vllm/benchmarks vllm_benchmarks
rm -rf ./vllm
}
main() {
START_TIME=$(date +%s)
check_npus
# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
@ -294,12 +294,10 @@ main() {
export VLLM_LOG_LEVEL="WARNING"
# set env
export HF_ENDPOINT="https://hf-mirror.com"
export VLLM_USE_MODELSCOPE=True
# prepare for benchmarking
cd benchmarks || exit 1
get_benchmarks_scripts
python3 scripts/patch_benchmark_dataset.py --path vllm_benchmarks/benchmark_dataset.py
trap cleanup EXIT
QUICK_BENCHMARK_ROOT=./

View File

@ -21,108 +21,159 @@ import gc
import json
import multiprocessing
import sys
import time
from multiprocessing import Queue
import lm_eval
import torch
UNIMODAL_MODEL_NAME = ["Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen3-8B-Base"]
# URLs for version information in Markdown report
VLLM_URL = "https://github.com/vllm-project/vllm/commit/"
VLLM_ASCEND_URL = "https://github.com/vllm-project/vllm-ascend/commit/"
# Model and task configurations
UNIMODAL_MODEL_NAME = ["Qwen/Qwen3-8B-Base", "Qwen/Qwen3-30B-A3B"]
UNIMODAL_TASK = ["ceval-valid", "gsm8k"]
MULTIMODAL_NAME = ["Qwen/Qwen2.5-VL-7B-Instruct"]
MULTIMODAL_TASK = ["mmmu_val"]
batch_size_dict = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
# Batch size configurations per task
BATCH_SIZE = {"ceval-valid": 1, "mmlu": 1, "gsm8k": "auto", "mmmu_val": 1}
MODEL_RUN_INFO = {
"Qwen/Qwen2.5-7B-Instruct":
("export MODEL_ARGS='pretrained={model}, max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen3-8B-Base":
("export MODEL_ARGS='pretrained={model}, max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen2.5-VL-7B-Instruct":
("export MODEL_ARGS='pretrained={model}, max_model_len=8192,dtype=auto,tensor_parallel_size=4,max_images=2'\n"
"lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --batch_size 1"),
# Model type mapping (vllm for text, vllm-vlm for vision-language)
MODEL_TYPE = {
"Qwen/Qwen3-8B-Base": "vllm",
"Qwen/Qwen3-30B-A3B": "vllm",
"Qwen/Qwen2.5-VL-7B-Instruct": "vllm-vlm",
}
# Command templates for running evaluations
MODEL_RUN_INFO = {
"Qwen/Qwen3-30B-A3B": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen3-8B-Base": (
"export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
"lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
),
"Qwen/Qwen2.5-VL-7B-Instruct": (
"export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2'\n"
"lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
"--apply_chat_template --fewshot_as_multiturn --batch_size 1"
),
}
def run_accuracy_unimodal(queue, model, dataset):
# Evaluation metric filters per task
FILTER = {
"gsm8k": "exact_match,flexible-extract",
"ceval-valid": "acc,none",
"mmmu_val": "acc,none",
}
# Expected accuracy values for models
EXPECTED_VALUE = {
"Qwen/Qwen3-30B-A3B": {"ceval-valid": 0.83, "gsm8k": 0.85},
"Qwen/Qwen3-8B-Base": {"ceval-valid": 0.82, "gsm8k": 0.83},
"Qwen/Qwen2.5-VL-7B-Instruct": {"mmmu_val": 0.51},
}
PARALLEL_MODE = {
"Qwen/Qwen3-8B-Base": "TP",
"Qwen/Qwen2.5-VL-7B-Instruct": "TP",
"Qwen/Qwen3-30B-A3B": "EP",
}
# Execution backend configuration
EXECUTION_MODE = {
"Qwen/Qwen3-8B-Base": "ACLGraph",
"Qwen/Qwen2.5-VL-7B-Instruct": "ACLGraph",
"Qwen/Qwen3-30B-A3B": "ACLGraph",
}
# Model arguments for evaluation
MODEL_ARGS = {
"Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6",
"Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2",
"Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True",
}
# Whether to apply chat template formatting
APPLY_CHAT_TEMPLATE = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Whether to handle few-shot examples as multi-turn dialogues.
FEWSHOT_AS_MULTITURN = {
"Qwen/Qwen3-8B-Base": True,
"Qwen/Qwen2.5-VL-7B-Instruct": True,
"Qwen/Qwen3-30B-A3B": False,
}
# Relative tolerance for accuracy checks
RTOL = 0.03
ACCURACY_FLAG = {}
def run_accuracy_test(queue, model, dataset):
"""Run accuracy evaluation for a model on a dataset in separate process"""
try:
model_args = f"pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6"
results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks=dataset,
apply_chat_template=True,
fewshot_as_multiturn=True,
batch_size=batch_size_dict[dataset],
num_fewshot=5,
)
print(f"Success: {model} on {dataset}")
eval_params = {
"model": MODEL_TYPE[model],
"model_args": MODEL_ARGS[model],
"tasks": dataset,
"apply_chat_template": APPLY_CHAT_TEMPLATE[model],
"fewshot_as_multiturn": FEWSHOT_AS_MULTITURN[model],
"batch_size": BATCH_SIZE[dataset],
}
if MODEL_TYPE[model] == "vllm":
eval_params["num_fewshot"] = 5
results = lm_eval.simple_evaluate(**eval_params)
print(f"Success: {model} on {dataset} ")
measured_value = results["results"]
queue.put(measured_value)
except Exception as e:
print(f"Error in run_accuracy_unimodal: {e}")
print(f"Error in run_accuracy_test: {e}")
queue.put(e)
sys.exit(1)
finally:
torch.npu.empty_cache()
if "results" in locals():
del results
gc.collect()
def run_accuracy_multimodal(queue, model, dataset):
try:
model_args = f"pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=4,max_images=2"
results = lm_eval.simple_evaluate(
model="vllm-vlm",
model_args=model_args,
tasks=dataset,
apply_chat_template=True,
fewshot_as_multiturn=True,
batch_size=batch_size_dict[dataset],
)
print(f"Success: {model} on {dataset}")
measured_value = results["results"]
queue.put(measured_value)
except Exception as e:
print(f"Error in run_accuracy_multimodal: {e}")
queue.put(e)
sys.exit(1)
finally:
torch.npu.empty_cache()
gc.collect()
time.sleep(5)
def generate_md(model_name, tasks_list, args, datasets):
run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name,
datasets=datasets)
"""Generate Markdown report with evaluation results"""
# Format the run command
run_cmd = MODEL_RUN_INFO[model_name].format(model=model_name, datasets=datasets)
model = model_name.split("/")[1]
preamble = f"""# 🎯 {model} Accuracy Test
<div>
<strong>vLLM version:</strong> vLLM: {args.vllm_version}, vLLM Ascend: {args.vllm_ascend_version} <br>
</div>
<div>
<strong>Software Environment:</strong> CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version} <br>
</div>
<div>
<strong>Hardware Environment</strong>: Atlas A2 Series <br>
</div>
<div>
<strong>Datasets</strong>: {datasets} <br>
</div>
<div>
<strong>Command</strong>:
```bash
{run_cmd}
```
</div>
<div>&nbsp;</div>
# Version information section
version_info = (
f"**vLLM Version**: vLLM: {args.vllm_version} "
f"([{args.vllm_commit}]({VLLM_URL + args.vllm_commit})), "
f"vLLM Ascend: {args.vllm_ascend_version} "
f"([{args.vllm_ascend_commit}]({VLLM_ASCEND_URL + args.vllm_ascend_commit})) "
)
# Report header with system info
preamble = f"""# {model}
{version_info}
**Software Environment**: CANN: {args.cann_version}, PyTorch: {args.torch_version}, torch-npu: {args.torch_npu_version}
**Hardware Environment**: Atlas A2 Series
**Datasets**: {datasets}
**Parallel Mode**: {PARALLEL_MODE[model_name]}
**Execution Mode**: {EXECUTION_MODE[model_name]}
**Command**:
```bash
{run_cmd}
```
"""
header = (
@ -131,6 +182,7 @@ def generate_md(model_name, tasks_list, args, datasets):
)
rows = []
rows_sub = []
# Process results for each task
for task_dict in tasks_list:
for key, stats in task_dict.items():
alias = stats.get("alias", key)
@ -153,25 +205,48 @@ def generate_md(model_name, tasks_list, args, datasets):
n_shot = "5"
else:
n_shot = "0"
row = (f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| ↑ {value:>5.4f} "
f"| ± {stderr:>5.4f} |")
flag = ACCURACY_FLAG.get(task_name, "")
row = (
f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| {flag}{value:>5.4f} "
f"| ± {stderr:>5.4f} |"
)
if not task_name.startswith("-"):
rows.append(row)
rows_sub.append("<details>" + "\n" + "<summary>" + task_name +
" details" + "</summary>" + "\n" * 2 + header)
rows_sub.append(
"<details>"
+ "\n"
+ "<summary>"
+ task_name
+ " details"
+ "</summary>"
+ "\n" * 2
+ header
)
rows_sub.append(row)
rows_sub.append("</details>")
md = preamble + "\n" + header + "\n" + "\n".join(rows) + "\n" + "\n".join(
rows_sub) + "\n"
# Combine all Markdown sections
md = (
preamble
+ "\n"
+ header
+ "\n"
+ "\n".join(rows)
+ "\n"
+ "\n".join(rows_sub)
+ "\n"
)
print(md)
return md
def safe_md(args, accuracy, datasets):
"""
Safely generate and save Markdown report from accuracy results.
"""
data = json.loads(json.dumps(accuracy))
for model_key, tasks_list in data.items():
md_content = generate_md(model_key, tasks_list, args, datasets)
@ -181,37 +256,50 @@ def safe_md(args, accuracy, datasets):
def main(args):
"""Main evaluation workflow"""
accuracy = {}
accuracy[args.model] = []
result_queue: Queue[float] = multiprocessing.Queue()
if args.model in UNIMODAL_MODEL_NAME:
datasets = ",".join(UNIMODAL_TASK)
for dataset in UNIMODAL_TASK:
p = multiprocessing.Process(target=run_accuracy_unimodal,
args=(result_queue, args.model,
dataset))
p.start()
datasets = UNIMODAL_TASK
else:
datasets = MULTIMODAL_TASK
datasets_str = ",".join(datasets)
# Evaluate model on each dataset
for dataset in datasets:
accuracy_expected = EXPECTED_VALUE[args.model][dataset]
p = multiprocessing.Process(
target=run_accuracy_test, args=(result_queue, args.model, dataset)
)
p.start()
p.join()
if p.is_alive():
p.terminate()
p.join()
result = result_queue.get()
print(result)
accuracy[args.model].append(result)
if args.model in MULTIMODAL_NAME:
datasets = ",".join(MULTIMODAL_TASK)
for dataset in MULTIMODAL_TASK:
p = multiprocessing.Process(target=run_accuracy_multimodal,
args=(result_queue, args.model,
dataset))
p.start()
p.join()
result = result_queue.get()
print(result)
accuracy[args.model].append(result)
gc.collect()
torch.npu.empty_cache()
time.sleep(10)
result = result_queue.get()
print(result)
if (
accuracy_expected - RTOL
< result[dataset][FILTER[dataset]]
< accuracy_expected + RTOL
):
ACCURACY_FLAG[dataset] = ""
else:
ACCURACY_FLAG[dataset] = ""
accuracy[args.model].append(result)
print(accuracy)
safe_md(args, accuracy, datasets)
safe_md(args, accuracy, datasets_str)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
multiprocessing.set_start_method("spawn", force=True)
# Initialize argument parser
parser = argparse.ArgumentParser(
description="Run model accuracy evaluation and generate report"
)
parser.add_argument("--output", type=str, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--vllm_ascend_version", type=str, required=False)
@ -219,8 +307,7 @@ if __name__ == "__main__":
parser.add_argument("--torch_npu_version", type=str, required=False)
parser.add_argument("--vllm_version", type=str, required=False)
parser.add_argument("--cann_version", type=str, required=False)
parser.add_argument("--vllm_commit", type=str, required=False)
parser.add_argument("--vllm_ascend_commit", type=str, required=False)
args = parser.parse_args()
# TODO(yikun):
# 1. add a exit 1 if accuracy is not as expected
# 2. Add ✅, ❌ to markdown if accuracy is not as expected
main(args)

View File

@ -9,5 +9,15 @@
"num_iters_warmup": 5,
"num_iters": 15
}
},
{
"test_name": "latency_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]

View File

@ -18,7 +18,7 @@
},
"client_parameters": {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"backend": "openai-chat",
"endpoint_type": "openai-chat",
"dataset_name": "hf",
"hf_split": "train",
"endpoint": "/v1/chat/completions",
@ -44,7 +44,31 @@
},
"client_parameters": {
"model": "Qwen/Qwen3-8B",
"backend": "vllm",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen2_5_7B_tp1",
"qps_list": [
1,
4,
16,
"inf"
],
"server_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"endpoint_type": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200

View File

@ -22,6 +22,17 @@
"dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
"num_prompts": 200
}
},
{
"test_name": "throughput_qwen2_5_7B_tp1",
"parameters": {
"model": "Qwen/Qwen2.5-7B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "/github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]

View File

@ -1,6 +1,5 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -13,5 +12,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
import vllm_ascend.patch.platform.patch_0_9_0.patch_distributed # noqa
coverage:
status:
# non-voting, new code must be fully tested
patch:
default:
target: 100%
# non-voting
informational: true
# non-voting
project:
default:
# non-voting
informational: true

View File

@ -1,241 +0,0 @@
/*
* Copyright (c) China Merchants Bank Co., Ltd. 2025. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "kernel_operator.h"
constexpr int32_t BUFFER_NUM = 1;
class KernelAdvanceStep{
public:
__aicore__ inline KernelAdvanceStep() {}
__aicore__ inline void Init(int32_t tasks_per_core,
int32_t num_queries,
__gm__ int64_t* input_tokens_ptr,
__gm__ int64_t* sampled_token_ids_ptr,
__gm__ int64_t* input_positions_ptr,
__gm__ int32_t* seq_lens_ptr,
__gm__ int32_t* slot_mapping_ptr)
{
this->tasks_per_core = tasks_per_core;
this->start_id = this->tasks_per_core * AscendC::GetBlockIdx();
this->end_id = this->tasks_per_core * (AscendC::GetBlockIdx() + 1) - 1;
// actual task nums of each core
this->actual_task_per_core = tasks_per_core;
if(this->end_id >= num_queries) {
this->actual_task_per_core = num_queries - this->start_id;
this->end_id = num_queries - 1;
}
int32_t offset_this_core = this->tasks_per_core * AscendC::GetBlockIdx();
// init outQues
pipe.InitBuffer(outQueInputTokens, BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
pipe.InitBuffer(outQueInputPos, BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
pipe.InitBuffer(outQueSeqLen, BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
pipe.InitBuffer(outQueSlotMapping, BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
// init inQues
pipe.InitBuffer(inQueSeqLen,BUFFER_NUM, this->actual_task_per_core * sizeof(int32_t));
pipe.InitBuffer(inQueSampledTokenIds,BUFFER_NUM, this->actual_task_per_core * sizeof(int64_t));
// init GlobalMemory
inputTokensGm.SetGlobalBuffer((__gm__ int64_t *)input_tokens_ptr + offset_this_core, this->actual_task_per_core);
sampledTokenIdsGm.SetGlobalBuffer((__gm__ int64_t *)sampled_token_ids_ptr + offset_this_core, this->actual_task_per_core);
inputPositionsGm.SetGlobalBuffer((__gm__ int64_t *)input_positions_ptr + offset_this_core, this->actual_task_per_core);
seqLensGm.SetGlobalBuffer((__gm__ int32_t *)seq_lens_ptr + offset_this_core, this->actual_task_per_core);
slotMappingGm.SetGlobalBuffer((__gm__ int32_t *)slot_mapping_ptr + offset_this_core, this->actual_task_per_core);
}
__aicore__ inline void Process(int64_t block_size, __gm__ int32_t* block_tables_ptr, int64_t block_tables_stride)
{
// no need for tiling or pipeline parallelism within each core, as the amount of data processed is very small
CopyIn();
Update(block_size, block_tables_ptr, block_tables_stride);
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<int32_t> seqLenLocalIn = inQueSeqLen.AllocTensor<int32_t>();
AscendC::LocalTensor<int64_t> sampledTokenIdsLocal = inQueSampledTokenIds.AllocTensor<int64_t>();
AscendC::DataCopyExtParams copyParams32{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int32_t)), 0, 0, 0}; // blockLen = tasks_per_core * 32 / 8 bytes (int32 is 4 bytes)
AscendC::DataCopyExtParams copyParams64{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int64_t)), 0, 0, 0}; // blockLen = tasks_per_core * 64 / 8 bytes (int64 is 8 bytes)
// calculate the nums that need padded
// so that the total length becomes a multiple of 32 bytes which is a requirement of DataCopy Function.
uint8_t remainNum32 =this->actual_task_per_core * sizeof(int32_t) % 32;
uint8_t needPadElements32 = remainNum32 == 0 ? remainNum32 : (32 - remainNum32) / sizeof(int32_t);
AscendC::DataCopyPadExtParams<int32_t> padParams32{true, 0, needPadElements32, 0};
// calculate the nums that need padded
// so that the total length becomes a multiple of 32 bytes which is a requirement of DataCopy Function.
uint8_t remainNum64 =this->actual_task_per_core * sizeof(int64_t) % 32;
uint8_t needPadElements64 = remainNum64 == 0 ? remainNum64 : (32 - remainNum64) / sizeof(int64_t);
AscendC::DataCopyPadExtParams<int64_t> padParams64{true, 0, needPadElements64, 0};
AscendC::DataCopyPad(seqLenLocalIn, seqLensGm, copyParams32, padParams32);
AscendC::DataCopyPad(sampledTokenIdsLocal, sampledTokenIdsGm, copyParams64, padParams64);
inQueSeqLen.EnQue(seqLenLocalIn);
inQueSampledTokenIds.EnQue(sampledTokenIdsLocal);
}
__aicore__ inline void Update(int64_t block_size, __gm__ int32_t* block_tables_ptr, int64_t block_tables_stride)
{
// input
AscendC::LocalTensor<int32_t> seqLenLocalIn = inQueSeqLen.DeQue<int32_t>();
AscendC::LocalTensor<int64_t> sampledTokenIdsLocal = inQueSampledTokenIds.DeQue<int64_t>();
// output
AscendC::LocalTensor<int64_t> inputTokensLocal = outQueInputTokens.AllocTensor<int64_t>();
AscendC::LocalTensor<int64_t> inputPosLocal = outQueInputPos.AllocTensor<int64_t>();
AscendC::LocalTensor<int32_t> seqLenLocalOut = outQueSeqLen.AllocTensor<int32_t>();
AscendC::LocalTensor<int32_t> slotMappingLocal = outQueSlotMapping.AllocTensor<int32_t>();
auto unary_params = AscendC::UnaryRepeatParams(1, 1, 8, 8);
//Use "for" instead of AscendC::Adds function because AscendC::Adds does not work
//when srcLocalMemory has different datatype from dstLocalMemory
for(int i=0; i < this->actual_task_per_core; i++) {
inputTokensLocal.SetValue(i, sampledTokenIdsLocal.GetValue(i));
inputPosLocal.SetValue(i, seqLenLocalIn.GetValue(i));
}
AscendC::Adds<int32_t, false>(seqLenLocalOut, seqLenLocalIn, 1, (uint64_t)0, 1, unary_params);
// Gather blockTables with dim=1, block_index. No Ascend Function available, use "for" instead.
for(int cur_query_id = this->start_id, i = 0; i < this->actual_task_per_core; cur_query_id++, i++) {
__gm__ int32_t const* seq_block_tables_ptr = block_tables_ptr + block_tables_stride * cur_query_id;
int block_index = inputPosLocal.GetValue(i) / block_size;
int block_offset = inputPosLocal.GetValue(i) % block_size;
int slot_num = seq_block_tables_ptr[block_index] * block_size + block_offset;
// Update slot_mapping
slotMappingLocal.SetValue(i,slot_num);
}
outQueInputTokens.EnQue(inputTokensLocal);
outQueInputPos.EnQue(inputPosLocal);
outQueSeqLen.EnQue(seqLenLocalOut);
outQueSlotMapping.EnQue(slotMappingLocal);
inQueSampledTokenIds.FreeTensor(sampledTokenIdsLocal);
inQueSeqLen.FreeTensor(seqLenLocalIn);
}
__aicore__ inline void CopyOut()
{
AscendC::DataCopyExtParams copyParams32{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int32_t)),0,0,0};
AscendC::DataCopyExtParams copyParams64{1, static_cast<uint32_t>(this->actual_task_per_core * sizeof(int64_t)),0,0,0};
AscendC::LocalTensor<int64_t> inputTokensLocal = outQueInputTokens.DeQue<int64_t>();
AscendC::DataCopyPad(inputTokensGm, inputTokensLocal, copyParams64);
outQueInputTokens.FreeTensor(inputTokensLocal);
AscendC::LocalTensor<int64_t> inputPosLocal = outQueInputPos.DeQue<int64_t>();
AscendC::DataCopyPad(inputPositionsGm, inputPosLocal, copyParams64);
outQueInputPos.FreeTensor(inputPosLocal);
AscendC::LocalTensor<int32_t> seqLenLocalOut = outQueSeqLen.DeQue<int32_t>();
AscendC::DataCopyPad(seqLensGm, seqLenLocalOut, copyParams32);
outQueSeqLen.FreeTensor(seqLenLocalOut);
AscendC::LocalTensor<int32_t> slotMappingLocal = outQueSlotMapping.DeQue<int32_t>();
AscendC::DataCopyPad(slotMappingGm, slotMappingLocal, copyParams32);
outQueSlotMapping.FreeTensor(slotMappingLocal);
}
private:
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueInputTokens, outQueInputPos,
outQueSeqLen, outQueSlotMapping;
AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueSeqLen,
inQueSampledTokenIds,
inQueBlockTables;
AscendC::GlobalTensor<int64_t> inputTokensGm, sampledTokenIdsGm, inputPositionsGm ;
AscendC::GlobalTensor<int32_t> seqLensGm, slotMappingGm, blockTablesGm;
int32_t tasks_per_core, start_id, end_id, actual_task_per_core;
};
extern "C" __global__ __aicore__ void AdvanceStepFlashAttnKernel(
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
__gm__ int64_t* input_tokens_ptr,
__gm__ int64_t* sampled_token_ids_ptr,
__gm__ int64_t* input_positions_ptr,
__gm__ int32_t* seq_lens_ptr,
__gm__ int32_t* slot_mapping_ptr,
__gm__ int32_t* block_tables_ptr,
int64_t block_tables_stride,
int32_t tasks_per_core
)
{
int start_id = tasks_per_core * AscendC::GetBlockIdx();
// no task for this core.
if(start_id >= num_queries) {
return;
}
KernelAdvanceStep advanceStep;
advanceStep.Init(tasks_per_core, num_queries, input_tokens_ptr, sampled_token_ids_ptr, input_positions_ptr, seq_lens_ptr, slot_mapping_ptr);
advanceStep.Process(block_size,block_tables_ptr,block_tables_stride);
}
namespace vllm_ascend
{
extern void launch_advance_step_flashattn(
void* stream,
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
int64_t* input_tokens_ptr,
int64_t* sampled_token_ids_ptr,
int64_t* input_positions_ptr,
int32_t* seq_lens_ptr,
int32_t* slot_mapping_ptr,
int32_t* block_tables_ptr,
int64_t block_tables_stride)
{
int32_t num_cores = 20;
if(num_cores > num_queries) {
num_cores = num_queries;
}
// task num processed of each core
int32_t tasks_per_core = (num_queries + num_cores - 1) / num_cores;
AdvanceStepFlashAttnKernel<<<num_cores, nullptr, stream>>>(
num_seqs,
num_queries,
block_size,
input_tokens_ptr,
sampled_token_ids_ptr,
input_positions_ptr,
seq_lens_ptr,
slot_mapping_ptr,
block_tables_ptr,
block_tables_stride,
tasks_per_core);
}
}

View File

@ -0,0 +1,378 @@
/*
* Copyright (c) Huawei Technologies Co., Ltd. 2024. All rights reserved.
*/
#include "kernel_operator.h"
#include "kernel_tensor_impl.h"
#include "kernel_type.h"
#include "types.h"
#include "utils.h"
using vllm_ascend::AccType;
template<typename scalar_t>
class GetMaskedInputAndMask {
public:
__aicore__ inline GetMaskedInputAndMask() {}
__aicore__ inline ~GetMaskedInputAndMask() {
pipe.Reset();
}
__aicore__ inline void Init(
__gm__ scalar_t* input,
__gm__ scalar_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size)
{
// Initialize basic parameters
input_ = input;
masked_input_ = masked_input;
mask_out_ = mask_out;
org_vocab_start_index_ = org_vocab_start_index;
org_vocab_end_index_ = org_vocab_end_index;
size_ = ((size + 31) / 32) * 32;
added_offset_ = added_vocab_start_index -
(org_vocab_end_index - org_vocab_start_index) -
num_org_vocab_padding;
added_vocab_start_index_ = added_vocab_start_index;
added_vocab_end_index_ = added_vocab_end_index;
// Initialize global tensors
inputGlobal.SetGlobalBuffer(input);
maskedOutputGlobal.SetGlobalBuffer(masked_input);
maskOutGlobal.SetGlobalBuffer(mask_out);
// Initialize queues
pipe.InitBuffer(inQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(outQueue, 1, size_ * sizeof(scalar_t));
pipe.InitBuffer(maskQueue, 1, size_ * sizeof(bool));
// Initialize calculation buffers
// NOTE: calc_buf_1 and calc_buf_2 are also used for int16 casting on older archs.
pipe.InitBuffer(calc_buf_1, size_ * sizeof(float));
pipe.InitBuffer(calc_buf_2, size_ * sizeof(float));
// Initialize result queues
pipe.InitBuffer(result_ge_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_le_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_org_mask_que, BUFFER_NUM, size_ * sizeof(float));
pipe.InitBuffer(result_add_mask_que, BUFFER_NUM, size_ * sizeof(float));
// Initialize temporary buffers
pipe.InitBuffer(start_buf, size_ * sizeof(float));
pipe.InitBuffer(end_buf, size_ * sizeof(float));
pipe.InitBuffer(inputFloat_buf, size_ * sizeof(float)); // Also used for half intermediate in casting
pipe.InitBuffer(validOffset_buf, size_ * sizeof(float));
pipe.InitBuffer(vocabMask_buf_, size_ * sizeof(int8_t));
pipe.InitBuffer(ones_buf_, size_ * sizeof(float));
}
__aicore__ inline void Process()
{
CopyIn();
Compute();
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.AllocTensor<scalar_t>();
AscendC::DataCopy(inputLocal, inputGlobal, size_);
inQueue.EnQue(inputLocal);
}
__aicore__ inline void CompareWithValue(
AscendC::LocalTensor<int8_t>& result,
const AscendC::LocalTensor<float>& input,
const AscendC::LocalTensor<float>& compare_value,
bool is_greater_equal) {
AscendC::LocalTensor<float> compute_buf = calc_buf_1.Get<float>();
if (is_greater_equal) {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compare_value, compute_buf, size_);
} else {
AscendC::Max(compute_buf, input, compare_value, size_);
AscendC::Sub(compute_buf, compute_buf, compare_value, size_);
}
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::Mins(compute_buf, compute_buf, MIN_ACCURACY_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_1_FP32, size_);
AscendC::Muls(compute_buf, compute_buf, MAX_MUL_2_FP32, size_);
AscendC::Adds(compute_buf, compute_buf, NEGATIVE_ONE_FP32, size_);
AscendC::Abs(compute_buf, compute_buf, size_);
AscendC::LocalTensor<half> compute_buf_fp16 = calc_buf_2.Get<half>();
AscendC::Cast(compute_buf_fp16, compute_buf, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(result, compute_buf_fp16, AscendC::RoundMode::CAST_NONE, size_);
}
__aicore__ inline void ComputeRangeMask(
AscendC::LocalTensor<int8_t>& range_mask,
const AscendC::LocalTensor<float>& input,
const float start_value,
const float end_value) {
AscendC::LocalTensor<float> start_value_tensor = start_buf.Get<float>();
AscendC::LocalTensor<float> end_value_tensor = end_buf.Get<float>();
AscendC::Duplicate(start_value_tensor, start_value, size_);
AscendC::Duplicate(end_value_tensor, end_value, size_);
AscendC::LocalTensor<int8_t> ge_result = result_ge_que.AllocTensor<int8_t>();
AscendC::LocalTensor<int8_t> lt_result = result_le_que.AllocTensor<int8_t>();
CompareWithValue(ge_result, start_value_tensor, input, true);
CompareWithValue(lt_result, input, end_value_tensor, false);
#if (__CCE_AICORE__ >= 220)
AscendC::And(range_mask, ge_result, lt_result, size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 And. Use int16 And.
AscendC::LocalTensor<int16_t> ge_result_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> lt_result_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> range_mask_i16 = ge_result_i16;
// Use a temporary buffer for half type
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, ge_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(ge_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, lt_result, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(lt_result_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform And on int16_t tensors
AscendC::And(range_mask_i16, ge_result_i16, lt_result_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, range_mask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(range_mask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
}
__aicore__ inline void Compute() {
AscendC::LocalTensor<scalar_t> inputLocal = inQueue.DeQue<scalar_t>();
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.AllocTensor<scalar_t>();
AscendC::LocalTensor<int8_t> maskLocal = maskQueue.AllocTensor<int8_t>();
AscendC::LocalTensor<float> inputFloat = inputFloat_buf.Get<float>();
AscendC::Cast(inputFloat, inputLocal, AscendC::RoundMode::CAST_NONE, size_);
AscendC::LocalTensor<int8_t> orgVocabMask = result_org_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(orgVocabMask,
inputFloat,
static_cast<float>(org_vocab_start_index_),
static_cast<float>(org_vocab_end_index_));
AscendC::LocalTensor<int8_t> addedVocabMask = result_add_mask_que.AllocTensor<int8_t>();
ComputeRangeMask(addedVocabMask,
inputFloat,
static_cast<float>(added_vocab_start_index_),
static_cast<float>(added_vocab_end_index_));
AscendC::LocalTensor<float> validOffset = validOffset_buf.Get<float>();
AscendC::LocalTensor<float> constOrgStartIndex = start_buf.Get<float>();
AscendC::Duplicate(constOrgStartIndex, float(org_vocab_start_index_), size_);
AscendC::LocalTensor<half> orgVocabMask_fp16;
AscendC::LocalTensor<float> orgVocabMask_fp32;
AscendC::Cast(orgVocabMask_fp16, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_fp32, orgVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(validOffset, constOrgStartIndex, orgVocabMask_fp32, size_);
AscendC::LocalTensor<float> addedOffset;
AscendC::LocalTensor<float> addedOffsetTensor = end_buf.Get<float>();
AscendC::Duplicate(addedOffsetTensor, float(added_offset_), size_);
AscendC::LocalTensor<half> addedVocabMask_fp16;
AscendC::LocalTensor<float> addedVocabMask_fp32;
AscendC::Cast(addedVocabMask_fp16, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_fp32, addedVocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(addedOffset, addedOffsetTensor, addedVocabMask_fp32, size_);
AscendC::Add(validOffset, validOffset, addedOffset, size_);
AscendC::LocalTensor<int8_t> vocabMask = vocabMask_buf_.Get<int8_t>();
#if (__CCE_AICORE__ >= 220)
AscendC::Or(vocabMask,
orgVocabMask,
addedVocabMask,
size_);
#else
{
// WORKAROUND for older arch
// No direct int8->int16 cast. Use half as intermediate.
// No direct int8 Or. Use int16 Or.
AscendC::LocalTensor<int16_t> orgVocabMask_i16 = calc_buf_1.Get<int16_t>();
AscendC::LocalTensor<int16_t> addedVocabMask_i16 = calc_buf_2.Get<int16_t>();
AscendC::LocalTensor<int16_t> vocabMask_i16 = orgVocabMask_i16;
// Use a temporary buffer for half type. inputFloat_buf is free now.
AscendC::LocalTensor<half> tmp_half = inputFloat_buf.Get<half>();
// 1. Cast inputs: int8_t -> half -> int16_t
AscendC::Cast(tmp_half, orgVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(orgVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(tmp_half, addedVocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(addedVocabMask_i16, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
// 2. Perform Or on int16_t tensors
AscendC::Or(vocabMask_i16, orgVocabMask_i16, addedVocabMask_i16, size_);
// 3. Cast result back: int16_t -> half -> int8_t
AscendC::Cast(tmp_half, vocabMask_i16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask, tmp_half, AscendC::RoundMode::CAST_NONE, size_);
}
#endif
AscendC::Sub(inputFloat, inputFloat, validOffset, size_);
AscendC::LocalTensor<half> vocabMask_fp16;
AscendC::LocalTensor<float> vocabMask_fp32;
AscendC::Cast(vocabMask_fp16, vocabMask, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(vocabMask_fp32, vocabMask_fp16, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Mul(inputFloat, inputFloat, vocabMask_fp32, size_);
AscendC::Cast(maskedLocal, inputFloat, AscendC::RoundMode::CAST_CEIL, size_);
outQueue.EnQue(maskedLocal);
AscendC::LocalTensor<float> ones_tensor = ones_buf_.Get<float>();
AscendC::Duplicate(ones_tensor, (float)1, size_);
AscendC::LocalTensor<float> maskLocal_fp32;
AscendC::Sub(maskLocal_fp32, ones_tensor, vocabMask_fp32, size_);
AscendC::LocalTensor<half> maskLocal_fp16;
AscendC::Cast(maskLocal_fp16, maskLocal_fp32, AscendC::RoundMode::CAST_NONE, size_);
AscendC::Cast(maskLocal, maskLocal_fp16, AscendC::RoundMode::CAST_NONE, size_);
maskQueue.EnQue(maskLocal);
inQueue.FreeTensor(inputLocal);
}
__aicore__ inline void CopyOut()
{
AscendC::LocalTensor<scalar_t> maskedLocal = outQueue.DeQue<scalar_t>();
AscendC::LocalTensor<bool> maskLocal = maskQueue.DeQue<bool>();
AscendC::DataCopy(maskedOutputGlobal, maskedLocal, size_);
AscendC::DataCopy(maskOutGlobal, maskLocal, size_);
outQueue.FreeTensor(maskedLocal);
maskQueue.FreeTensor(maskLocal);
}
private:
static constexpr int32_t BUFFER_NUM = 2;
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueue;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue, maskQueue;
AscendC::GlobalTensor<scalar_t> inputGlobal, maskedOutputGlobal;
AscendC::GlobalTensor<bool> maskOutGlobal;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_1;
AscendC::TBuf<AscendC::TPosition::VECCALC> calc_buf_2;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_ge_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_le_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_org_mask_que;
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> result_add_mask_que;
// Temporary buffers
AscendC::TBuf<AscendC::TPosition::VECCALC> start_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> end_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> inputFloat_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> validOffset_buf;
AscendC::TBuf<AscendC::TPosition::VECCALC> vocabMask_buf_;
AscendC::TBuf<AscendC::TPosition::VECCALC> ones_buf_;
__gm__ scalar_t *input_, *masked_input_;
__gm__ bool *mask_out_;
int64_t size_;
int64_t org_vocab_start_index_, org_vocab_end_index_;
int64_t added_vocab_start_index_, added_vocab_end_index_;
int64_t added_offset_;
static constexpr float MIN_ACCURACY_FP32 = 1.1754943508222875e-38;
static constexpr float MAX_MUL_1_FP32 = 1125899906842624;
static constexpr float MAX_MUL_2_FP32 = 67108864;
static constexpr float NEGATIVE_ONE_FP32 = -1.0f;
};
extern "C" __global__ __aicore__ void get_masked_input_and_mask_kernel(
__gm__ int32_t* input,
__gm__ int32_t* masked_input,
__gm__ bool* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
{
GetMaskedInputAndMask<int32_t> op{};
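// Grid-stride split across AIV cores: core GetBlockIdx() handles chunks i, i + aiv_num, i + 2*aiv_num, ..., each chunk covering size/loop_cnt elements.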
for (int64_t i = AscendC::GetBlockIdx(); i < loop_cnt; i += aiv_num) {
op.Init(input + i * size/loop_cnt,
masked_input + i * size/loop_cnt,
mask_out + i * size/loop_cnt,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index, size/loop_cnt);
op.Process();
}
} // op destructor called here
}
namespace vllm_ascend {
void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num)
{
get_masked_input_and_mask_kernel<<<aiv_num, nullptr, stream>>>(
static_cast<int32_t*>(input),
static_cast<int32_t*>(masked_input),
static_cast<bool*>(mask_out),
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aiv_num);
}
} // namespace vllm_ascend

View File

@ -30,7 +30,11 @@ using vllm_ascend::local_mem_copy;
template <typename scalar_t, bool isNeox> class RotaryEmbedding {
// NOTE(ganyi): we use 512B as load stride for pipe, need to find another way to
// retrieve this size from runtime for more Soc support
#if (__CCE_AICORE__ >= 220)
static int constexpr loadSize = 512;
#else
static int constexpr loadSize = 1024 * 4;
#endif
using dst_t = scalar_t;
using acc_t = typename AccType<scalar_t>::type;
// only half tensor have cast instruct to int8, hardcode acc_dst_t as half
@ -326,7 +330,9 @@ private:
// Declare all the kernel entry here
ROPE_CUSTOM_KERNEL_DECLARE(half)
#if (__CCE_AICORE__ >= 220)
ROPE_CUSTOM_KERNEL_DECLARE(bfloat16_t)
#endif
namespace vllm_ascend {
@ -342,7 +348,7 @@ namespace vllm_ascend {
reinterpret_cast<TYPE *>(cosSinCache), rotDim, queryStride, keyStride, dstQueryStride, dstKeyStride, \
numHeads, numKvHeads, headSize, numTokens, loopCnt, blockDim);
// maximum number for runtime to launch a ascendc kernel.
// we use this to constrain the maximum number of block size
static const int64_t maxParallelSize = 65535;
@ -357,9 +363,13 @@ extern void rotary_embedding_impl(AscendType type, bool isNeox, void *stream, in
int blockDim = maxParallelSize > numTokens ? numTokens : maxParallelSize;
if (type == AscendType::FP16) {
ROTARY_EMBEDDING_KERNEL_CALL(half);
}
#if (__CCE_AICORE__ >= 220)
else if (type == AscendType::BF16) {
ROTARY_EMBEDDING_KERNEL_CALL(bfloat16_t);
}
#endif
else {
return;
}
}

View File

@ -20,9 +20,11 @@ namespace vllm_ascend {
template <typename scalar_t> struct AccType;
#if (__CCE_AICORE__ >= 220)
template <> struct AccType<bfloat16_t> {
using type = float;
};
#endif
template <> struct AccType<half> {
using type = half;

View File

@ -31,6 +31,20 @@ namespace vllm_ascend {
const int headSize, const int64_t numTokens, const uint32_t loopCnt,
uint32_t aivNum);
extern void get_masked_input_and_mask_impl(
void* stream,
void* input,
void* masked_input,
void* mask_out,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index,
const int64_t size,
const uint32_t loop_cnt,
const uint32_t aiv_num);
torch::Tensor weak_ref_tensor(torch::Tensor& tensor) {
if (!tensor.is_privateuseone()) {
throw std::runtime_error("Tensor must be on NPU device");
@ -46,16 +60,4 @@ namespace vllm_ascend {
auto new_tensor = at_npu::native::from_blob(data_ptr, sizes, strides, options);
return new_tensor;
}
extern void launch_advance_step_flashattn(
void* stream,
int64_t num_seqs,
int64_t num_queries,
int64_t block_size,
int64_t* input_tokens_ptr,
int64_t* sampled_token_ids_ptr,
int64_t* input_positions_ptr,
int32_t* seq_lens_ptr,
int32_t* slot_mapping_ptr,
int32_t* block_tables_ptr,
int64_t block_tables_stride);
}

View File

@ -99,85 +99,110 @@ std::tuple<at::Tensor, at::Tensor> rotary_embedding(at::Tensor &positions, at::T
return {query_dst, key_dst};
}
void verify_tensor(std::string const& name, at::Tensor const& t,
int64_t const size_0, int64_t const size_1,
c10::ScalarType const type) {
bool size_0_cond = true;
if (size_0 != -1) {
size_0_cond = t.size(0) == size_0;
}
std::tuple<at::Tensor, at::Tensor> get_masked_input_and_mask(
at::Tensor &input,
const int64_t org_vocab_start_index,
const int64_t org_vocab_end_index,
const int64_t num_org_vocab_padding,
const int64_t added_vocab_start_index,
const int64_t added_vocab_end_index)
/*
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L161-L198
Embedding parallelized in the vocabulary dimension.
bool size_1_cond = true;
if (size_1 != -1) {
size_1_cond = t.size(1) == size_1;
}
Adapted from torch.nn.Embedding, note that we pad the vocabulary size to
make sure it is divisible by the number of model parallel GPUs.
bool is_contiguous = t.is_contiguous();
bool same_type = t.dtype() == type;
In order to support various loading methods, we ensure that LoRA-added
embeddings are always at the end of TP-sharded tensors. In other words,
we shard base embeddings and LoRA embeddings separately (both padded),
and place them in the same tensor.
In this example, we will have the original vocab size = 1010,
added vocab size = 16 and padding to 64. Therefore, the total
vocab size with padding will be 1088 (because we first pad 1010 to
1024, add 16, and then pad to 1088).
Therefore, the tensor format looks like the following:
TP1, rank 0 (no sharding):
|< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
corresponding token_id: | 0 | 1 | ... | 1009 | -1 | ... | -1 | 1010 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |
bool pass = size_0_cond && size_1_cond && is_contiguous && same_type;
if (!pass) {
TORCH_CHECK(false, "tensor: name = ", name, ", shape = ", t.sizes(),
" is_cont = ", t.is_contiguous(), ", type = ", t.dtype(),
" is not as expected: shape = [", size_0, ", ", size_1,
"], type = ", type);
}
}
TP2, rank 0:
|< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
corresponding token_id: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 1000 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 527 | 520 | ... | 543 |
TP2, rank 1:
|< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1 | ... | -1 | -1 | ... | -1 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 519 | 520 | ... | 543 |
Parameters:
org_vocab_start_index //base embeddings start
org_vocab_end_index //base embeddings end
num_org_vocab_padding //base embeddings padding
added_vocab_start_index //LoRA embeddings start
added_vocab_end_index //LoRA embeddings end
*/
{
// Input validation
TORCH_CHECK(input.dim() >= 1, "input must have at least 1 dimension");
TORCH_CHECK(org_vocab_start_index >= 0, "org_vocab_start_index must be non-negative");
TORCH_CHECK(org_vocab_end_index >= org_vocab_start_index, "org_vocab_end_index must be greater than org_vocab_start_index");
TORCH_CHECK(num_org_vocab_padding >= 0, "num_org_vocab_padding must be non-negative");
TORCH_CHECK(added_vocab_start_index >= org_vocab_end_index, "added_vocab_start_index must be greater than org_vocab_end_index");
TORCH_CHECK(added_vocab_end_index >= added_vocab_start_index, "added_vocab_end_index must be greater than added_vocab_start_index");
// Get total number of elements
int64_t size = input.numel();
void advance_step_flashattn_ascendc(
int64_t num_seqs, int64_t num_queries, int64_t block_size,
at::Tensor& input_tokens,
at::Tensor& sampled_token_ids,
at::Tensor& input_positions,
at::Tensor& seq_lens,
at::Tensor& slot_mapping,
at::Tensor& block_tables
){
// Verify all tensors
verify_tensor("input_tokens", input_tokens, num_seqs, -1, at::kLong);
verify_tensor("sampled_token_ids", sampled_token_ids, num_queries, 1,at::kLong);
verify_tensor("input_positions", input_positions, num_seqs, -1, at::kLong);
verify_tensor("seq_lens", seq_lens, num_seqs, -1, at::kInt);
verify_tensor("slot_mapping", slot_mapping, num_seqs, -1, at::kInt);
verify_tensor("block_tables", block_tables, num_seqs, -1, at::kInt);
int64_t* input_tokens_ptr = input_tokens.data_ptr<int64_t>();
int64_t* sampled_token_ids_ptr = sampled_token_ids.data_ptr<int64_t>();
int64_t* input_positions_ptr = input_positions.data_ptr<int64_t>();
int32_t* seq_lens_ptr = seq_lens.data_ptr<int32_t>();
int32_t* slot_mapping_ptr = slot_mapping.data_ptr<int32_t>();
int32_t* block_tables_ptr = block_tables.data_ptr<int32_t>();
int32_t device_id;
aclrtGetDevice(&device_id);
auto npu_stream = c10_npu::getCurrentNPUStream(device_id);
aclrtStream stream = npu_stream.stream();
// aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
// Create output tensors
at::Tensor masked_input = at::empty_like(input);
at::Tensor mask = at::empty_like(input).to(at::kBool);
// Get data pointers
void *input_ptr = input.data_ptr();
void *masked_input_ptr = masked_input.data_ptr();
void *mask_ptr = mask.data_ptr();
// Get current stream
aclrtStream stream = c10_npu::getCurrentNPUStream().stream();
// Get scalar type
at::ScalarType scalar_type = input.scalar_type();
// Create and configure OpCommand
at_npu::native::OpCommand cmd;
cmd.Name("advance_step_flashattn_ascendc");
cmd.SetCustomHandler([stream, num_seqs, num_queries,
block_size, input_tokens_ptr, sampled_token_ids_ptr,
input_positions_ptr, seq_lens_ptr, slot_mapping_ptr,
block_tables_ptr, block_tables]() -> int {
launch_advance_step_flashattn(stream,
num_seqs,
num_queries,
block_size,
input_tokens_ptr,
sampled_token_ids_ptr,
input_positions_ptr,
seq_lens_ptr,
slot_mapping_ptr,
block_tables_ptr,
block_tables.stride(0));
cmd.Name("get_masked_input_and_mask");
cmd.SetCustomHandler([scalar_type, size, stream,
input_ptr, masked_input_ptr, mask_ptr,
org_vocab_start_index, org_vocab_end_index,
num_org_vocab_padding, added_vocab_start_index,
added_vocab_end_index]() -> int {
// Get platform info
fe::PlatFormInfos platform_infos;
int device_id = 0;
fe::PlatformInfoManager::GeInstance().GetRuntimePlatformInfosByDevice(device_id, platform_infos);
uint32_t aivNum = platform_infos.GetCoreNumByType("aiv");
uint32_t loop_cnt = (size + aivNum - 1) / aivNum;
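// loop_cnt is ceil(size / aivNum): the flattened input is split into this many chunks, which the kernel's grid-stride loop spreads across the AIV cores.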
// Call implementation
get_masked_input_and_mask_impl(
stream,
input_ptr,
masked_input_ptr,
mask_ptr,
org_vocab_start_index,
org_vocab_end_index,
num_org_vocab_padding,
added_vocab_start_index,
added_vocab_end_index,
size,
loop_cnt,
aivNum);
return 0;
});
cmd.Run();
return ;
return {masked_input, mask};
}
} // namespace vllm_ascend
@ -194,11 +219,15 @@ TORCH_LIBRARY_EXPAND(_C, ops)
" Tensor! key, int head_size,"
" Tensor cos_sin_cache, bool is_neox) -> (Tensor query, Tensor key)");
ops.impl("rotary_embedding", torch::kPrivateUse1, &vllm_ascend::rotary_embedding);
ops.def(
"advance_step_flashattn_ascendc(int num_seqs, int num_queries, int block_size,"
" Tensor! input_tokens, Tensor! sampled_token_ids, Tensor! input_positions,"
" Tensor! seq_lens, Tensor! slot_mapping, Tensor! block_tables) -> ()");
ops.impl("advance_step_flashattn_ascendc", torch::kPrivateUse1, &vllm_ascend::advance_step_flashattn_ascendc);
"get_masked_input_and_mask(Tensor input, "
" int org_vocab_start_index, "
" int org_vocab_end_index, "
" int num_org_vocab_padding, "
" int added_vocab_start_index, "
" int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)");
ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask);
}
REGISTER_EXTENSION(_C)
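For reference, the semantics implemented by `get_masked_input_and_mask` can be sketched in plain PyTorch, mirroring the Python reference linked in the comment above. This is an illustrative sketch of the math only, not the Ascend kernel itself, and the toy parameters in the usage example are made up for illustration; how the registered op is actually invoked (e.g. through `torch.ops._C`) should be verified against how the extension is loaded.

```python
import torch


def get_masked_input_and_mask_ref(input_ids: torch.Tensor,
                                  org_vocab_start_index: int,
                                  org_vocab_end_index: int,
                                  num_org_vocab_padding: int,
                                  added_vocab_start_index: int,
                                  added_vocab_end_index: int):
    # Tokens owned by this shard's base (org) range and LoRA-added range.
    org_mask = (input_ids >= org_vocab_start_index) & (input_ids < org_vocab_end_index)
    added_mask = (input_ids >= added_vocab_start_index) & (input_ids < added_vocab_end_index)
    # Added-vocab tokens land right after the padded base shard in the local table.
    added_offset = (added_vocab_start_index
                    - (org_vocab_end_index - org_vocab_start_index)
                    - num_org_vocab_padding)
    valid_offset = org_vocab_start_index * org_mask + added_offset * added_mask
    vocab_mask = org_mask | added_mask
    # Out-of-shard tokens are remapped to 0; the returned mask marks them as invalid.
    return vocab_mask * (input_ids - valid_offset), ~vocab_mask


# Toy parameters: base shard covers ids [0, 512), added shard covers [1010, 1018).
ids = torch.tensor([3, 500, 700, 1012])
masked, invalid = get_masked_input_and_mask_ref(
    ids, org_vocab_start_index=0, org_vocab_end_index=512,
    num_org_vocab_padding=0, added_vocab_start_index=1010,
    added_vocab_end_index=1018)
# masked -> [3, 500, 0, 514], invalid -> [False, False, True, False]
```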

View File

@ -1,2 +1,2 @@
pytest-asyncio
pytest-mock

Binary file not shown (image, 115 KiB).

View File

@ -7,6 +7,7 @@
| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
| Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
| Shoujian Zheng| [@jianzs](https://github.com/jianzs) | 2025/06 |
## Contributors
@ -16,6 +17,23 @@ Updated on 2025-06-10:
| Number | Contributor | Date | Commit ID |
|:------:|:-----------:|:-----:|:---------:|
| 83 | [@ZhengWG](https://github.com/) | 2025/7/7 | [3a469de](https://github.com/vllm-project/vllm-ascend/commit/9c886d0a1f0fc011692090b0395d734c83a469de) |
| 82 | [@wm901115nwpu](https://github.com/) | 2025/7/7 | [a2a47d4](https://github.com/vllm-project/vllm-ascend/commit/f08c4f15a27f0f27132f4ca7a0c226bf0a2a47d4) |
| 81 | [@Agonixiaoxiao](https://github.com/) | 2025/7/2 | [6f84576](https://github.com/vllm-project/vllm-ascend/commit/7fc1a984890bd930f670deedcb2dda3a46f84576) |
| 80 | [@zhanghw0354](https://github.com/zhanghw0354) | 2025/7/2 | [d3df9a5](https://github.com/vllm-project/vllm-ascend/commit/9fb3d558e5b57a3c97ee5e11b9f5dba6ad3df9a5) |
| 79 | [@GDzhu01](https://github.com/GDzhu01) | 2025/6/28 | [de256ac](https://github.com/vllm-project/vllm-ascend/commit/b308a7a25897b88d4a23a9e3d583f4ec6de256ac) |
| 78 | [@leo-pony](https://github.com/leo-pony) | 2025/6/26 | [3f2a5f2](https://github.com/vllm-project/vllm-ascend/commit/10253449120307e3b45f99d82218ba53e3f2a5f2) |
| 77 | [@zeshengzong](https://github.com/zeshengzong) | 2025/6/26 | [3ee25aa](https://github.com/vllm-project/vllm-ascend/commit/192dbbcc6e244a8471d3c00033dc637233ee25aa) |
| 76 | [@sharonyunyun](https://github.com/sharonyunyun) | 2025/6/25 | [2dd8666](https://github.com/vllm-project/vllm-ascend/commit/941269a6c5bbc79f6c1b6abd4680dc5802dd8666) |
| 75 | [@Pr0Wh1teGivee](https://github.com/Pr0Wh1teGivee) | 2025/6/25 | [c65dd40](https://github.com/vllm-project/vllm-ascend/commit/2fda60464c287fe456b4a2f27e63996edc65dd40) |
| 74 | [@xleoken](https://github.com/xleoken) | 2025/6/23 | [c604de0](https://github.com/vllm-project/vllm-ascend/commit/4447e53d7ad5edcda978ca6b0a3a26a73c604de0) |
| 73 | [@lyj-jjj](https://github.com/lyj-jjj) | 2025/6/23 | [5cbd74e](https://github.com/vllm-project/vllm-ascend/commit/5177bef87a21331dcca11159d3d1438075cbd74e) |
| 72 | [@farawayboat](https://github.com/farawayboat) | 2025/6/21 | [bc7d392](https://github.com/vllm-project/vllm-ascend/commit/097e7149f75c0806774bc68207f0f6270bc7d392) |
| 71 | [@yuancaoyaoHW](https://github.com/yuancaoyaoHW) | 2025/6/20 | [7aa0b94](https://github.com/vllm-project/vllm-ascend/commit/00ae250f3ced68317bc91c93dc1f1a0977aa0b94) |
| 70 | [@songshanhu07](https://github.com/songshanhu07) | 2025/6/18 | [5e1de1f](https://github.com/vllm-project/vllm-ascend/commit/2a70dbbdb8f55002de3313e17dfd595e1de1f) |
| 69 | [@wangyanhui-cmss](https://github.com/wangyanhui-cmss) | 2025/6/12| [40c9e88](https://github.com/vllm-project/vllm-ascend/commit/2a5fb4014b863cee6abc3009f5bc5340c9e88) |
| 68 | [@chenwaner](https://github.com/chenwaner) | 2025/6/11 | [c696169](https://github.com/vllm-project/vllm-ascend/commit/e46dc142bf1180453c64226d76854fc1ec696169) |
| 67 | [@yzim](https://github.com/yzim) | 2025/6/11 | [aaf701b](https://github.com/vllm-project/vllm-ascend/commit/4153a5091b698c2270d160409e7fee73baaf701b) |
| 66 | [@Yuxiao-Xu](https://github.com/Yuxiao-Xu) | 2025/6/9 | [6b853f1](https://github.com/vllm-project/vllm-ascend/commit/6b853f15fe69ba335d2745ebcf14a164d0bcc505) |
| 65 | [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU) | 2025/6/7 | [20dedba](https://github.com/vllm-project/vllm-ascend/commit/20dedba5d1fc84b7ae8b49f9ce3e3649389e2193) |
| 64 | [@zxdukki](https://github.com/zxdukki) | 2025/6/7 | [87ebaef](https://github.com/vllm-project/vllm-ascend/commit/87ebaef4e4e519988f27a6aa378f614642202ecf) |

View File

@ -0,0 +1,19 @@
# User Stories
Read case studies on how users and developers solve real, everyday problems with vLLM Ascend.
- [LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform for training and fine-tuning large language models. It has supported vLLM Ascend to speed up inference since [LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gaining a 2x inference performance improvement.
- [Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge library designed for post-training foundation models using advanced techniques like SFT, PPO and DPO, it uses vLLM Ascend since [v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to support RLHF on Ascend NPU.
- [MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference engine acceleration plug-in library developed by Huawei on Ascend hardware, which includes self-developed large language model optimization algorithms and optimizations related to the inference engine framework. It supports vLLM Ascend since [2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-turbo-0001.html).
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2), see more GPUStack performance evaluation info on [link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient and production-ready RL training library for large language models (LLMs), uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0), see more info on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
:::{toctree}
:caption: More details
:maxdepth: 1
llamafactory
:::

View File

@ -0,0 +1,19 @@
# LLaMA-Factory
**About / Introduction**
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.
**The Business Challenge**
LLaMA-Factory used transformers to perform inference on Ascend NPU, but the speed was slow.
**Solving Challenges and Benefits with vLLM Ascend**
With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the performance of LLaMA-Factory in the model inference stage has been significantly improved. According to the test results, the inference speed of LLaMA-Factory increased to 2x that of the Transformers version.
**Learn more**
See more about LLaMA-Factory and how it uses vLLM Ascend for inference on the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html).

View File

@ -22,6 +22,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | MindIE Turbo |
|-------------|--------------|------------------|-------------|--------------------|--------------|
| v0.9.2rc1 | v0.9.2 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250619 | |
| v0.9.1rc1 | v0.9.1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250528 | |
| v0.9.0rc2 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.9.0rc1 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
| v0.8.5rc1 | v0.8.5.post1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
@ -35,6 +37,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:
| Date | Event |
|------------|-------------------------------------------|
| 2025.07.11 | Release candidates, v0.9.2rc1 |
| 2025.06.22 | Release candidates, v0.9.1rc1 |
| 2025.06.10 | Release candidates, v0.9.0rc2 |
| 2025.06.09 | Release candidates, v0.9.0rc1 |
| 2025.05.29 | v0.7.x post release, v0.7.3.post1 |
@ -72,8 +76,8 @@ Usually, each minor version of vLLM (such as 0.7) will correspond to a vLLM Asce
| Branch | Status | Note |
|------------|--------------|--------------------------------------|
| main | Maintained | CI commitment for vLLM main branch and vLLM 0.9.2 branch |
| v0.9.1-dev | Maintained | CI commitment for vLLM 0.9.1 version |
| v0.7.3-dev | Maintained | CI commitment for vLLM 0.7.3 version |
| v0.7.1-dev | Unmaintained | Replaced by v0.7.3-dev |

View File

@ -23,6 +23,7 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import json
import os
# import sys
@ -64,17 +65,19 @@ myst_substitutions = {
# the branch of vllm, used in vllm clone
# - main branch: 'main'
# - vX.Y.Z branch: 'vX.Y.Z'
'vllm_version': 'v0.9.2',
# the branch of vllm-ascend, used in vllm-ascend clone and image tag
# - main branch: 'main'
# - vX.Y.Z branch: latest vllm-ascend release tag
'vllm_ascend_version': 'v0.9.2rc1',
# the newest release version of vllm-ascend and matched vLLM, used in pip install.
# This value should be updated when cut down release.
'pip_vllm_ascend_version': "0.9.2rc1",
'pip_vllm_version': "0.9.2",
# CANN image tag
'cann_image_tag': "8.1.rc1-910b-ubuntu22.04-py3.10",
# vllm version in ci
'ci_vllm_version': 'v0.9.2',
}
# Add any paths that contain templates here, relative to this directory.
@ -133,3 +136,7 @@ if READTHEDOCS_VERSION_TYPE == "tag":
def setup(app):
pass
if __name__ == "__main__":
print(json.dumps(myst_substitutions))

View File

@ -4,7 +4,7 @@
It's recommended to set up a local development environment to build and test
before you submit a PR.
### Setup development environment
Theoretically, the vllm-ascend build is only supported on Linux because the `vllm-ascend` dependency `torch_npu` only supports Linux.
@ -12,73 +12,64 @@ Theoretically, the vllm-ascend build is only supported on Linux because
But you can still set up a dev env on Linux/Windows/macOS for linting and basic tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
python3 -m venv .venv
source ./.venv/bin/activate
# Clone vllm code and install
git clone https://github.com/vllm-project/vllm.git
# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
# Install lint requirement and enable pre-commit hook
pip install -r requirements-lint.txt
# Run lint (the first time, you need to install the pre-commit deps, via a proxy network if necessary)
bash format.sh
```
#### Run CI locally
After completing the "Run lint" setup, you can run the CI locally:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# Running the CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE="empty" pip install .
cd ..
# Clone vllm-ascend and install
git clone https://github.com/vllm-project/vllm-ascend.git
# Install requirements
cd vllm-ascend
# install system requirement
apt install -y gcc g++ cmake libnuma-dev
# install project requirement
# For Linux:
pip install -r requirements-dev.txt
# For non Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
# Then you can run lint and mypy test
bash format.sh
# Run ci:
bash format.sh ci
```
# Build:
# - only supported on Linux (torch_npu available)
# pip install -e .
# - build without deps for debugging in other OS
# pip install -e . --no-deps
# - build without custom ops
# COMPILE_CUSTOM_KERNELS=0 pip install -e .
#### Submit the commit
```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```
🎉 Congratulations! You have completed the development environment setup.
### Test locally
Although the vllm-ascend CI provides integration tests on [Ascend](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml), you can also run them locally. The simplest way to run these integration tests locally is through a container:
```bash
# Under Ascend NPU environment
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export IMAGE=vllm-ascend-dev-image
export CONTAINER_NAME=vllm-ascend-dev
export DEVICE=/dev/davinci1
# The first build will take about 10 mins (10MB/s) to download the base image and packages
docker build -t $IMAGE -f ./Dockerfile .
# You can also specify the mirror repo via setting VLLM_REPO to speedup
# docker build -t $IMAGE -f ./Dockerfile . --build-arg VLLM_REPO=https://gitee.com/mirrors/vllm
docker run --rm --name $CONTAINER_NAME --network host --device $DEVICE \
--device /dev/davinci_manager --device /dev/devmm_svm \
--device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-ti $IMAGE bash
cd vllm-ascend
pip install -r requirements-dev.txt
pytest tests/
```
You can also refer to the [Testing](./testing.md) doc for help setting up the testing environment and running tests locally.
## DCO and Signed-off-by
@ -111,3 +102,10 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc and help other developers.
:::{toctree}
:caption: Index
:maxdepth: 1
testing
:::

View File

@ -0,0 +1,280 @@
# Testing
This section explains how to write e2e tests and unit tests to verify the implementation of your feature.
## Setup test environment
The fastest way to set up the test environment is to use the main branch container image:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
You can run the unit tests on CPU with the following steps:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# ls
# vllm vllm-ascend
# Use mirror to speedup download
# docker pull quay.nju.edu.cn/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
-v $(pwd):/vllm-project \
-v ~/.cache:/root/.cache \
-ti $IMAGE bash
# (Optional) Configure mirror to speedup download
sed -i 's|ports.ubuntu.com|mirrors.huaweicloud.com|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple/
# For torch-npu dev version or x86 machine
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
# Install vllm
cd /vllm-project/vllm
VLLM_TARGET_DEVICE=empty python3 -m pip -v install .
# Install vllm-ascend
cd /vllm-project/vllm-ascend
# [IMPORTANT] Export LD_LIBRARY_PATH to enumerate the CANN environment under CPU
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -v .
```
::::
::::{tab-item} Single card
:sync: single
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
::::{tab-item} Multi cards
:sync: multi
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:main
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
After starting the container, you should install the required packages:
```bash
cd /vllm-workspace/vllm-ascend/
# Prepare
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install required packages
pip install -r requirements-dev.txt
```
::::
:::::
## Running tests
### Unit test
There are several principles to follow when writing unit tests:
- The test file path should mirror the source file path and use the `test_` prefix, for example: `vllm_ascend/worker/worker_v1.py` --> `tests/ut/worker/test_worker_v1.py`
- The vLLM Ascend tests use the unittest framework; see [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests must be runnable on CPU, so you must mock device-related functions on the host (a minimal sketch follows the tab-set below).
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:selected:
:sync: cpu
```bash
# Run unit tests
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
```
::::
::::{tab-item} Single card
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all unit tests
pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
:::::
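As referenced above, here is a minimal sketch of a CPU-only unit test that mocks out a device-dependent dependency. `my_worker` and `get_npu_device_count` are hypothetical names used purely to illustrate the mocking pattern, not real vllm-ascend APIs.

```python
import unittest
from unittest import mock


def choose_parallel_size() -> int:
    # Imagine this helper lives in vllm_ascend and normally queries torch_npu.
    from my_worker import get_npu_device_count  # hypothetical device-side import
    return max(1, get_npu_device_count())


class TestChooseParallelSize(unittest.TestCase):

    def test_uses_all_visible_npus(self):
        # Replace the device-side module so the test runs on a CPU-only host.
        fake_worker = mock.MagicMock(get_npu_device_count=lambda: 4)
        with mock.patch.dict("sys.modules", {"my_worker": fake_worker}):
            self.assertEqual(choose_parallel_size(), 4)


if __name__ == "__main__":
    unittest.main()
```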
### E2E test
Although the vllm-ascend CI provides [e2e tests](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can also run them locally.
:::::{tab-set}
:sync-group: e2e
::::{tab-item} Local (CPU)
:sync: cpu
You can't run e2e test on CPU.
::::
::::{tab-item} Single card
:selected:
:sync: single
```bash
cd /vllm-workspace/vllm-ascend/
# Run all single card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all multi-card tests
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/
# Run a certain test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_batchsize.py
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```
::::
:::::
This will reproduce e2e test: [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml).
#### E2E test example:
- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
The CI resources are limited, so you might need to reduce the number of layers of the model. Below is an example of how to generate a reduced-layer model:
1. Fork the original model repo on ModelScope; we need all the files in the repo except the weights.
2. Set `num_hidden_layers` to the expected number of layers, e.g., `{"num_hidden_layers": 2,}`
3. Copy the following python script as `generate_random_weight.py`. Set the relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and `DIST_MODEL_PATH` as needed:
```python
import torch
from transformers import AutoTokenizer, AutoConfig
from modeling_deepseek import DeepseekV3ForCausalLM
from modelscope import snapshot_download
MODEL_LOCAL_PATH = "~/.cache/modelscope/models/vllm-ascend/DeepSeek-V3-Pruning"
DIST_DTYPE = torch.bfloat16
DIST_MODEL_PATH = "./random_deepseek_v3_with_2_hidden_layer"
config = AutoConfig.from_pretrained(MODEL_LOCAL_PATH, trust_remote_code=True)
model = DeepseekV3ForCausalLM(config)
model = model.to(DIST_DTYPE)
model.save_pretrained(DIST_MODEL_PATH)
```
### Run doctest
vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to run all doctests in the doc files.
The doctests are a good way to make sure the docs are up to date and the examples are executable; you can run them locally as follows:
```bash
# Run doctest
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh
```
This will reproduce the same environment as the CI: [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml).

View File

@ -1,17 +1,10 @@
# Evaluation
# Accuracy
:::{toctree}
:caption: Accuracy
:maxdepth: 1
using_lm_eval
using_opencompass
using_evalscope
accuracy_report/index
:::
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::

View File

@ -0,0 +1,9 @@
# Feature Guide
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
patch
:::

View File

@ -0,0 +1,82 @@
# Patch in vLLM Ascend
vLLM Ascend is a platform plugin for vLLM. Because vLLM and vLLM Ascend have different release cycles, and because of hardware limitations in some cases, we need to patch some code in vLLM to make it compatible with vLLM Ascend.
In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to address the change for vLLM.
## Principle
We should keep in mind that patching is not the best way to make vLLM Ascend compatible. It's just a temporary solution. The best way is to contribute the change to vLLM itself so that it is compatible with vLLM Ascend natively. In vLLM Ascend, we have the following basic principles for the patch strategy:
1. Less is more. Please do not patch unless it's currently the only way.
2. Once a patch is added, a future plan for removing it must be described.
3. Cleaning up patch code is welcome at any time.
## How it works
In `vllm_ascend/patch`, you can see the code structure as follows:
```
vllm_ascend
├── patch
│ ├── platform
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
│ ├── worker
│ │ ├── patch_0_9_2
│ │ ├── patch_common
│ │ ├── patch_main
└───────────
```
- **platform**: The patch code in this directory is for patching the code in vLLM main process. It's called by `vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when vLLM is initialized.
- For online mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing the cli args.
- For offline mode, vLLM process calls the platform patch here `vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when parsing the input parameters.
- **worker**: The patch code in this directory is for patching the code in vLLM worker process. It's called by `vllm_ascend/worker/worker_v1::NPUWorker::__init__` when the vLLM worker process is initialized.
- For both online and offline mode, vLLM engine core process calls the worker patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when initializing the worker process.
In both the **platform** and **worker** folders, there are several patch modules. They are used for patching different versions of vLLM (see the sketch after this list):
- `patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version is always the latest released version of vLLM. Once a new vLLM version is released, we will drop this patch module and bump to the new version.
- `patch_main`: This module is used for patching the code in the vLLM main branch.
- `patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM main branch.
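For illustration, version-based selection of these modules could look like the following sketch. The `vllm_version_is` helper name is an assumption used only to show the idea; check the actual `__init__.py` files for how the dispatch is really done.

```python
# Hypothetical sketch of how a platform __init__.py might pick the right patch set.
from vllm_ascend.utils import vllm_version_is  # assumed helper; adjust to the real one

if vllm_version_is("0.9.2"):
    import vllm_ascend.patch.platform.patch_0_9_2  # noqa: F401
else:
    import vllm_ascend.patch.platform.patch_main  # noqa: F401

# patch_common applies to both the released version and the main branch.
import vllm_ascend.patch.platform.patch_common  # noqa: F401
```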
## How to write a patch
Before writing a patch, following the principles above, we should patch as little code as possible. If necessary, we can patch the code in either the **platform** or the **worker** folder. Here is an example of patching the `distributed` module in vLLM.
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.2 and main of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm
def patch_destroy_model_parallel():
# your patch code
...
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
# How
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
# Future Plan:
# <Describe the future plan to remove the patch>
```
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)
## Limitation
1. In the V1 Engine, vLLM starts three kinds of processes: the main process, the EngineCore process and the Worker process. Currently vLLM Ascend only supports patching code in the main process and the Worker process by default. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running an edited vLLM code base, the version of vLLM may change automatically. For example, if you run an edited vLLM based on v0.9.n, the version may change to v0.9.nxxx; in this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish which vLLM version you're using. You can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, so that the patch for that version works.

View File

@ -0,0 +1,258 @@
# Adding a New Model
This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to
[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementation as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please checkout vllm's implementations for other kinds of model):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
- `*DecoderLayer` (transformer block)
- `*Attention` and `*MLP` (specific computation unit)
:::{note}
`*` denotes your model's unique identifier.
:::
3. Critical Implementation Details:
All modules must include a `prefix` argument in `__init__()`.
**Required interfaces:**
| Module Type | Required Methods |
| :------------------- | :---------------------------------------- |
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention Backend Integration:
Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor Parallelism:
Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.).
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
```python
from collections.abc import Iterable
from typing import Optional, Union
import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata
class CustomAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement attention logic
...
class CustomDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# Implement decoder layer
...
class CustomModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList([
CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
])
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
class CustomModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
...
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
...
def compute_logits(self,
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata) -> torch.Tensor:
...
def load_weights(self,
weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
...
```
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. Below we demonstrate an example that inherits from the base class and implements a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional, Union
import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors
class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
# Define merged weights for quantization/efficiency
packed_modules_mapping = {
"gate_up_proj": ["gate_proj", "up_proj"],
"experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: Optional[List[torch.Tensor]] = None,
attn_metadata: Optional[AttentionMetadata] = None,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
# Custom forward logic
hidden_states = self.model(
input_ids,
positions,
kv_caches,
attn_metadata,
intermediate_tensors,
inputs_embeds
)
return hidden_states
```
:::{note}
For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
To integrate your implemented model from `vllm_ascend/models/` directory:
1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
2. Register the model wrapper class via `vllm.ModelRegistry.register_model()` function.
**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):
```python
from vllm import ModelRegistry
def register_model():
from .custom_model import CustomModelForCausalLM # New custom model
from .deepseek_v2 import CustomDeepseekV2ForCausalLM # Customized DeepSeek
# For NEW architectures: Register with unique name
ModelRegistry.register_model(
"CustomModelForCausalLM", # Must match config.json's 'architectures'
"vllm_ascend.models.custom_model:CustomModelForCausalLM"
)
# For MODIFIED architectures: Use original name
ModelRegistry.register_model(
"DeepseekV2ForCausalLM", # Original architecture identifier in vLLM
"vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM "
)
```
:::{note}
The first argument of `vllm.ModelRegistry.register_model()` indicates the unique architecture identifier which must match `architectures` in `config.json` of the model.
```json
{
"architectures": [
"CustomModelForCausalLM"
],
}
```
:::
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architecture
If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), when executing vllm offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/model_executor/models/registry.py`.
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architecture
If you're registering a novel model architecture not present in vllm (creating a completely new class), the current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vllm.
## Step 4: Testing
After adding a new model, we should do basic functional test (offline/online inference), accuracy test and performance benchmark for the model.
Find more details at:
- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)
## Step 5: Updating Supported Models Doc
At last, if all the steps above are completed, you should add the new model into our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.

View File

@ -0,0 +1,3 @@
# Adding a New Multi-Modal Model
**_Coming soon ..._**

View File

@ -0,0 +1,10 @@
# Modeling
This section provides tutorials on how to implement and register a new model in vllm-ascend.
:::{toctree}
:caption: Modeling
:maxdepth: 1
adding_a_new_model
adding_a_new_multimodal_model
:::

View File

@ -0,0 +1,8 @@
# Performance
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark
profile_execute_duration
:::

View File

@ -9,6 +9,11 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages (a usage sketch follows the command below).
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
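For custom observation points, usage of the two APIs above could look roughly like the sketch below. It assumes `capture_async` acts as a context manager keyed by a stage tag and that `pop_captured_sync` returns the collected durations; the import path, the stage functions and the exact signatures are assumptions to verify against the `ProfileExecuteDuration` implementation.

```python
# Rough sketch only; the exact API surface may differ.
import time

from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path


def prepare_inputs():       # hypothetical stage stand-in
    time.sleep(0.01)


def run_model_forward():    # hypothetical stage stand-in
    time.sleep(0.02)


profiler = ProfileExecuteDuration()

with profiler.capture_async("prepare input"):
    prepare_inputs()
with profiler.capture_async("model forward"):
    run_model_forward()

# Later, at a convenient point (e.g. once per decoding step), drain and print.
durations = profiler.pop_captured_sync()
for tag, cost in durations.items():
    print(f"{tag}: {cost}")
```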
## Example Output
```

View File

@ -3,19 +3,19 @@
## Version Specific FAQs
- [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- [[v0.9.0rc2] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1115)
- [[v0.9.2rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1742)
## General FAQs
### 1. What devices are currently supported?
Currently, **ONLY** Atlas A2 series (Ascend-cann-kernels-910b) and Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
- Atlas 300I Inference series (Atlas 300I Duo)
Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
@ -35,7 +35,7 @@ docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
### 3. What models does vllm-ascend support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
### 4. How to get in touch with our community?
@ -48,7 +48,7 @@ There are many channels that you can communicate with our community developers /
### 5. What features does vllm-ascend V1 support?
Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
@ -69,7 +69,7 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 7. How does vllm-ascend perform?
Currently, only some models are well optimized, such as `Qwen2.5 VL`, `Qwen3` and `Deepseek V3`; others are not good enough yet. From 0.9.0rc2, Qwen and Deepseek work with graph mode to deliver good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up inference as well.
### 8. How does vllm-ascend work with vllm?
vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
@ -84,9 +84,9 @@ Currently, w8a8 quantization is already supported by vllm-ascend originally on v
### 11. How to run w8a8 DeepSeek model?
Please follow the [inferencing tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 12. There is no output in the log when loading models using vllm-ascend. How to solve it?
If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally yourself. Otherwise, please file an issue.
@ -94,9 +94,9 @@ If you're using vllm 0.7.3 version, this is a known progress bar display issue i
vllm-ascend is tested by functional test, performance test and accuracy test.
- **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit testson vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/suppoted_features.html) via e2e test
- **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit testson vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmark which can easily to re-route locally, we'll publish a perf website like [vllm](https://simon-mo-workspace.observablehq.cloud/vllm-dashboard-v0/perf) does to show the performance test results for each pull request
- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarks which can easily be reproduced locally; we'll publish a perf website to show the performance test results for each pull request
- **Accuracy test**: we're working on adding accuracy test to CI as well.
@ -114,7 +114,7 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
### 15. Failed to enable NPU graph mode when running DeepSeek?
### 16. Failed to enable NPU graph mode when running DeepSeek?
You may encounter the following error if running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the allowed number of queries per KV head is limited to {32, 64, 128}. **Thus DeepSeek-V2-Lite is not supported**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, num_heads / num_kv_heads is in {32, 64, 128}; a quick check is sketched after the error example below.
@ -123,3 +123,47 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
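As a rough pre-check before enabling graph mode, you can verify the ratio yourself. The head counts below are placeholders, not authoritative values; read them from your model's `config.json` (a sketch, not an official tool):
```bash
num_attention_heads=128   # e.g. read from your model's config.json
num_kv_heads=1            # commonly 1 for MLA-style attention; verify for your model
tp_size=4                 # your --tensor-parallel-size
ratio=$(( num_attention_heads / tp_size / num_kv_heads ))
case "$ratio" in
  32|64|128) echo "OK: $ratio queries per KV head is supported by NPU graph mode" ;;
  *)         echo "Unsupported: $ratio is not in {32, 64, 128}" ;;
esac
```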
### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
You may encounter a C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to install with `python setup.py install`, or to run `python setup.py clean` first to clear the build cache.
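A minimal sketch, assuming you are inside your local clone of the vllm-ascend source tree:
```bash
python setup.py clean     # clear cached build artifacts left by the previous install
python setup.py install   # rebuild and install from source
```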
### 18. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output determinism:
1. Sampling method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following environment variables; a quick verification sketch follows the code block below:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=1
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
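To verify the setup, one simple (unofficial) check is to save the greedy-sampling example above as `example.py`, run it twice with the variables exported, and compare the outputs:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=1
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
python example.py > run1.txt
python example.py > run2.txt
diff run1.txt run2.txt && echo "Outputs are identical across runs"
```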
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package to make sure all dependencies are met: `pip install qwen-omni-utils`.
This package pulls in `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing works correctly.
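For example (a minimal sketch; the import is only a sanity check that the audio dependency is now available):
```bash
pip install qwen-omni-utils
python -c "import librosa; print(librosa.__version__)"
```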

View File

@ -43,11 +43,9 @@ faqs
:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/suppoted_features
user_guide/supported_models
user_guide/env_vars
user_guide/additional_config
user_guide/graph_mode.md
user_guide/support_matrix/index
user_guide/configuration/index
user_guide/feature_guide/index
user_guide/release_notes
:::
@ -55,9 +53,11 @@ user_guide/release_notes
:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/contributing
developer_guide/versioning_policy
developer_guide/contribution/index
developer_guide/feature_guide/index
developer_guide/evaluation/index
developer_guide/performance/index
developer_guide/modeling/index
:::
% How to involve vLLM Ascend
@ -66,11 +66,6 @@ developer_guide/evaluation/index
:maxdepth: 1
community/governance
community/contributors
:::
% User stories about vLLM Ascend project
:::{toctree}
:caption: User Story
:maxdepth: 1
user_stories/index
community/versioning_policy
community/user_stories/index
:::

View File

@ -9,11 +9,11 @@ This document describes how to install vllm-ascend manually.
- A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
- Software:
| Software | Supported version | Note |
|-----------|-------------------|----------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1 | Required for vllm-ascend |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1.post1.dev20250619 | Required for vllm-ascend. No need to install manually; it will be installed automatically in the steps below |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
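As a quick, informal check of the Python-side requirements above (a sketch; the attribute names are the usual ones for these packages and may differ per build):
```bash
python -c "import torch, torch_npu; print('torch:', torch.__version__); print('torch-npu:', torch_npu.__version__)"
```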
You have two ways to install:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
@ -78,17 +78,17 @@ source vllm-ascend-env/bin/activate
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
# Download and install the CANN package.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-toolkit_8.1.RC1_linux-"$(uname -i)".run --full
source /usr/local/Ascend/ascend-toolkit/set_env.sh
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-kernels-910b_8.1.RC1_linux-"$(uname -i)".run --install
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.1.RC1/Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run
./Ascend-cann-nnal_8.1.RC1_linux-"$(uname -i)".run --install
@ -116,17 +116,23 @@ Once it's done, you can start to set up `vllm` and `vllm-ascend`.
:selected:
:sync: pip
First install system dependencies:
First install the system dependencies and configure the pip mirror:
```bash
apt update -y
apt install -y gcc g++ cmake libnuma-dev wget git
# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
**[Optional]** Config the extra-index of `pip` if you are working on a **x86** machine, so that the torch with cpu could be found:
**[Optional]** Then configure the extra index of `pip` if you are working on an x86 machine or using the torch-npu dev version:
```bash
pip config set global.extra-index-url https://download.pytorch.org/whl/cpu/
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
```
Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:
@ -245,7 +251,8 @@ for output in outputs:
Then run:
```bash
# export VLLM_USE_MODELSCOPE=true to speed up download if huggingface is not reachable.
# Try `export VLLM_USE_MODELSCOPE=true` and `pip install modelscope`
# to speed up download if huggingface is not reachable.
python example.py
```

View File

@ -32,6 +32,8 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
apt-get update -y && apt-get install -y curl
```
::::
@ -58,6 +60,8 @@ docker run --rm \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
# Install curl
yum update -y && yum install -y curl
```
::::
:::::

View File

@ -5,7 +5,12 @@
:maxdepth: 1
single_npu
single_npu_multimodal
single_npu_audio
single_npu_qwen3_embedding
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
multi_node
:::

View File

@ -1,11 +1,19 @@
# Multi-Node (DeepSeek)
# Multi-Node-DP (DeepSeek)
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single NPU. In such cases, the model can be distributed using tensor parallelism and pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on multinode**
Each DP rank is deployed as a separate “core engine” process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division lets attention layers be replicated across Data Parallel (DP) ranks so that they process different batches independently, while expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism (a group of size DP x TP), maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent: forward passes must be aligned, and the expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty “dummy” forward passes are performed in all ranks which don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
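As a quick illustration of this sizing, the deployment below uses 2 nodes with 8 NPUs each; the DP/TP values mirror the serve commands later in this guide and are this guide's example, not requirements:
```bash
DP=4   # --data-parallel-size (2 local DP ranks per node)
TP=4   # --tensor-parallel-size (4 NPUs per DP rank)
echo "NPUs required in total: $((DP * TP))"            # 16, i.e. 8 per node
echo "EP/TP group size per forward pass: $((DP * TP))"
```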
## Verify Multi-Node Communication Environment
@ -45,24 +53,20 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
## Run with docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3-w8a8` quantized model across both nodes.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
Below is the example container setup command, which should be executed on **all nodes** :
```shell
# Define the image and container name
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
@ -75,121 +79,120 @@ docker run --rm \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues. The `--num-gpus` parameter defines the number of NPUs to be used on each node.
Below are the commands for the head and worker nodes:
**Head node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
Before launching the inference server, ensure the environment variables required for multi-node communication are set
:::
Run the following scripts on the two nodes respectively.
**node0**
```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head --num-gpus=8
```
**Worker node**:
#!/bin/sh
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::
# local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
:::{tip}
Before starting the Ray cluster, set the `export ASCEND_PROCESS_LOG_PATH={plog_save_path}` environment variable on each node to redirect the Ascend plog, which helps in debugging issues during multi-node execution.
:::
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
## Start the Online Inference Service on multinode
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster. You only need to run the vllm command on one node.
To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--enforce-eager \
--distributed_executor_backend "ray" \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--disable-frontend-multiprocessing \
--port {port_num}
```
:::{note}
Pipeline parallelism currently requires AsyncLLMEngine, hence the `--disable-frontend-multiprocessing` is set.
:::
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
```shell
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--distributed_executor_backend "ray" \
--enforce-eager \
--tensor-parallel-size 16 \
--port {port_num}
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
# If you want to run the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--quantization ascend \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
:::{note}
If you're running DeepSeek V3/R1, please remove the `quantization_config` section from the `config.json` file, since it's not supported by vllm-ascend currently; one way to do this is shown after this note.
:::
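A hedged sketch of that edit, assuming `jq` is available in the container and using the model path from this guide:
```bash
# Back up config.json, then drop the quantization_config section.
cp /root/.cache/ds_v3/config.json /root/.cache/ds_v3/config.json.bak
jq 'del(.quantization_config)' /root/.cache/ds_v3/config.json.bak > /root/.cache/ds_v3/config.json
```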
**node1**
```shell
#!/bin/sh
nic_name="xxx"
local_ip="xxx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/ds_v3 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address { node0 ip } \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
![alt text](../assets/multi_node_dp.png)
Once your server is started, you can query the model with input prompts:
```shell
curl -X POST http://127.0.0.1:{prot_num}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
"prompt": "The future of AI is",
"max_tokens": 24
}'
curl http://{ node0 ip:8004 }/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/ds_v3",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-19 17:37:35 metrics.py:453 Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, NPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
--num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
```

View File

@ -0,0 +1,235 @@
# Multi-NPU (Pangu Pro MoE)
## Run vllm-ascend on Multi-NPU
Run container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve /path/to/pangu-pro-moe-model \
--tensor-parallel-size 4 \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts:
:::::{tab-set}
::::{tab-item} v1/completions
```{code-block} bash
:substitutions:
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} v1/chat/completions
```{code-block} bash
:substitutions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": ""},
{"role": "user", "content": "你是谁?"}
],
"max_tokens": "64",
"top_p": "0.95",
"top_k": "50",
"temperature": "0.6",
"add_special_tokens" : true
}'
```
::::
:::::
If you run this successfully, you can see the info shown below:
```json
{"id":"cmpl-2cd4223228ab4be9a91f65b882e65b32","object":"text_completion","created":1751255067,"model":"/root/.cache/pangu-pro-moe-model","choices":[{"index":0,"text":" [unused16] 好的用户问我是谁我需要根据之前的设定来回答。用户提到我是华为开发的“盘古Reasoner”属于盘古大模型系列作为智能助手帮助解答问题和提供 信息支持。现在用户再次询问,可能是在确认我的身份或者测试我的回答是否一致。\n\n首先我要确保","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":15,"total_tokens":79,"completion_tokens":64,"prompt_tokens_details":null},"kv_transfer_params":null}
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
additional_config={
'torchair_graph_config': {
'enabled': True,
},
'ascend_scheduler_config':{
'enabled': True,
'enable_chunked_prefill' : False,
'chunked_prefill_enabled': False
},
})
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/path/to/pangu-pro-moe-model",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

View File

@ -1,6 +1,6 @@
# Multi-NPU (QwQ 32B W8A8)
## Run docker container:
## Run docker container
:::{note}
w8a8 quantization feature is supported by v0.8.4rc2 or higher
:::

View File

@ -0,0 +1,109 @@
# Multi-NPU (Qwen3-30B-A3B)
## Run vllm-ascend on Multi-NPU with Qwen3 MoE
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
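If your cards have 64GB of memory, a two-card launch should also work; the following is a sketch derived from the command above, not a separately validated configuration:
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 2 --enable_expert_parallel
```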
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-30B-A3B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 4096
}'
```
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/Qwen3-30B-A3B",
tensor_parallel_size=4,
distributed_executor_backend="mp",
max_model_len=4096,
enable_expert_parallel=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lucy. I'm from the UK and I'm 11 years old."
Prompt: 'The future of AI is', Generated text: ' a topic that has captured the imagination of scientists, philosophers, and the general public'
```

View File

@ -0,0 +1,330 @@
# Single Node (Atlas 300I series)
```{note}
Support for the Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage and performance.
```
## Run vLLM on Atlas 300I series
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-310p
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference on NPU
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen3-0.6B \
--tensor-parallel-size 1 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen/Qwen2.5-7B-Instruct
:sync: qwen7b
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-7B-Instruct \
--tensor-parallel-size 2 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve /home/pangu-pro-moe-mode/ \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--dtype "float16" \
--trust-remote-code \
--enforce-eager
```
Once your server is started, you can query the model with input prompts
```bash
export question="你是谁?"
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
::::
:::::
If you run this script successfully, you can see the results.
### Offline Inference
Run the following script (`example.py`) to execute offline inference on NPU:
:::::{tab-set}
:sync-group: inference
::::{tab-item} Qwen3-0.6B
:selected:
:sync: qwen0.6
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen3-0.6B",
tensor_parallel_size=1,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Qwen2.5-7B-Instruct
:sync: qwen7b
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
prompts = [
"Hello, my name is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=2,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
Download the model:
```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
```{code-block} python
:substitutions:
import gc
from transformers import AutoTokenizer
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
destroy_model_parallel)
def clean_up():
destroy_model_parallel()
destroy_distributed_environment()
gc.collect()
torch.npu.empty_cache()
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained("/home/pangu-pro-moe-mode/", trust_remote_code=True)
tests = [
"Hello, my name is",
"The future of AI is",
]
prompts = []
for text in tests:
messages = [
{"role": "system", "content": ""}, # Optionally customize system content
{"role": "user", "content": text}
]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # Using the official chat template is recommended
prompts.append(prompt)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/home/pangu-pro-moe-mode/",
tensor_parallel_size=8,
distributed_executor_backend="mp",
enable_expert_parallel=True,
dtype="float16",
max_model_len=1024,
trust_remote_code=True,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
del llm
clean_up()
```
::::
:::::
Run script:
```bash
python example.py
```
If you run this script successfully, you can see the info shown below:
```bash
Prompt: 'Hello, my name is', Generated text: " Lina. I'm a 22-year-old student from China. I'm interested in studying in the US. I'm looking for a job in the US. I want to know if there are any opportunities in the US for me to work. I'm also interested in the culture and lifestyle in the US. I want to know if there are any opportunities for me to work in the US. I'm also interested in the culture and lifestyle in the US. I'm interested in the culture"
Prompt: 'The future of AI is', Generated text: " not just about the technology itself, but about how we use it to solve real-world problems. As AI continues to evolve, it's important to consider the ethical implications of its use. AI has the potential to bring about significant changes in society, but it also has the power to create new challenges. Therefore, it's crucial to develop a comprehensive approach to AI that takes into account both the benefits and the risks associated with its use. This includes addressing issues such as bias, privacy, and accountability."
```

View File

@ -42,7 +42,12 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Run the following script to execute offline inference on a single NPU:
```python
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
@ -50,7 +55,10 @@ prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen/Qwen3-8B", max_model_len=26240)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
@ -58,6 +66,34 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import os
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="Qwen/Qwen3-8B",
max_model_len=26240,
enforce_eager=True
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
If you run this script successfully, you can see the info shown below:
@ -70,9 +106,11 @@ Prompt: 'The future of AI is', Generated text: ' following you. As the technolog
Run docker container to start the vLLM server on a single NPU:
:::::{tab-set}
::::{tab-item} Graph Mode
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
@ -93,6 +131,33 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::
:::{note}
Add the `--max_model_len` option to avoid a ValueError caused by the model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This limit differs across NPU series based on the HBM size; please modify the value to one suitable for your NPU series.

View File

@ -0,0 +1,122 @@
# Single NPU (Qwen2-Audio 7B)
## Run vllm-ascend on Single NPU
### Offline Inference on Single NPU
Run docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
Install packages required for audio processing:
```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```
Run the following script to execute offline inference on a single NPU:
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.utils import FlexibleArgumentParser
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
1: "What is recited in the audio?",
2: "What sport and what nursery rhyme are referenced?"
}
def prepare_inputs(audio_count: int):
audio_in_prompt = "".join([
f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
for idx in range(audio_count)
])
question = question_per_audio_count[audio_count]
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n")
mm_data = {
"audio":
[asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
}
# Merge text prompt and audio data into inputs
inputs = {"prompt": prompt, "multi_modal_data": mm_data}
return inputs
def main(audio_count: int):
# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
# lower-end GPUs.
# Unless specified, these settings have been tested to work on a single L4.
# `limit_mm_per_prompt`: the max num items for each modality per prompt.
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count},
enforce_eager=True)
inputs = prepare_inputs(audio_count)
sampling_params = SamplingParams(temperature=0.2,
max_tokens=64,
stop_token_ids=None)
outputs = llm.generate(inputs, sampling_params=sampling_params)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
if __name__ == "__main__":
audio_count = 2
main(audio_count)
```
If you run this script successfully, you can see the info shown below:
```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```
### Online Serving on Single NPU
Currently, vllm's OpenAI-compatible server doesn't support audio inputs, find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

View File

@ -57,6 +57,7 @@ llm = LLM(
model=MODEL_PATH,
max_model_len=16384,
limit_mm_per_prompt={"image": 10},
enforce_eager=True,
)
sampling_params = SamplingParams(
@ -103,13 +104,11 @@ outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
If you run this script successfully, you can see the info shown below:
```bash
Processed prompts: 100%|███████████████| 1/1 [00:11<00:00, 11.29s/it, est. speed input: 9.48 toks/s, output: 20.55 toks/s]
The image displays a logo consisting of two main elements: a stylized geometric design and a pair of text elements.
1. **Geometric Design**: On the left side of the image, there is a blue geometric design that appears to be made up of interconnected shapes. These shapes resemble a network or a complex polygonal structure, possibly hinting at a technological or interconnected theme. The design is monochromatic and uses only blue as its color, which could be indicative of a specific brand or company.
@ -144,7 +143,11 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --dtype bfloat16 --max_model_len 16384 --max-num-batched-tokens 16384
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--enforce-eager
```
:::{note}

View File

@ -0,0 +1,99 @@
# Single NPU (Qwen3-Embedding-8B)
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Run docker container
Take Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Setup environment variables:
```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
### Online Inference
```bash
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```
Once your server is started, you can query the model with input prompts
```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Embedding-8B",
  "input": "Hello"
}'
```
### Offline Inference
```python
import torch
import vllm
from vllm import LLM
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery:{query}'
if __name__=="__main__":
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'What is the capital of China?'),
get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
model = LLM(model="Qwen/Qwen3-Embedding-8B",
task="embed",
distributed_executor_backend="mp")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
```
If you run this script successfully, you can see the info shown below:
```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
```

View File

@ -1,6 +1,6 @@
# Additional Configuration
addintional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by their own. vLLM Ascend uses this mechanism to make the project more flexible.
Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior on their own. vLLM Ascend uses this mechanism to make the project more flexible.
## How to use
@ -28,9 +28,11 @@ The following table lists the additional configuration options available in vLLM
|-------------------------------| ---- |------|-----------------------------------------------------------------------------------------------|
| `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf case. |
| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used in RLHF or UT/e2e test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
The details of each config option are as follows:
@ -38,12 +40,14 @@ The details of each config option are as follows:
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable torchair graph mode |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported to use torchair graph mode |
| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effects on models using MLA (e.g., DeepSeek). |
| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on DeepSeek moe models. |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
| `use_cached_graph` | bool | `False` | Whether to use cached graph |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
| `enable_multistream_shared_expert`| bool | `False` | Whether to enable multistream shared expert |
| `enable_kv_nz`| bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
**ascend_scheduler_config**
@ -51,26 +55,27 @@ The details of each config option are as follows:
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine|
ascend_scheduler_config also support the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `chunked_prefill_enabled: true` to ascend_scheduler_config as well.
ascend_scheduler_config also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
### Example
A full example of additional configuration is as follows:
An example of additional configuration is as follows:
```
{
"torchair_graph_config": {
"enabled": true,
"use_cached_graph": true,
"enabled": True,
"use_cached_graph": True,
"graph_batch_sizes": [1, 2, 4, 8],
"graph_batch_sizes_init": false,
"enable_multistream_shared_expert": false
"graph_batch_sizes_init": False,
"enable_multistream_moe": False,
"enable_kv_nz": False
},
"ascend_scheduler_config": {
"enabled": true,
"chunked_prefill_enabled": true,
"enabled": True,
"enable_chunked_prefill": True,
},
"expert_tensor_parallel_size": 1,
"refresh": false,
"refresh": False,
}
```

View File

@ -2,7 +2,7 @@
vllm-ascend uses the following environment variables to configure the system:
:::{literalinclude} ../../../vllm_ascend/envs.py
:::{literalinclude} ../../../../vllm_ascend/envs.py
:language: python
:start-after: begin-env-vars-definition
:end-before: end-env-vars-definition

View File

@ -0,0 +1,10 @@
# Configuration Guide
This section provides a detailed configuration guide of vLLM Ascend.
:::{toctree}
:caption: Configuration Guide
:maxdepth: 1
env_vars
additional_config
:::

View File

@ -1,17 +1,18 @@
# Graph Mode Guide
```{note}
This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.
```
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on V1 Engine. And only Qwen, DeepSeek series models are well tested in 0.9.0rc1. We'll make it stable and generalize in the next release.
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on the V1 Engine, and only Qwen and DeepSeek series models are well tested since 0.9.0rc1. We'll make it stable and more general in the next release.
## Getting Started
From v0.9.0rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.0rc1, only Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.0rc1, only DeepSeek series models are supported.
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.
## Using ACLGraph
ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.
@ -23,8 +24,6 @@ import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```
@ -45,19 +44,18 @@ offline example:
import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enable": True}})
# TorchAirGraph only works without chunked prefill now
model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
outputs = model.generate("Hello, how are you?")
```
online example:
```shell
vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enable": true}}'
vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
```
You can find more detail about additional config [here](./additional_config.md)
You can find more detail about additional config [here](../configuration/additional_config.md).
## Fallback to Eager Mode
@ -69,8 +67,6 @@ offline example:
import os
from vllm import LLM
os.environ["VLLM_USE_V1"] = 1
model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```

Binary file not shown (image, 57 KiB)

View File

@ -0,0 +1,13 @@
# Feature Guide
This section provides a detailed usage guide of vLLM Ascend features.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
sleep_mode
structured_output
lora
:::

View File

@ -0,0 +1,8 @@
# LoRA Adapters Guide
Like vLLM, vllm-ascend supports LoRA. The usage and more details can be found in the [vLLM official document](https://docs.vllm.ai/en/latest/features/lora.html).
You can also refer to [this](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
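As a quick start, the snippet below sketches offline inference with a LoRA adapter using the standard vLLM LoRA API; the base model and the adapter path are placeholders:
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch of the standard vLLM LoRA API; model name and adapter path are placeholders.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

outputs = llm.generate(
    "Give me a short introduction to large language models.",
    sampling_params,
    # LoRARequest(name, id, local path of the adapter weights)
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```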
## Tips
If you fail to run vllm-ascend with LoRA, you may follow [this instruction](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html#fallback-to-eager-mode) to disable graph mode and try again.

View File

@ -0,0 +1,106 @@
# Quantization Guide
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.
Since version 0.9.0rc2, quantization is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.
## Install modelslim
To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master branch is available for vLLM Ascend.
Install modelslim:
```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
## Quantize model
Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: just download the model and then execute the convert command shown below. More info can be found in the modelslim doc: [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).
```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```
:::{note}
You can also download the quantized model that we uploaded. Please note that these weights should be used for testing only. For example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
:::
Once the conversion is done, two important files are generated.
1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weight info is recorded in this file.
Here are the full converted model files:
```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```
## Run the model
Now, you can run the quantized models with vLLM Ascend. Here are examples for offline and online inference.
### Offline inference
```python
import torch
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="{quantized_model_save_path}",
max_model_len=2048,
trust_remote_code=True,
# Enable quantization by specifying `quantization="ascend"`
quantization="ascend")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online inference
```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
## FAQs
### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
First, make sure you specify the `ascend` quantization method. Second, check whether your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` modelslim version. Finally, if it still doesn't work, please submit an issue; maybe some new models need to be adapted.
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim; this version has fixed the missing `configuration_deepseek.py` error.

View File

@ -0,0 +1,115 @@
# Sleep Mode Guide
## Overview
Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
## Getting started
With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM goes through a dedicated memory pool. While loading the model and initializing the KV caches, the memory is tagged as a map: `{"weight": data, "kv_cache": data}`.
The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
- Level 1 Sleep
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.
- Level 2 Sleep
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.
Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you need to follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build vLLM Ascend from source to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
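A minimal sketch of such a source build with custom kernels explicitly enabled; the clone URL is the vllm-ascend repository, and the `pip install -e .` step is assumed to match the installation guide:
```bash
# Only needed on v0.7.3; on v0.9.x+ this is already the default for source builds
export COMPILE_CUSTOM_KERNELS=1

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```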
## Usage
The following is a simple example of how to use sleep mode.
- offline inference:
```python
import os
import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
if __name__ == "__main__":
prompt = "How are you?"
free, total = torch.npu.mem_get_info()
print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
# record npu memory use baseline in case other process is running
used_bytes_baseline = total - free
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(temperature=0, max_tokens=10)
output = llm.generate(prompt, sampling_params)
llm.sleep(level=1)
free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
# now the memory usage should be less than the model weights
# (0.5B model, 1GiB weights)
assert used_bytes < 1 * GiB_bytes
llm.wake_up()
output2 = llm.generate(prompt, sampling_params)
# cmp output
assert output[0].outputs[0].text == output2[0].outputs[0].text
```
- online serving:
:::{note}
Considering there may be a risk of malicious access, please make sure you are in a development environment and explicitly set the development env variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::
```bash
export VLLM_SERVER_DEV_MODE="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE="True"
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
# after serving is up, post to these endpoints
# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
-H "Content-Type: application/json" \
-d '{"level": "1"}'
curl -X GET http://127.0.0.1:8000/is_sleeping
# sleep level 2
curl -X POST http://127.0.0.1:8000/sleep \
-H "Content-Type: application/json" \
-d '{"level": "2"}'
# wake up
curl -X POST http://127.0.0.1:8000/wake_up
# wake up with tag, tags must be in ["weights", "kv_cache"]
curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"
curl -X GET http://127.0.0.1:8000/is_sleeping
# after sleep and wake up, the serving is still available
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```

View File

@ -0,0 +1,163 @@
# Structured Output Guide
## Overview
### What is Structured Output?
LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON - without guidance, it might produce valid text that breaks JSON specification. **Structured Output (also called Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model's output, ensuring compliance with the desired structure.
![structured decoding](./images/structured_output_1.png)
### Structured Output in vllm-ascend
Currently, vllm-ascend supports the **xgrammar** and **guidance** backends for structured output with the vllm v1 engine.
XGrammar introduces a new technique for batched constrained decoding via pushdown automata (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing us to execute multiple state transitions. XGrammar also includes additional optimisations (for those who are interested) to reduce grammar compilation overhead. You can also look into the guidance backend for more details on it.
## How to Use Structured Output?
### Online Inference
You can generate structured outputs using OpenAI's Completions and Chat API. The following parameters are supported, which must be added as extra parameters:
- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context free grammar.
Structured outputs are supported by default in the OpenAI-Compatible Server. You can choose to specify the backend to use by setting the `--guided-decoding-backend` flag of `vllm serve`. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
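For instance, to pin the backend instead of relying on `auto`, the server can be started like this (a sketch; the model name simply matches the examples below):
```bash
vllm serve Qwen/Qwen2.5-3B-Instruct --guided-decoding-backend xgrammar
```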
Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
The next example shows how to use the guided_regex. The idea is to generate an email address, given a simple regex template:
```python
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
}
],
extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:
- Using a JSON Schema.
- Defining a Pydantic model and then extracting the JSON Schema from it.
The next example shows how to use the guided_json parameter with a Pydantic model:
```python
from pydantic import BaseModel
from enum import Enum
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
}
],
extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```
Finally we have the `guided_grammar` option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can use it to define a specific format of simplified SQL queries:
```python
simplified_sql_grammar = """
root ::= select_statement
select_statement ::= "SELECT " column " from " table " where " condition
column ::= "col_1 " | "col_2 "
table ::= "table_1 " | "table_2 "
condition ::= column "= " number
number ::= "1 " | "2 "
"""
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```
Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).
### Offline Inference
To use Structured Output, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
- json
- regex
- choice
- grammar
One example for the usage of the choice parameter is shown below:
```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
guided_decoding_backend="xgrammar")
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
prompts="Classify this sentiment: vLLM is wonderful!",
sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
Find more examples of other usages [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).

View File

@ -1,13 +0,0 @@
## {version}
### Highlights
- {feature}
### Bug fixes
- {bug}
### Other changes
- {change}
### Known issues
- {issue}
### Upgrade Notes
- {upgrade}
### Deprecation Notes
- {deprecation}

View File

@ -1,5 +1,68 @@
# Release note
## v0.9.2rc1 - 2025.07.11
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` anymore. This is also the last version to support the V0 engine; V0 code will be cleaned up in the future.
### Highlights
- Pooling models work with the V1 engine now. You can give it a try with the Qwen3 embedding model [#1359](https://github.com/vllm-project/vllm-ascend/pull/1359).
- The performance on Atlas 300I series has been improved. [#1591](https://github.com/vllm-project/vllm-ascend/pull/1591)
- aclgraph mode works with Moe models now. Currently, only Qwen3 Moe is well tested. [#1381](https://github.com/vllm-project/vllm-ascend/pull/1381)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250619`. Don't forget to update it in your environment. [#1347](https://github.com/vllm-project/vllm-ascend/pull/1347)
- The **GatherV3** error has been fixed with **aclgraph** mode. [#1416](https://github.com/vllm-project/vllm-ascend/pull/1416)
- W8A8 quantization works on Atlas 300I series now. [#1560](https://github.com/vllm-project/vllm-ascend/pull/1560)
- Fix the accuracy problem when deploying models with parallel parameters. [#1678](https://github.com/vllm-project/vllm-ascend/pull/1678)
- The pre-built wheel package now requires a lower version of glibc. Users can install it with `pip install vllm-ascend` directly. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
### Other
- The official doc has been updated for a better reading experience. For example, more deployment tutorials were added and the user/developer docs were updated. More guides are coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
- A new env variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of topk-topp sampling. The default value is `0`; we'll consider enabling it by default in the future. [#1732](https://github.com/vllm-project/vllm-ascend/pull/1732)
- A batch of bugs have been fixed for Data Parallelism case [#1273](https://github.com/vllm-project/vllm-ascend/pull/1273) [#1322](https://github.com/vllm-project/vllm-ascend/pull/1322) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1478](https://github.com/vllm-project/vllm-ascend/pull/1478)
- The DeepSeek performance has been improved. [#1194](https://github.com/vllm-project/vllm-ascend/pull/1194) [#1395](https://github.com/vllm-project/vllm-ascend/pull/1395) [#1380](https://github.com/vllm-project/vllm-ascend/pull/1380)
- Ascend scheduler works with prefix cache now. [#1446](https://github.com/vllm-project/vllm-ascend/pull/1446)
- DeepSeek works with prefix cache now. [#1498](https://github.com/vllm-project/vllm-ascend/pull/1498)
- Support prompt logprobs to recover ceval accuracy in V1 [#1483](https://github.com/vllm-project/vllm-ascend/pull/1483)
## v0.9.1rc1 - 2025.06.22
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.
### Highlights
- Atlas 300I series is experimentally supported in this release. [#1333](https://github.com/vllm-project/vllm-ascend/pull/1333) After careful consideration, this feature **will NOT be included in the v0.9.1-dev branch**, taking into account the v0.9.1 release quality and the rapid iteration of this feature to improve performance on the Atlas 300I series. We will improve this from 0.9.2rc1 onward.
- Support EAGLE-3 for speculative decoding. [#1032](https://github.com/vllm-project/vllm-ascend/pull/1032)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don't forget to update it in your environment. [#1235](https://github.com/vllm-project/vllm-ascend/pull/1235)
- Support Atlas 300I series container image. You can get it from [quay.io](https://quay.io/repository/vllm/vllm-ascend)
- Fix token-wise padding mechanism to make multi-card graph mode work. [#1300](https://github.com/vllm-project/vllm-ascend/pull/1300)
- Upgrade vllm to 0.9.1 [#1165](https://github.com/vllm-project/vllm-ascend/pull/1165)
### Other Improvements
- Initial support of Chunked Prefill for MLA. [#1172](https://github.com/vllm-project/vllm-ascend/pull/1172)
- An example of best practices to run DeepSeek with ETP has been added. [#1101](https://github.com/vllm-project/vllm-ascend/pull/1101)
- Performance improvements for DeepSeek using the TorchAir graph. [#1098](https://github.com/vllm-project/vllm-ascend/pull/1098), [#1131](https://github.com/vllm-project/vllm-ascend/pull/1131)
- Supports the speculative decoding feature with AscendScheduler. [#943](https://github.com/vllm-project/vllm-ascend/pull/943)
- Improve `VocabParallelEmbedding` custom op performance. It will be enabled in the next release. [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- Fixed a device discovery and setup bug when running vLLM Ascend on Ray [#884](https://github.com/vllm-project/vllm-ascend/pull/884)
- DeepSeek with [MC2](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/developmentguide/opdevg/ascendcbestP/atlas_ascendc_best_practices_10_0043.html) (Merged Compute and Communication) now works properly. [#1268](https://github.com/vllm-project/vllm-ascend/pull/1268)
- Fixed log2phy NoneType bug with static EPLB feature. [#1186](https://github.com/vllm-project/vllm-ascend/pull/1186)
- Improved performance for DeepSeek with DBO enabled. [#997](https://github.com/vllm-project/vllm-ascend/pull/997), [#1135](https://github.com/vllm-project/vllm-ascend/pull/1135)
- Refactoring AscendFusedMoE [#1229](https://github.com/vllm-project/vllm-ascend/pull/1229)
- Add initial user stories page (include LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack) [#1224](https://github.com/vllm-project/vllm-ascend/pull/1224)
- Add unit test framework [#1201](https://github.com/vllm-project/vllm-ascend/pull/1201)
### Known Issues
- In some cases, the vLLM process may crash with a **GatherV3** error when **aclgraph** is enabled. We are working on this issue and will fix it in the next release. [#1038](https://github.com/vllm-project/vllm-ascend/issues/1038)
- The prefix cache feature does not work with the Ascend Scheduler when chunked prefill is disabled. This will be fixed in the next release. [#1350](https://github.com/vllm-project/vllm-ascend/issues/1350)
### Full Changelog
https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1
## v0.9.0rc2 - 2025.06.10
This release contains some quick fixes for v0.9.0rc1. Please use this release instead of v0.9.0rc1.
@ -14,14 +77,14 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
### Highlights
- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
- Qwen series models work with graph mode now. It works by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
### Core
- The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. [#814](https://github.com/vllm-project/vllm-ascend/pull/814)
- LoRA, Multi-LoRA and dynamic serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. [#893](https://github.com/vllm-project/vllm-ascend/pull/893)
- prefix cache and chunked prefill feature works now [#782](https://github.com/vllm-project/vllm-ascend/pull/782) [#844](https://github.com/vllm-project/vllm-ascend/pull/844)
- Prefix cache and chunked prefill feature works now [#782](https://github.com/vllm-project/vllm-ascend/pull/782) [#844](https://github.com/vllm-project/vllm-ascend/pull/844)
- Spec decode and MTP features work with V1 Engine now. [#874](https://github.com/vllm-project/vllm-ascend/pull/874) [#890](https://github.com/vllm-project/vllm-ascend/pull/890)
- DP feature works with DeepSeek now. [#1012](https://github.com/vllm-project/vllm-ascend/pull/1012)
- Input embedding feature works with V0 Engine now. [#916](https://github.com/vllm-project/vllm-ascend/pull/916)
@ -95,7 +158,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
## v0.8.5rc1 - 2025.05.06
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend [here](https://vllm-ascend.readthedocs.io/en/latest/user_guide/suppoted_features.html).
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend [here](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
### Highlights
- Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (`--enable_prefix_caching`) when V1 is enabled [#747](https://github.com/vllm-project/vllm-ascend/pull/747)
@ -138,7 +201,7 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
## v0.8.4rc1 - 2025.04.18
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the [official documentation](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/versioning_policy.html#release-window).
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the [official documentation](https://vllm-ascend.readthedocs.io/en/latest/community/versioning_policy.html#release-window).
### Highlights

View File

@ -0,0 +1,10 @@
# Features and models
This section provides a detailed support matrix for vLLM Ascend.
:::{toctree}
:caption: Support Matrix
:maxdepth: 1
supported_models
supported_features
:::

View File

@ -1,4 +1,4 @@
# Supported Models
# Model Support
## Text-only Language Models

View File

@ -1,15 +0,0 @@
# xxx project uses Ascend vLLM, gaining a 200% inference performance enhancement.
## About / Introduction
Draft content
## The Business Challenge
Our goal is to ...
## Solving challenges with vLLM Ascend
vLLM Ascend helped us ...
## Benefits using vLLM Ascend
## Learn more
more info about this case

Some files were not shown because too many files have changed in this diff.