Commit Graph

81 Commits

Author SHA1 Message Date
quanliu 92183b41f3
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (#18957)
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
2025-06-15 21:56:37 -07:00
Ning Xie 2f1c19b245
[CI] change spell checker from codespell to typos (#18711)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-11 19:57:10 -07:00
Simon Mo 02f0c7b220
[Misc] Add SPDX-FileCopyrightText (#19100)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-06-03 11:20:17 -07:00
Andrew Sansom cc2a77d7f1
[Core] [Bugfix] Add Input Embeddings (#15428)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com>
Co-authored-by: Bryce1010 <bryceyx@gmail.com>
Co-authored-by: Nan2018 <nan@protopia.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-02 01:06:39 -07:00
Aleksandr Malyshev bc7c4d206b
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 (#13305)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com>
Signed-off-by: maleksan85 <maleksan@amd.com>
Signed-off-by: <>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: qli88 <qiang.li2@amd.com>
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com>
2025-04-22 19:11:56 -07:00
Chen Zhang 3a5f0afcd2
[V1] Implement sliding window attention in kv_cache_manager (#14097)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-01 00:33:17 -07:00
Robert Shaw d4d93db2c5
[V1] V1 Enablement Oracle (#13726)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-03-14 22:02:20 -07:00
Vincent a4f1ee35d6
Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 (#13997)
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-05 20:22:43 +00:00
Harry Mellor cf069aa8aa
Update deprecated Python 3.8 typing (#13971) 2025-03-02 17:34:51 -08:00
Sage Moore 38acae6e97
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-27 20:31:47 +00:00
Jee Jee Li 5157338ed9
[Misc] Improve LoRA spelling (#13831) 2025-02-25 23:43:01 -08:00
Joe Runde 3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-14 15:36:07 -08:00
Russell Bryant 73b35cca7f
[Core] Improve hash collision avoidance in prefix caching (#12621)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 16:28:20 -08:00
Russell Bryant e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
    also be easily used by tools to help manage license compliance.
    
The Linux Foundation runs license scans against the codebase to help
ensure
    we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
Cody Yu 7206ce4ce1
[Core] Support `reset_prefix_cache` (#12284) 2025-01-22 18:52:27 +00:00
Sungjae Lee c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240) 2024-12-13 07:51:25 -08:00
Cyrus Leung 8f10d5e393
[Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
Murali Andoorveedu db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
Ricky Xu 4634a89d18
Prefix Cache Aware Scheduling [1/n] (#10128)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-22 21:15:55 -08:00
Pavani Majety 6c1208d083
[Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
Aaron Pham 21063c11c7
[CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
Cyrus Leung bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable (#9604)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-05 10:07:31 +08:00
youkaichao 4fdc581f9e
[core] simplify seq group code (#9569)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-10-24 00:16:44 -07:00
Cody Yu d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Cyrus Leung 051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Kuntai Du 81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
sroy745 f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
youkaichao 18b296fdb2
[core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
Varun Sundar Rabindranath cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-06 12:48:11 -07:00
sroy745 5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots (#8824) 2024-09-29 09:17:45 +08:00
sroy745 fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752) 2024-09-24 21:26:36 -07:00
sroy745 ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py (#8780) 2024-09-24 21:26:18 -07:00
sroy745 88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728) 2024-09-24 04:43:13 +00:00
Cody Yu e3580537a4
[Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
Megha Agarwal 2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Cody Yu 2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cody Yu 3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
SangBin Cho ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109) 2024-08-18 17:57:20 -07:00
Cade Daniel baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Zach Zheng 782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) 2024-08-08 10:43:30 -07:00
afeldman-nm fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
youkaichao c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
Jiaxin Shan 42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Antoni Baum 9ed82e7074
[Misc] Small perf improvements (#6520) 2024-07-19 12:10:56 -07:00
Alexander Matveev 3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 2024-07-01 20:10:37 -07:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
leiwen83 1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-06-14 17:23:56 -07:00
SangBin Cho 847cdcca1c
[CI] Upgrade codespell version. (#5381) 2024-06-12 10:06:14 -07:00
Kaiyang Chen 10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) 2024-06-03 13:37:11 -07:00
Cyrus Leung b1c255630d
[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) 2024-05-29 16:05:01 -07:00