Commit Graph

352 Commits

Author SHA1 Message Date
Tyler Michael Smith 5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Dipika Sikka 60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Luka Govedič 30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
zhou fan 78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
Woosuk Kwon 073a4bd1c0
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-01 17:55:39 -08:00
Wallas Henrique c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-11-25 12:23:32 -05:00
youkaichao 05d1f8c9c6
[misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
youkaichao eebad39f26
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
Lucas Wilkinson d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
ElizaWszola b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
Mengqing Cao 8c1fb50705
[Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` (#10358)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
Lucas Wilkinson 96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
ElizaWszola 79ee45b428
[Misc] Bump up test_fused_moe tolerance (#10364)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
Luka Govedič bf2ddc6610
[bugfix] Fix static asymmetric quantization case (#10334)
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
rasmith 127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-08 19:59:22 -05:00
Luka Govedič 4f93dfe952
[torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Joe Runde d58268c56a
[V1] Make v1 more testable (#9888)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Aaron Pham 21063c11c7
[CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
Michael Goin 235366fe2e
[CI] Prune back the number of tests in tests/kernels/* (#9932)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:32 -05:00
sroy745 a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559) 2024-11-01 23:22:49 -07:00
Peter Salas 6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
Pavani Majety 598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861) 2024-11-01 12:15:05 -07:00
Mor Zusman 9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
wangshuai09 622b7ab955
[Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
youkaichao 32176fee73
[torch.compile] support moe models (#9632)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
wangshuai09 4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Mengqing Cao 5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716) 2024-10-26 10:59:06 +00:00
Charlie Fu 59449095ab
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Jee Jee Li 295a061fb3
[Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
wangshuai09 3ddbe25502
[Hardware][CPU] using current_platform.is_cpu (#9536) 2024-10-22 00:50:43 -07:00
Chen Zhang 4fa3e33349
[Kernel] Support sliding window in flash attention backend (#9403) 2024-10-20 10:57:52 -07:00
bnellnm eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Mor Zusman fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) 2024-10-16 12:12:43 -04:00
Cyrus Leung 7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
Tyler Michael Smith 7342a7d7f8
[Model] Support Mamba (#6484) 2024-10-11 15:40:06 +00:00
Lucas Wilkinson a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) (#9212) 2024-10-10 14:16:17 +08:00
bnellnm bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch (#9086) 2024-10-08 14:28:12 -07:00
ElizaWszola 05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
youkaichao 9aaf14c62e
[misc] add forward context for attention (#9029) 2024-10-03 12:09:42 -07:00
Mor Zusman f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) 2024-09-29 17:35:58 -04:00
ElizaWszola d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
youkaichao a9b15c606f
[torch.compile] use empty tensor instead of None for profiling (#8875) 2024-09-27 08:11:32 -07:00
bnellnm 300da09177
[Kernel] Fullgraph and opcheck tests (#8479) 2024-09-25 08:35:52 -06:00
Lucas Wilkinson 86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Charlie Fu 9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
Tyler Michael Smith db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
Cyrus Leung 6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Tyler Michael Smith 8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
Simon Mo 546034b466
[refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Luka Govedič 5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
ElizaWszola a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py fc990f9795
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
Charlie Fu 1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Cyrus Leung a84e598e21
[CI/Build] Reorganize models tests (#7820) 2024-09-13 10:20:06 -07:00
bnellnm 73202dbe77
[Kernel][Misc] register ops to prevent graph breaks (#6917)
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Dipika Sikka 6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00
Elfie Guo e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) 2024-09-05 05:12:26 +00:00
Pavani Majety 6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
youkaichao ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00
Mor Zusman fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
rasmith e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
Pavani Majety b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
LI MOU 53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
Lucas Wilkinson 5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Charlie Fu e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) 2024-08-16 10:06:30 -07:00
youkaichao 54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
jon-chuang 50b8d08dbd
[Misc/Testing] Use `torch.testing.assert_close` (#7324) 2024-08-16 04:24:04 +00:00
jon-chuang a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Luka Govedič 5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests (#7305) 2024-08-08 12:16:13 -04:00
afeldman-nm fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič 8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lucas Wilkinson a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Woosuk Kwon 805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
Jee Jee Li 7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
HandH1998 6512937de1
Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
Varun Sundar Rabindranath af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Thomas Parnell 9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Varun Sundar Rabindranath 766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev 75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Joe 14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Antoni Baum 0e63494cf3
Add fp8 support to `reshape_and_cache_flash` (#6667) 2024-07-24 18:36:52 +00:00
Tyler Michael Smith fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) 2024-07-22 14:08:30 -06:00
Alexander Matveev 396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath 2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Robert Shaw 4cc24f01b1
[ Kernel ] Enable Dynamic Per Token `fp8` (#6547) 2024-07-19 23:08:15 +00:00
Varun Sundar Rabindranath b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Michael Goin 978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) 2024-07-16 15:31:32 -07:00
Robert Shaw b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
afeldman-nm 543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Lily Liu 69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Michael Goin 47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) 2024-07-03 17:38:00 +00:00
youkaichao 482045ee77
[hardware][misc] introduce platform abstraction (#6080) 2024-07-02 20:12:22 -07:00
Robert Shaw 7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-02 21:54:35 +00:00
Avshalom Manevich 12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) 2024-07-01 21:08:29 +00:00
youkaichao 614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) 2024-06-30 20:07:34 -07:00
Tyler Michael Smith 6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) 2024-06-28 17:10:34 +00:00
Ilya Lavrenov 57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Luka Govedič 5bfd1bbc98
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Tyler Michael Smith 85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
bnellnm 5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
Dipika Sikka ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Woosuk Kwon 41ca62cf03
[Misc] Add CustomOp interface for device portability (#5255) 2024-06-05 09:18:19 -07:00
afeldman-nm f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) 2024-06-03 20:32:57 -07:00
Tyler Michael Smith cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) 2024-06-03 09:52:30 -07:00
Varun Sundar Rabindranath f081c3ce4b
[Kernel] Update Cutlass fp8 configs (#5144)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith 260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) 2024-06-01 06:45:32 +00:00
SnowDist a22dea54d3
[Model] Support MAP-NEO model (#5081)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Eric Xihui Lin 8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
Dipika Sikka a1242324c9
[Kernel] Initial Activation Quantization Support (#4525)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Alexander Matveev 6066253296
Marlin 24 prefill performance improvement (about 25% better on average) (#4983) 2024-05-23 02:39:27 -04:00
Cody Yu ee3eea0a1b
[Misc] Take user preference in attention selector (#4960) 2024-05-23 07:55:56 +09:00
Tyler Michael Smith 8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
Pass the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs (a rough analogy is sketched below).
2024-05-22 14:10:43 +00:00
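As a rough Python-level analogy for the note above (an assumed sketch of the stream/capture interaction, not the C++ change itself): CUDA graph capture only records work issued on the capture stream, which is why the GEMMs must receive the current stream rather than assume the default one.

```python
import torch

# Assumed illustration, not vLLM code: torch.cuda.graph() redirects work to
# a capture stream, so a kernel must launch on torch.cuda.current_stream()
# (what the fixed GEMMs now receive) instead of a hard-coded default stream.
a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")

# Warm up on a side stream first: cuBLAS allocates workspaces on first use,
# and new allocations are not allowed while a graph is being captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    c = a @ b
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):  # inside, current_stream() is the capture stream
    c = a @ b              # recorded because it runs on that stream
g.replay()                 # replays the captured GEMM
```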
Woosuk Kwon b57e6c5949
[Kernel] Add flash-attn back (#4907) 2024-05-19 18:11:30 -07:00
Alexander Matveev 27ce85476e
[Kernel] Add marlin_24 unit tests (#4901) 2024-05-19 11:37:34 -04:00
Jinzhen Lin 33e0823de5
[Bugfix] fix rope error when load models with different dtypes (#4835) 2024-05-17 18:43:34 +09:00
Tyler Michael Smith 2060e93659
[Kernel] Add w8a8 CUTLASS kernels (#4749) 2024-05-16 18:32:50 -04:00
alexm-nm 5c342570d7
Add marlin unit tests and marlin benchmark script (#4815) 2024-05-16 09:36:49 -04:00
SangBin Cho 8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
Lora 3 & 4 test seems to have illegal memory access failure after this commit;

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df5.

2024-05-15 11:52:45 +09:00
Stephen Krider 1356df53bd
[Kernel] Use flash-attn for decoding (#3648)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cyrus Leung 350f9e107f
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the OpenAI API test. I have moved the class into the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relatively (layout sketched below).
2024-05-13 23:50:09 +09:00
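A sketch of the resulting layout (only tests/utils.py and ServerRunner are named in the commit; the subpackage and file names below are hypothetical):

```python
# Hypothetical layout after the move described above:
#
#   tests/
#     __init__.py          # added: makes the test tree a package
#     utils.py             # formerly test_utils.py; defines ServerRunner
#     entrypoints/
#       __init__.py        # added to each test subpackage
#       test_server.py
#
# Inside tests/entrypoints/test_server.py:
from ..utils import ServerRunner  # relative import of tests/utils.py
```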
Cody Yu c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) 2024-05-09 18:04:17 -06:00
youkaichao 230c4b38c1
[CI/Test] fix swap test for multi gpu (#4689) 2024-05-08 13:14:02 -07:00
youkaichao 20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) 2024-05-08 12:07:05 -07:00
DefTruth 0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) 2024-05-08 09:19:58 -07:00
youkaichao 63575bc2e1
[Core][Optimization] change python dict to pytorch tensor (#4607) 2024-05-06 21:30:27 -07:00
Michael Goin 2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
Follows on from #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

* Supports loading fp8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
* Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
* Supports different scales for each expert weight
* Supports fp8 in the QKV layer

Notes:

* The expert gate/router always runs at half/full precision for now.
* If the separate Q/K/V weights have different weight scales, they are re-quantized using layer.weight_scale.max() so a single GEMM can be used for performance (sketched below).
2024-05-04 11:45:16 -07:00
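The last note describes a max-scale trick that is easy to picture in plain PyTorch. A minimal sketch follows; requantize_qkv is a hypothetical helper, not the actual vLLM loader code:

```python
import torch

def requantize_qkv(weights, scales):
    """Hypothetical sketch: given separately quantized q/k/v fp8 weights
    with per-tensor scales, rescale everything to the single largest scale
    so the fused QKV projection can run as one GEMM with one weight scale."""
    max_scale = torch.stack(scales).max()
    # Dequantize each tensor with its own scale, then requantize with the max.
    dequant = [w.to(torch.float32) * s for w, s in zip(weights, scales)]
    fused = torch.cat(dequant) / max_scale
    return fused.to(torch.float8_e4m3fn), max_scale

qkv = [torch.randn(16, 16).to(torch.float8_e4m3fn) for _ in range(3)]
scales = [torch.tensor(s) for s in (0.02, 0.05, 0.03)]
w, scale = requantize_qkv(qkv, scales)  # w: (48, 16) fp8; scale == 0.05
```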
Lily Liu 43c413ec57
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
SangBin Cho 3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
Michał Moskal 32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
Michał Moskal e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) 2024-04-18 00:51:28 -07:00
Antoni Baum a10d3056da
[Core] Set `linear_weights` directly on the layer (#3977) 2024-04-11 16:35:51 -04:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
mawong-amd b6d103542c
[Kernel] Layernorm performance optimization (#3662) 2024-03-30 14:26:38 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Woosuk Kwon 925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
youkaichao 8b268a46a7
[CI] typo fix: is_hip --> is_hip() (#3595) 2024-03-24 16:03:06 -07:00
Nick Hill 41deac4a3d
[BugFix] 1D query fix for MoE models (#3597) 2024-03-24 16:00:16 -07:00
Antoni Baum 426ec4ec67
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
simon-mo ad50bf4b25
fix lint 2024-03-15 22:23:38 -07:00
Tao He 3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Woosuk Kwon 2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Tao He 71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Woosuk Kwon fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Lily Liu fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Hongxia Yang 56f738ae9b
[ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz 89efcf1ce5
[Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir 4f65af0e25
Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Jason Zhu 7a0b011dd5
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
shiyi.c_98 d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
Simon Mo 6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
Woosuk Kwon 941767127c
Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Woosuk Kwon 9b294976a2
Add PyTorch-native implementation of custom layers (#1898) 2023-12-02 21:18:40 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Zhuohan Li ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon 6f88f762bf
Fix OOM in attention kernel test (#1223) 2023-09-28 14:33:24 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Woosuk Kwon e67b4f2c2a
Use FP32 in RoPE initialization (#1004)
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* Fix test_attention
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Woosuk Kwon fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Aman Gupta Karmani 75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Tao Peng d7a1c6d614
Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song bda41c70dd
hotfix attn alibi wo head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon a283ec2eec
Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon 825d8892b5
Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon c9d5b6d4a8
Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize

* add benchmark

* add assert

* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon a1b3de86cd
Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00