vLLM/vllm - vllm

Commit Graph

Author	SHA1	Message	Date
Caleb_Du	3e887d2e0c	permute/unpermute kernel for moe optimization (#14568 ) Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>	2025-05-02 11:31:55 -07:00
Sage Moore	460a2b1100	[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations (#10867 ) Signed-off-by: Sage Moore <sage@neuralmagic.com>	2025-05-01 07:59:28 -07:00
Kaixi Hou	ed7a29d9f8	[NVIDIA] Support Cutlass MLA for Blackwell GPUs (#16032 ) Signed-off-by: kaixih <kaixih@nvidia.com>	2025-04-27 06:29:21 -07:00
rasmith	8e4b351a0c	[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2025-04-27 00:35:08 +00:00
Happy	9869453c42	Update test_flash_attn.py (#17102 ) Signed-off-by: ShuaibinLi <lishuaibin@live.cn>	2025-04-26 22:17:35 +00:00
Shu Wang	9e96f56efb	Allocate kv_cache with stride order (#16605 ) Signed-off-by: shuw <shuw@nvidia.com>	2025-04-25 22:03:31 -07:00
Michael Goin	82e43b2d7e	Add missing rocm_skinny_gemms kernel test to CI (#17060 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-04-24 07:49:37 -07:00
Michael Goin	6317a5174a	Categorize `tests/kernels/` based on kernel type (#16799 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-04-23 09:21:07 -04:00
vllmellm	30bc3e0f66	[FEAT][ROCm]: Support AITER MLA (#15893 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: qli88 <qiang.li2@amd.com>	2025-04-22 09:31:13 -07:00
Charlie Fu	188b7f9b8c	[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830 ) Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>	2025-04-21 20:46:22 -07:00
Varun Sundar Rabindranath	7b8a2ab76f	[Kernel] Add expert_map support to Cutlass FP8 MOE (#16861 ) Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com> Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>	2025-04-21 20:44:32 -07:00
Tyler Michael Smith	dbb036cf61	[Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py (#16623 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-04-15 05:35:38 +00:00
Jinzhen Lin	d06ba4ed3f	[Kernel] moe wna16 marlin kernel (#14447 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-04-14 20:05:22 -07:00
Michael Goin	f41647ee6b	[Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-04-11 17:54:08 +00:00
DefTruth	e9528f6dc6	[Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173 ) Signed-off-by: DefTruth <qiustudent_r@163.com>	2025-04-11 06:50:50 -06:00
Kebe	e11880deea	[Bugfix] Remove triton do_bench fast_flush arg (#16256 ) Signed-off-by: Kebe <mail@kebe7jun.com>	2025-04-08 13:51:06 +00:00
bnellnm	dcc56d62da	[Bugfix] Fix function names in test_block_fp8.py (#16033 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2025-04-03 23:01:34 +00:00
bnellnm	15ba07ef25	[Minor] Fused experts refactor (#15914 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2025-04-03 10:19:38 -07:00
Aleksandr Malyshev	e73ff24e31	[ROCM][KERNEL] Paged attention for V1 (#15720 ) Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com>	2025-04-02 19:48:00 -07:00
LukasBluebaum	90969fb39a	[Kernel] Add more dtype support for GGUF dequantization (#15879 ) Signed-off-by: lukas.bluebaum <lukas.bluebaum@aleph-alpha.com>	2025-04-02 01:58:48 -07:00
Gerald	9ef98d527e	[Model][MiniMaxText01] Support MiniMaxText01 model inference (#13454 ) Signed-off-by: qscqesze <475517977@qq.com> Co-authored-by: qingjun <qingjun@minimaxi.com> Co-authored-by: qscqesze <475517977@qq.com>	2025-04-01 16:23:55 -04:00
bnellnm	e59ca942f5	Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2025-04-01 12:07:43 -04:00
youkaichao	555aa21905	[V1] Fully Transparent Implementation of CPU Offloading (#15354 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-03-31 20:22:34 +08:00
ElizaWszola	9239bf718e	[Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com> Signed-off-by: ElizaWszola <ewszola@redhat.com> Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>	2025-03-27 00:54:44 +00:00
vllmellm	5ebf66748b	[FEAT][ROCm] Integrate Fused MoE Kernels from AITER (#14967 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-03-26 16:30:30 +08:00
Thien Tran	4f044b1d67	[Kernel][CPU] CPU MLA (#14744 ) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>	2025-03-25 09:34:59 +00:00
Gregory Shtrasberg	f533b5837f	[ROCm][Kernel] MoE weights padding (#14454 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: charlifu <charlifu@amd.com>	2025-03-24 23:45:30 +00:00
Jinzhen Lin	6b3cc75be0	[Kernel] allow non-contiguous input for marlin kernel (#14658 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-03-24 09:21:33 -04:00
Russell Bryant	b877031d80	Remove openvino support in favor of external plugin (#15339 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-03-22 14:06:39 -07:00
Isotr0py	8afcd0f633	[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend (#15282 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-03-21 11:42:06 +00:00
Mickaël Seznec	a597a57595	[Attention] Flash Attention 3 - fp8 (#14570 ) Signed-off-by: Mickael Seznec <mickael@mistral.ai>	2025-03-20 01:14:20 -04:00
Sibi	a73e183e36	[Misc] Replace os environ to monkeypatch in test suite (#14516 ) Signed-off-by: sibi <85477603+t-sibiraj@users.noreply.github.com> Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Aaron Pham <contact@aarnphm.xyz>	2025-03-16 20:35:57 -07:00
Lucas Wilkinson	5952d8ab61	[Attention] Get rid of mla cache alignment (#14842 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-03-15 05:08:25 +00:00
Robert Shaw	d4d93db2c5	[V1] V1 Enablement Oracle (#13726 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-03-14 22:02:20 -07:00
Szymon Ożóg	e22ee1e7a2	[Kernel] GGUF MoE kernel (#14613 ) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>	2025-03-12 03:33:27 +00:00
Jeff Daily	a1c8f3796c	dynamic distpatch of fp8 kernels (#14245 ) Signed-off-by: Jeff Daily <jeff.daily@amd.com>	2025-03-11 10:54:56 -04:00
Szymon Ożóg	89cdaa83e7	[Kernel] Add more dtype support for GGUF kernels (#14043 ) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>	2025-03-10 07:30:04 -07:00
Jinzhen Lin	d0feea31c7	[Kernel] optimize performance of gptq marlin kernel when n is small (#14138 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-03-07 11:53:38 -05:00
Thomas Parnell	6bd1dd9d26	[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152 )	2025-03-06 07:39:16 -08:00
Nicolò Lucchesi	5ee10e990d	[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301 )	2025-03-05 20:00:53 -08:00
Michael Goin	dae9ec464c	Temporarily disable test_awq_gemm_opcheck (#14251 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-03-05 06:10:35 +00:00
Tyler Michael Smith	72c62eae5f	[V1] EP/TP MoE + DP Attention (#13931 )	2025-03-04 21:27:26 -08:00
TJian	848a6438ae	[ROCm] Faster Custom Paged Attention kernels (#12348 )	2025-03-03 09:24:45 -08:00
Harry Mellor	cf069aa8aa	Update deprecated Python 3.8 typing (#13971 )	2025-03-02 17:34:51 -08:00
YajieWang	6a92ff93e1	[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931 )	2025-02-28 22:30:59 -08:00
Michael Goin	788f284b53	Fix test_block_fp8.py test for MoE (#13915 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-27 18:00:00 +08:00
Lucas Wilkinson	f95903909f	[Kernel] FlashMLA integration (#13747 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-02-27 10:35:08 +08:00
Gregory Shtrasberg	aabeb2688f	[ROCm][Quantization][Kernel] Using HIP FP8 header (#12593 )	2025-02-25 00:39:59 -08:00
Harry Mellor	cdc1fa12eb	Remove unused kwargs from model definitions (#13555 )	2025-02-24 17:13:52 -08:00
Jongseok Park	781096e385	Expert Parallelism (EP) Support for DeepSeek V2 (#12583 )	2025-02-24 07:33:20 -08:00
Sage Moore	558db8083c	[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095 )	2025-02-22 05:25:41 -08:00
Kaixi Hou	e109e598c7	[NVIDIA] Support nvfp4 cutlass gemm (#13571 )	2025-02-22 05:24:05 -08:00
Lucas Wilkinson	288cc6c234	[Attention] MLA with chunked prefill (#12639 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Patrick Horn <patrick.horn@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-21 15:30:12 -08:00
Liangfu Chen	3809458456	[Bugfix] Fix invalid rotary embedding unit test (#13431 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com>	2025-02-18 11:52:03 +00:00
youkaichao	124776ebd5	[ci] skip failed tests for flashinfer (#13352 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-16 22:09:15 +08:00
wchen61	dc0f7ccf8b	[BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187 ) Signed-off-by: wchen61 <wchen61@foxmail.com>	2025-02-16 08:59:49 +00:00
Tyler Michael Smith	c1e37bf71b	[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-14 00:01:14 +00:00
Kaixi Hou	4fc5c23bb6	[NVIDIA] Support nvfp4 quantization (#12784 )	2025-02-12 19:51:51 -08:00
Yu Chin Fabian Lim	aff404571b	Add Bamba Model (#10909 ) Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-06 15:22:42 -08:00
Isotr0py	85ac82d228	[Kernel] Make rotary_embedding ops more flexible with input shape (#12777 )	2025-02-06 08:46:13 -08:00
Lucas Wilkinson	75e94309e8	[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-02-04 18:22:24 -08:00
Hongxia Yang	c36ac98d01	[AMD][ROCm] Enable DeepSeek model on ROCm (#12662 ) Signed-off-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2025-02-04 08:24:11 +00:00
Russell Bryant	e489ad7a21	[Misc] Add SPDX-License-Identifier headers to python source files (#12628 ) - Add SPDX license headers to python source files - Check for SPDX headers using pre-commit commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745 Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:18:24 2025 -0500 Add SPDX license headers to python source files This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance. The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job. More information can be found on the SPDX site: - https://spdx.dev/learn/handling-license-info/ Signed-off-by: Russell Bryant <rbryant@redhat.com> commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:36:32 2025 -0500 Check for SPDX headers using pre-commit Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 11:58:18 -08:00
Lucas Wilkinson	cabaf4eff3	[Attention] MLA decode optimizations (#12528 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-30 23:49:37 -08:00
Lucas Wilkinson	9798b2fb00	[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868 )	2025-01-30 18:33:00 -08:00
Jinzhen Lin	27b78c73ca	[Kernel] add triton fused moe kernel for gptq/awq (#12185 )	2025-01-29 09:07:09 -05:00
Harry Mellor	823ab79633	Update `pre-commit` hooks (#12475 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-27 17:23:08 -07:00
Bowen Wang	2bc3fbba0c	[FlashInfer] Upgrade to 0.2.0 (#11194 ) Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-01-27 18:19:24 +00:00
Tyler Michael Smith	72f4880425	[Bugfix/CI] Fix broken kernels/test_mha.py (#12450 )	2025-01-26 10:39:03 -08:00
Tyler Michael Smith	aa2cd2c43d	[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2025-01-26 19:59:58 +08:00
Isotr0py	f1fc0510df	[Misc] Add FA2 support to ViT MHA layer (#12355 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-25 15:07:35 +08:00
Lucas Wilkinson	ab5bbf5ae3	[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-24 15:27:59 +00:00
Gregory Shtrasberg	e97f802b2d	[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com>	2025-01-23 18:04:03 +00:00
Lucas Wilkinson	978b45f399	[Kernel] Flash Attention 3 Support (#12093 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-23 06:45:48 -08:00
rasmith	68c4421b6d	[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2025-01-23 00:10:37 +00:00
Isotr0py	dd7c9ad870	[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12067 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-16 10:11:54 +00:00
wangxiyuan	3adf0ffda8	[Platform] Do not raise error if _Backend is not found (#12023 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-15 10:14:15 +00:00
Jee Jee Li	42f5e7c52a	[Kernel] Support MulAndSilu (#11624 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-15 02:29:53 +00:00
Avshalom Manevich	263a870ee1	[Hardware][TPU] workaround fix for MoE on TPU (#11764 )	2025-01-12 10:53:51 -05:00
Chen Zhang	cf5f000d21	[torch.compile] Hide KV cache behind torch.compile boundary (#11677 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-10 13:14:42 +08:00
wangxiyuan	405eb8e396	[platform] Allow platform specify attention backend (#11609 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-09 21:46:50 +08:00
Chen Zhang	e20c92bb61	[Kernel] Move attn_type to Attention.__init__() (#11690 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-07 00:11:28 +08:00
Woosuk Kwon	73001445fb	[V1] Implement Cascade Attention (#11635 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-01 21:56:46 +09:00
youkaichao	b12e87f942	[platforms] enable platform plugins (#11602 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-30 20:24:45 +08:00
Michael Goin	2072924d14	[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523 ) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: HandH1998 <1335248067@qq.com>	2024-12-26 15:33:30 -08:00
Tyler Michael Smith	5a9da2e6e9	[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-12-19 02:43:30 +00:00
Dipika Sikka	60508ffda9	[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995 ) Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com> Co-authored-by: ilmarkov <markovilya197@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2024-12-18 09:57:16 -05:00
Luka Govedič	30870b4f66	[torch.compile] Dynamic fp8 + rms_norm fusion (#10906 ) Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-12-13 03:19:23 +00:00
zhou fan	78029b34ed	[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928 ) Signed-off-by: xffxff <1247714429@qq.com>	2024-12-08 01:21:18 +08:00
Woosuk Kwon	073a4bd1c0	[Kernel] Use `out` arg in flash_attn_varlen_func (#10811 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-12-01 17:55:39 -08:00
Wallas Henrique	c27df94e1f	[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850 ) Signed-off-by: Wallas Santos <wallashss@ibm.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-11-25 12:23:32 -05:00
youkaichao	05d1f8c9c6	[misc] move functions to config.py (#10624 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-25 09:27:30 +00:00
youkaichao	eebad39f26	[torch.compile] support all attention backends (#10558 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-22 14:04:42 -08:00
Lucas Wilkinson	d200972e7f	[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-11-19 19:40:33 -08:00
ElizaWszola	b00b33d77e	[Model][Quantization] HQQ support through Marlin kernel expansion (#9766 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com>	2024-11-19 13:31:12 -08:00
Mengqing Cao	8c1fb50705	[Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` (#10358 ) Signed-off-by: Mengqing Cao <cmq0113@163.com>	2024-11-19 11:22:26 +08:00
Lucas Wilkinson	96d999fbe8	[Kernel] Initial Machete W4A8 support + Refactors (#9855 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-11-18 12:59:29 -07:00
ElizaWszola	79ee45b428	[Misc] Bump up test_fused_moe tolerance (#10364 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com>	2024-11-15 16:31:18 +00:00
Luka Govedič	bf2ddc6610	[bugfix] Fix static asymmetric quantization case (#10334 ) Signed-off-by: Daniël de Kok <me@danieldk.eu> Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-11-15 09:35:11 +08:00
rasmith	127c07480e	[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2024-11-08 19:59:22 -05:00

337 Commits