Commit Graph

414 Commits

Author SHA1 Message Date
Alexander Matveev e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Michael Goin 978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) 2024-07-16 15:31:32 -07:00
Tyler Michael Smith 9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384) 2024-07-14 13:37:19 +00:00
Michael Goin 47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) 2024-07-03 17:38:00 +00:00
Joe Runde ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
Tyler Michael Smith 5d2a1a9cf0
Unmark more files as executable (#5962) 2024-06-28 17:34:56 -04:00
Tyler Michael Smith 6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) 2024-06-28 17:10:34 +00:00
Chip Kerchner 38a1674abb
Support CPU inference with VSX PowerPC ISA (#5652) 2024-06-26 21:53:04 +00:00
Luka Govedič 5bfd1bbc98
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Varun Sundar Rabindranath 6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-23 21:07:11 +00:00
Roger Wang bd620b01fb
[Kernel][CPU] Add Quick `gelu` to CPU (#5717) 2024-06-21 06:39:40 +00:00
Jinzhen Lin 1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
Tyler Michael Smith 3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) 2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
Roger Wang ad137cd111
[Model] Port over CLIPVisionModel for VLMs (#5591) 2024-06-20 11:52:09 +00:00
Varun Sundar Rabindranath 111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
Tyler Michael Smith b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression (#5642) 2024-06-18 23:48:49 +00:00
sergey-tinkoff 07feecde1a
[Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Joe Runde 5002175e80
[Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Tyler Michael Smith 348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) 2024-06-14 10:02:00 -07:00
Tyler Michael Smith 703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) 2024-06-14 09:30:15 -07:00
Jie Fu (傅杰) cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) 2024-06-13 17:22:24 -06:00
Tyler Michael Smith 85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cody Yu 5985e3427d
[Kernel] Vectorized FP8 quantize kernel (#5396)
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations (sketched below):

- Use the inverted scale so that most divisions become multiplications.
- Unroll the loop by a factor of 4 to improve ILP.
- Use vectorized loads and stores of 4 elements to transfer data between HBM and SRAM.
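As a rough illustration of how these three optimizations combine, here is a minimal CUDA sketch (the `float4_t`/`fp8x4_t` helpers are hypothetical, and `__nv_fp8_e4m3` needs CUDA 11.8+); it is illustrative, not the actual vLLM kernel:

```cuda
#include <cuda_fp8.h>

struct alignas(16) float4_t { float x, y, z, w; };        // hypothetical 16-byte vector
struct alignas(4) fp8x4_t { __nv_fp8_e4m3 x, y, z, w; };  // 4 packed FP8 values

__global__ void scaled_fp8_quant_sketch(fp8x4_t* __restrict__ out,
                                        const float4_t* __restrict__ in,
                                        const float* __restrict__ scale,
                                        int num_vec4) {
  // Optimization 1: invert the scale once so the hot loop multiplies, not divides.
  const float inv_scale = 1.0f / (*scale);
  const int stride = gridDim.x * blockDim.x;
  // Optimizations 2 + 3: each trip handles 4 elements via one 16-byte load and
  // one 4-byte store; the pragma asks the compiler to unroll the loop further.
#pragma unroll 4
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_vec4; i += stride) {
    const float4_t v = in[i];
    fp8x4_t q;
    q.x = __nv_fp8_e4m3(v.x * inv_scale);
    q.y = __nv_fp8_e4m3(v.y * inv_scale);
    q.z = __nv_fp8_e4m3(v.z * inv_scale);
    q.w = __nv_fp8_e4m3(v.w * inv_scale);
    out[i] = q;
  }
}
```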
2024-06-12 14:07:26 -07:00
bnellnm 5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
Jie Fu (傅杰) 6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend (#5345) 2024-06-07 14:09:13 -07:00
Dipika Sikka ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-05 10:44:15 -07:00
Yuan cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests (#4113) 2024-06-03 10:39:50 -07:00
Tyler Michael Smith cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) 2024-06-03 09:52:30 -07:00
Divakar Verma a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927)
This PR enables the fused topk_softmax kernel used in the MoE layer on HIP.
2024-06-02 14:13:26 -07:00
Varun Sundar Rabindranath f081c3ce4b
[Kernel] Update Cutlass fp8 configs (#5144)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith 260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) 2024-06-01 06:45:32 +00:00
Tyler Michael Smith 1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) 2024-05-31 17:21:38 -07:00
Simon Mo e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149) 2024-05-30 22:00:26 -07:00
SnowDist a22dea54d3
[Model] Support MAP-NEO model (#5081)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Alexander Matveev 6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136) 2024-05-30 21:02:11 -05:00
Eric Xihui Lin 8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
Dipika Sikka a1242324c9
[Kernel] Initial Activation Quantization Support (#4525)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Alexander Matveev 6066253296
Marlin 24 prefill performance improvement (about 25% better on average) (#4983) 2024-05-23 02:39:27 -04:00
raywanb 97b030005c
[Model] LoRA gptbigcode implementation (#3949) 2024-05-22 13:58:59 -07:00
Tyler Michael Smith 8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
Pass the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs (a sketch of the idea follows).
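A minimal sketch of the idea, assuming PyTorch's c10 stream API and a CUTLASS device-level GEMM type (the `GemmOp` template parameter and function name are placeholders), rather than the actual vLLM code:

```cuda
// Sketch: run a CUTLASS GEMM on the caller's current CUDA stream rather than the
// default stream, so CUDA graph capture records the launch on the stream it expects.
#include <cutlass/cutlass.h>
#include <c10/cuda/CUDAStream.h>

template <typename GemmOp>  // a configured cutlass::gemm::device::* type (placeholder)
cutlass::Status run_on_current_stream(typename GemmOp::Arguments const& args,
                                      void* workspace) {
  // The stream the surrounding PyTorch op (and any graph capture) is using.
  cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
  GemmOp gemm_op;
  // CUTLASS device-level GEMMs accept an optional stream argument; forwarding it
  // keeps the kernel off the legacy default stream.
  return gemm_op(args, workspace, stream);
}
```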
2024-05-22 14:10:43 +00:00
Michael Goin 5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) 2024-05-22 07:18:41 +00:00
Alexander Matveev da5a0b539d
Remove marlin warning (#4918) 2024-05-20 14:55:34 +00:00
Tyler Michael Smith 2060e93659
[Kernel] Add w8a8 CUTLASS kernels (#4749) 2024-05-16 18:32:50 -04:00
Silencio 8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
Alexander Matveev 6979ade384
Add GPTQ Marlin 2:4 sparse structured support (#4790)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Jinzhen Lin 99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel (#4788) 2024-05-16 09:55:29 -04:00
Steve Grubb dac6a3f6ed
[Misc] Apply a couple g++ cleanups (#4719) 2024-05-10 13:37:05 +00:00
Cody Yu c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) 2024-05-09 18:04:17 -06:00
kliuae ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
Co-authored-by: miloice <jeffaw99@hotmail.com>
2024-05-09 09:19:50 -07:00
alexm-nm e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) 2024-05-08 17:14:31 -07:00
youkaichao 20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) 2024-05-08 12:07:05 -07:00
youkaichao 63575bc2e1
[Core][Optimization] change python dict to pytorch tensor (#4607) 2024-05-06 21:30:27 -07:00
Philipp Moritz a98187cf72
[Kernel] Make static FP8 scaling more robust (#4570)
Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs (sketched below), the performance is

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
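As a rough sketch of the clamping described above (an illustrative helper, not the vLLM source; assumes a PyTorch that ships `c10::Float8_e4m3fn`):

```cuda
#include <c10/util/Float8_e4m3fn.h>

__device__ __forceinline__ c10::Float8_e4m3fn scaled_fp8_conversion(float val,
                                                                    float inverted_scale) {
  float x = val * inverted_scale;
  // std::numeric_limits<c10::Float8_e4m3fn>::max() == 448.0f; e4m3fn has no Inf,
  // so values outside [-448, 448] would otherwise convert to NaN.
  constexpr float kFp8Max = 448.0f;
  x = fminf(kFp8Max, fmaxf(-kFp8Max, x));
  return static_cast<c10::Float8_e4m3fn>(x);
}
```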
2024-05-06 17:39:28 -07:00
Lily Liu 43c413ec57
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
SangBin Cho 3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
alexm-nm 7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin (#4533) 2024-05-02 12:56:22 -04:00
Robert Shaw 73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Austin Veselka eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Philipp Moritz 12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-27 04:49:59 +00:00
alexm-nm aae08249ac
[Bugfix] Fix marlin kernel crash on H100 (#4218)
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy, without the fractional L2 policy for "evict_first" (see the sketch below). There is no performance difference between the standard async_copy PTX and the previous one.
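For context, here is a hedged sketch of the standard 16-byte async copy (modeled on the common Marlin-style helper, not copied from the PR); the previous variant wrapped the same copy with an `evict_first` L2 cache-hint policy:

```cuda
// Standard global->shared 16-byte async copy (requires sm_80+). The crash-prone
// variant added
//   createpolicy.fractional.L2::evict_first.b64 p, 1.0;
//   cp.async.cg.shared.global.L2::cache_hint [...], [...], 16, p;
// to request streaming ("evict_first") behavior in L2; dropping that hint keeps
// the copy on the plain cp.async path with no measurable performance difference.
__device__ __forceinline__ void cp_async4(void* smem_ptr, const void* glob_ptr) {
  constexpr int kBytes = 16;
  // cp.async needs a shared-memory address in the .shared state space.
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], %2;\n"
               :: "r"(smem), "l"(glob_ptr), "n"(kBytes));
}
```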
2024-04-24 10:35:01 -07:00
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
Philipp Moritz eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users neither need to compute activation scales on a calibration dataset nor convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
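For reference, the dynamic per-tensor scaling described above boils down to deriving the scale from the tensor's runtime absolute maximum. A rough host-side sketch (hypothetical helper, not the vLLM implementation):

```cuda
// Derive the per-tensor FP8 scale from the runtime abs-max, so no calibration
// pass or checkpoint conversion is needed.
#include <algorithm>
#include <cmath>
#include <cstddef>

inline float dynamic_per_tensor_fp8_scale(const float* x, std::size_t n) {
  constexpr float kFp8E4M3Max = 448.0f;  // largest finite e4m3 value
  float amax = 0.0f;
  for (std::size_t i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
  // Quantization computes q_i = round_to_e4m3(x_i / scale); the GEMM epilogue
  // multiplies by `scale` again to recover the original magnitude.
  return amax > 0.0f ? amax / kFp8E4M3Max : 1.0f;  // any positive scale works for all-zero input
}
```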

**Performance**: For this PR, the focus is on making the code clean (while still trying to get reasonable performance); there are a number of optimizations that we will submit as a follow-up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

![Performance screenshot (2024-04-21)](https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03)


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the FP16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
2024-04-24 01:18:23 +00:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Shoichi Uchinami a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
Jee Li 989ae2538d
[Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
Antoni Baum 1e96c3341a
Add extra punica sizes to support bigger vocabs (#4015) 2024-04-11 22:18:57 +00:00
Antoni Baum a10d3056da
[Core] Set `linear_weights` directly on the layer (#3977) 2024-04-11 16:35:51 -04:00
fuchen.ljl 08ccee1e83
punica fix-bgmv-kernel-640 (#4007) 2024-04-11 08:59:26 -07:00
Matt Wong 59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) 2024-04-08 14:31:02 -07:00
Woosuk Kwon 498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend (#3840) 2024-04-04 04:33:08 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
bigPYJ1151 0e3f06fe9c
[Hardware][Intel] Add CPU inference backend (#3634)
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
mawong-amd b6d103542c
[Kernel] Layernorm performance optimization (#3662) 2024-03-30 14:26:38 -07:00
Jee Li 566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Jee Li 8af890a865
Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Hanzhi Zhou f721096d48
[BugFix] Some fixes for custom allreduce kernels (#2760) 2024-03-21 23:02:58 -07:00
Woosuk Kwon 9101d832e6
[Bugfix] Make moe_align_block_size AMD-compatible (#3470) 2024-03-18 11:26:24 -07:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
akhoroshev 78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) 2024-03-14 18:18:07 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
kliuae c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321) 2024-03-11 13:14:07 -07:00
Douglas Lehr e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) 2024-03-10 15:27:45 -07:00
Terry 0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
whyiug c59e120c55
Feature add lora support for Qwen2 (#3177) 2024-03-07 21:58:24 -08:00
Robert Shaw c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
CHU Tianxiang 01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) 2024-02-28 21:52:23 -08:00
Woosuk Kwon 929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Woosuk Kwon d6e4a130b0
[Minor] Remove gather_cached_kv kernel (#3043) 2024-02-26 15:00:54 -08:00
Woosuk Kwon fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Rex 563836496a
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
zhaoyang-star 923797fea4
Fix compile error when using rocm (#2648) 2024-02-01 09:35:09 -08:00
Philipp Moritz ab40644669
Fused MOE for Mixtral (#2542)
Co-authored-by: chen shen <scv119@gmail.com>
2024-01-29 22:43:37 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
Hanzhi Zhou 1b20639a43
No repeated IPC open (#2642) 2024-01-29 10:46:29 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Woosuk Kwon f8ecb84c02
Speed up Punica compilation (#2632) 2024-01-27 17:46:56 -08:00
Hanzhi Zhou 380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Casper beb89f68b4
AWQ: Up to 2.66x higher throughput (#2566) 2024-01-26 23:53:17 -08:00
Hongxia Yang 6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
Vladimir 5265631d15
use a correct device when creating OptionalCUDAGuard (#2583) 2024-01-25 23:48:17 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Woosuk Kwon 6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM (#2318) 2024-01-03 09:52:29 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
kliuae 1b7c791d60
[ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Woosuk Kwon 76a7983b23
[BugFix] Fix RoPE kernel on long sequences (#2164) 2023-12-17 17:09:10 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Mingcan Xiang 614856da25
Avoid multiple redefinition (#1817) 2023-12-14 09:35:58 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
TJian 6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
ljss e1054247ba
[Optimization] Implement fused add rmsnorm (#1667) 2023-11-18 18:18:02 -08:00
Antoni Baum 9f669a9a7c
Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
chooper1 1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Woosuk Kwon 29678cd213
Minor fix on AWQ kernel launch (#1356) 2023-10-15 21:53:56 -07:00
CHU Tianxiang 980dd4a2c4
Fix overflow in awq kernel (#1295)
Co-authored-by: 楚天翔 <tianxiang.ctx@alibaba-inc.com>
2023-10-11 00:19:53 -07:00
twaka 8285736840
workaround of AWQ for Turing GPUs (#1252) 2023-10-10 19:48:16 -07:00
Liang ebe4d1db3a
Fix boundary check in paged attention kernel (#1241) 2023-10-01 11:35:06 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Woosuk Kwon 2b1c116b5a
Add minimum capability requirement for AWQ (#1064) 2023-09-18 12:02:01 -07:00
Woosuk Kwon e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* [FIX] Fix Alibi implementation in PagedAttention kernel

* Fix test_attention

* Fix

---------

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Woosuk Kwon bf87484efa
[BugFix] Fix NaN errors in paged attention kernel (#936) 2023-09-04 09:20:06 +09:00
Woosuk Kwon 8ce9c50d40
Avoid compiling kernels for double data type (#933) 2023-09-02 14:59:47 +09:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Dean Leitersdorf 79af7e96a0
[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel (#420) 2023-08-04 10:57:29 -07:00
Zhuohan Li 1b0bd0fe8a
Add Falcon support (new) (#592) 2023-08-02 14:04:39 -07:00
Zhuohan Li 6fc2a38b11
Add support for LLaMA-2 (#505) 2023-07-20 11:38:27 -07:00
Zhuohan Li 96853af5a8
Optimize MQA Kernel (#452) 2023-07-14 20:06:40 -04:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon 404422f42e
[Model] Add support for MPT (#334) 2023-07-03 16:47:53 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon d721168449
Improve setup script & Add a guard for bfloat16 kernels (#130) 2023-05-27 00:59:32 -07:00
Woosuk Kwon 667ba3995c
Add copyright headers to source files adapted from FT (#104) 2023-05-14 22:19:19 -07:00
Woosuk Kwon 130d5fd8c7
Fix a bug in attention kernel (#68) 2023-05-04 02:56:09 -07:00
Woosuk Kwon e070829ae8
Support bfloat16 data type (#54) 2023-05-03 14:09:44 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Woosuk Kwon 0f4b32199e
Support various block sizes & Change default block size to 16 (#38) 2023-04-15 09:03:24 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize

* add benchmark

* add assert

* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention

* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon cfae35b861
Add miscellaneous updates (#8) 2023-03-13 13:48:38 -07:00
Woosuk Kwon 1a7eb7da61
Support beam search & parallel generation (#7) 2023-03-10 09:58:21 -08:00
Woosuk Kwon 0deacbce6e
Implement `single_query_cached_kv_attention` kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon c413c41cda Add reshape_and_cache op 2023-02-18 19:22:57 +00:00
Woosuk Kwon ffad4e1e03 cache_kernel -> cache_kernels 2023-02-16 20:05:45 +00:00
Woosuk Kwon 6d2f74efb3 Remove redundant fn 2023-02-16 09:24:42 +00:00
Woosuk Kwon 6f058c7ba8 Implement cache ops 2023-02-16 07:47:03 +00:00