165 Commits

Author SHA1 Message Date
Roger Wang b6d7392579
[Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` (#8951) 2024-09-30 04:28:26 +00:00
Tyler Titsworth 260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build (#7824)
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-27 23:45:50 -07:00
Daniele 2467b642dd
[CI/Build] fix setuptools-scm usage (#8771) 2024-09-24 12:38:12 -07:00
Daniele ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ (#4738)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-23 09:44:26 -07:00
youkaichao 0e40ac9b7b
[ci][build] fix vllm-flash-attn (#8699) 2024-09-21 23:24:58 -07:00
Luka Govedič 71c60491f2
[Kernel] Build flash-attn from source (#8245) 2024-09-20 23:27:10 -07:00
Simon Mo 5ce45eb54d
[misc] small qol fixes for release process (#8517) 2024-09-16 15:11:27 -07:00
Charlie Fu 1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Yangshen⚡Deng 6a512a00df
[model] Support for Llava-Next-Video model (#7559)
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Daniele 6234385f4a
[CI/Build] enable ccache/scccache for HIP builds (#8327) 2024-09-10 08:55:08 -07:00
tomeras91 c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) (#8118) 2024-09-03 12:37:08 -07:00
Roger Wang 5b86b19954
[Misc] Optional installation of audio related packages (#8063) 2024-09-01 14:46:57 -07:00
sasha0552 1b32e02648
[Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730) 2024-08-21 11:17:48 -07:00
Kunshang Ji 1a36287b89
[Bugfix] Fix xpu build (#7644) 2024-08-18 22:00:09 -07:00
tomeras91 386087970a
[CI/Build] build on empty device for better dev experience (#4773) 2024-08-11 13:09:44 -07:00
Ilya Lavrenov 80cbe10c59
[OpenVINO] migrate to latest dependencies versions (#7251) 2024-08-07 09:49:10 -07:00
Lucas Wilkinson a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Michael Goin b482b9a5b1
[CI/Build] Add support for Python 3.12 (#7035) 2024-08-02 13:51:22 -07:00
Jee Jee Li 7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Ethan Xu dbfe254eda
[Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
youkaichao ccd3c04571
[ci][build] fix commit id (#6420)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-07-14 22:16:21 +08:00
Michael Goin 111fc6e7ec
[Misc] Add generated git commit hash as `vllm.__commit__` (#6386) 2024-07-12 22:52:15 +00:00
Ilya Lavrenov 57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Kunshang Ji 728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Cyrus Leung 03dccc886e
[Misc] Add vLLM version getter to utils (#5098) 2024-06-13 11:21:39 -07:00
Kevin H. Luu 916d219d62
[ci] Use sccache to build images (#5419)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-12 17:58:12 -07:00
Woosuk Kwon 1a8bfd92d5
[Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Woosuk Kwon 8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) 2024-06-11 00:37:56 -07:00
bnellnm 5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
Divakar Verma a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927)
This PR enables the fused topk_softmax kernel used in moe layer for HIP
2024-06-02 14:13:26 -07:00
Daniele a360ff80bb
[CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) 2024-05-31 22:06:45 -06:00
youkaichao 5bd3c65072
[Core][Optimization] remove vllm-nccl (#5091) 2024-05-29 05:13:52 +00:00
Sanger Steel 8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) 2024-05-13 14:57:07 -07:00
kliuae ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
Co-authored-by: miloice <jeffaw99@hotmail.com>
2024-05-09 09:19:50 -07:00
Woosuk Kwon 89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn (#4686) 2024-05-08 13:15:34 -07:00
youkaichao 344bf7cd2d
[Misc] add installation time env vars (#4574) 2024-05-03 15:55:56 -07:00
Hu Dong 5ad60b0cbd
[Misc] Exclude the `tests` directory from being packaged (#4552) 2024-05-02 10:50:25 -07:00
Travis Johnson 8b798eec75
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-05-01 18:01:50 +00:00
Alpay Ariyak 715c2d854d
[Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version (#4467) 2024-04-30 16:32:13 -07:00
SangBin Cho a88081bf76
[CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
Liangfu Chen cd2f63fb36
[CI/CD] add neuron docker and ci test scripts (#3571) 2024-04-18 15:26:01 -07:00
Nick Hill 563c54f760
[BugFix] Fix tensorizer extra in setup.py (#4072) 2024-04-14 14:12:42 -07:00
Sanger Steel 711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) 2024-04-13 17:13:01 -07:00
Michael Feil c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup (#3816)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
Woosuk Kwon cfaf49a167
[Misc] Define common requirements (#3841) 2024-04-05 00:39:17 -07:00
youkaichao ca81ff5196
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) 2024-04-04 10:26:19 -07:00
bigPYJ1151 0e3f06fe9c
[Hardware][Intel] Add CPU inference backend (#3634)
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
youkaichao 3492859b68
[CI/Build] update default number of jobs and nvcc threads to avoid overloading the system (#3675) 2024-03-28 00:18:54 -04:00
youkaichao 8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
youkaichao 42bc386129
[CI/Build] respect the common environment variable MAX_JOBS (#3600) 2024-03-24 17:04:00 -07:00
Zhuohan Li 523e30ea0c
[BugFix] Hot fix in setup.py for neuron build (#3537) 2024-03-20 17:59:52 -07:00
bnellnm ba8ae1d84f
Check for _is_cuda() in compute_num_jobs (#3481) 2024-03-20 10:06:56 -07:00
bnellnm 9fdf3de346
Cmake based build system (#2830) 2024-03-18 15:38:33 -07:00
Woosuk Kwon abfc4f3387
[Misc] Use dataclass for InputMetadata (#3452)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-03-17 10:02:46 +00:00
Simon Mo 6b78837b29
Fix setup.py neuron-ls issue (#2671) 2024-03-16 16:00:25 -07:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
youkaichao 604f235937
[Misc] add error message in non linux platform (#3438) 2024-03-15 21:21:37 +00:00
陈序 739c350c19
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 2024-03-13 09:43:24 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Woosuk Kwon 1cb0cc2975
[FIX] Make `flash_attn` optional (#3269) 2024-03-08 10:52:20 -08:00
Woosuk Kwon 2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Robert Shaw c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Billy Cao 2c08ff23c0
Fix building from source on WSL (#3112) 2024-02-29 11:13:58 -08:00
Philipp Moritz cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
James Whedbee 264017a2bf
[ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
Hongxia Yang 0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Philipp Moritz 931746bc6d
Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Douglas Lehr 2ccee3def6
[ROCm] Fixup arch checks for ROCM (#2627) 2024-02-05 14:59:09 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
Rasmus Larsen ea8489fce2
ROCm: Allow setting compilation target (#2581) 2024-01-29 10:52:31 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou 380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Philipp Moritz 390b495ff3
Don't build punica kernels by default (#2605) 2024-01-26 15:19:19 -08:00
Hongxia Yang 6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Liangfu Chen 18473cf498
[Neuron] Add an option to build with neuron (#2065) 2024-01-18 10:58:50 -08:00
Simon Mo 6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
kliuae 1b7c791d60
[ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Woosuk Kwon 2acd76f346
[ROCm] Temporarily remove GPTQ ROCm support (#2138) 2023-12-15 17:13:58 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
TJian 6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Daya Khudia c8e7eb1eb3
fix typo in getenv call (#1972) 2023-12-07 16:04:41 -08:00
AguirreNicolas 24f60a54f4
[Docker] Adding number of nvcc_threads during build as envar (#1893) 2023-12-07 11:00:32 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Woosuk Kwon fd58b73a40
Build CUDA11.8 wheels for release (#1596) 2023-11-09 03:52:29 -08:00
Stephen Krider 9cabcb7645
Add Dockerfile (#1350) 2023-10-31 12:36:47 -07:00
Jared Roesch 79a30912b8
Add py.typed so consumers of vLLM can get type checking (#1509)
* Add py.typed so consumers of vLLM can get type checking
* Update py.typed
Co-authored-by: aarnphm <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 14:50:47 -07:00
chooper1 1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon d0740dff1b
Fix error message on `TORCH_CUDA_ARCH_LIST` (#1239)
Co-authored-by: Yunfeng Bai <yunfeng.bai@scale.com>
2023-10-14 14:47:43 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Woosuk Kwon a425bd9a9a
[Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs (#1074) 2023-09-26 10:21:08 -07:00
Woosuk Kwon e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Woosuk Kwon d6770d1f23
Update setup.py (#1006) 2023-09-10 23:42:45 -07:00
Woosuk Kwon a41c20435e
Add compute capability 8.9 to default targets (#829) 2023-08-23 07:28:38 +09:00
Xudong Zhang 65fc1c3127
set default coompute capability according to cuda version (#773) 2023-08-21 16:05:44 -07:00
Cody Yu 2b7d3aca2e
Update setup.py (#282)
Co-authored-by: neubig <neubig@gmail.com>
2023-06-27 14:34:23 -07:00
Woosuk Kwon 570fb2e9cc
[PyPI] Fix package info in setup.py (#158) 2023-06-19 18:05:01 -07:00