Commit Graph

2143 Commits

Author SHA1 Message Date
Michael Goin a30482f054
[CI] Expand test_guided_generate to test all backends (#11313)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-19 04:00:38 +00:00
Travis Johnson 17ca964273
[Model] IBM Granite 3.1 (#11307)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-19 11:27:24 +08:00
Tyler Michael Smith 5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Joe Runde ca5f54a9b9
[Bugfix] fix minicpmv test (#11304)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-18 10:34:26 -08:00
Isotr0py 996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-18 10:16:40 -08:00
Dipika Sikka 60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Wallas Henrique 8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral (#11046) 2024-12-17 22:34:08 -08:00
Cody Yu bf8717ebae
[V1] Prefix caching for vision language models (#11187)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-17 16:37:59 -08:00
Michael Goin c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264) 2024-12-17 16:34:06 -08:00
Joe Runde 2d1b9baa8f
[Bugfix] Fix request cancellation without polling (#11190) 2024-12-17 12:26:32 -08:00
kYLe 66d4b16724
[Frontend] Add OpenAI API support for input_audio (#11027)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-16 22:09:58 -08:00
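The `input_audio` support in #11027 follows the OpenAI chat-completions audio content part. A minimal request sketch, assuming a vLLM OpenAI-compatible server on localhost:8000 and an audio-capable model (model name and file path are placeholders):

```python
import base64

from openai import OpenAI

# Point the official OpenAI client at a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base64-encode a local WAV file for the input_audio content part.
with open("sample.wav", "rb") as f:  # placeholder audio file
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

chat = client.chat.completions.create(
    model="fixie-ai/ultravox-v0_3",  # placeholder audio-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is being said in this clip?"},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            },
        ],
    }],
)
print(chat.choices[0].message.content)
```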
Michael Goin 0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse (#10935)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-17 11:39:58 +08:00
youkaichao 551603feff
[core] overhaul memory profiling and fix backward compatibility (#10511)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 13:32:25 -08:00
Isotr0py d927dbcd88
[Model] Refactor Ultravox to use merged input processor (#11198)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-16 10:09:53 +00:00
Jani Monoses bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203) 2024-12-16 09:56:19 +00:00
Cyrus Leung b10609e6a1
[Misc] Clean up multi-modal processor (#11207)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 06:30:28 +00:00
Cyrus Leung 93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor (#11199)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 17:52:18 +00:00
Brad Hilton 9c3dadd1c9
[Frontend] Add `logits_processors` as an extra completion argument (#11150)
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
2024-12-14 16:46:42 +00:00
Cyrus Leung 0920ab9131
[Doc] Reorganize online pooling APIs (#11172)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 00:22:22 +08:00
Sungjae Lee c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240) 2024-12-13 07:51:25 -08:00
Cyrus Leung eeec9e3390
[Frontend] Separate pooling APIs in offline inference (#11129)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-13 10:40:07 +00:00
youkaichao be39e3cd18
[core] clean up cudagraph batchsize padding logic (#10996)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-13 06:57:50 +00:00
Pooya Davoodi 1efce68605
[Bugfix] Use runner_type instead of task in GritLM (#11144)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-13 04:09:53 +00:00
Luka Govedič 30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
Cody Yu 78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching (#11151) 2024-12-13 00:57:40 +00:00
Jiaxin Shan 85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-12 09:25:16 +00:00
youkaichao 62de37a38e
[core][distributed] initialization from StatelessProcessGroup (#10986)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-12 09:04:19 +00:00
Pooya Davoodi 1da8f0e1dd
[Model] Add support for embedding model GritLM (#10816)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-12 06:39:16 +00:00
Alexander Matveev 4e11683368
[V1] VLM preprocessor hashing (#11020)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-12 00:55:30 +00:00
Cyrus Leung d1e21a979b
[CI/Build] Split up VLM tests (#11083)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-12 06:18:16 +08:00
Cyrus Leung 8f10d5e393
[Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
Cyrus Leung 2e33fe4191
[CI/Build] Check transformers v4.47 (#10991)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 05:02:02 +00:00
Mor Zusman ffa48c9146
[Model] PP support for Mamba-like models (#10992)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-12-10 21:53:37 -05:00
Aurick Qiao d5c5154fcf
[Misc] LoRA + Chunked Prefill (#9057) 2024-12-11 10:09:20 +08:00
Jee Jee Li d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA (#11003)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-10 11:12:01 +00:00
youkaichao ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes (#11031)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 22:35:36 -08:00
Tyler Michael Smith 28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-10 06:28:14 +00:00
Isotr0py a811dd6608
[Model] merged input processor for Phi-3-Vision models (#10977)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-09 12:55:10 -08:00
Jee Jee Li ca871491ed
[Misc][LoRA] Abstract PunicaWrapper (#10955)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-09 12:54:44 -08:00
youkaichao fd57d2b534
[torch.compile] allow candidate compile sizes (#10984)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:05:21 +00:00
zhou fan 78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
Cyrus Leung c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:20:49 +00:00
Cyrus Leung 39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model (#10711)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:10:05 +00:00
Cyrus Leung 955fa9533a
[3/N] Support and implement merged input processor for LLaVA model (#10676)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-07 00:50:58 -08:00
Cyrus Leung 222f5b082a
[CI/Build] Fix broken multimodal test (#10950) 2024-12-06 10:41:23 +00:00
youkaichao 9743d64e4e
[ci][build] add tests for python only compilation (#10915)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 08:54:47 -08:00
Isotr0py 998eeafe58
[CI/Build] Bump test transformers version (#10106)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 16:05:52 +00:00
Jee Jee Li 571da8fc43
[Misc][LoRA] Clean up the function interface of Punica (#10917)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:22:28 +00:00
Michael Goin 8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas (#10899)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-05 11:14:06 +08:00
Xin Yang 01d079fd8e
[LoRA] Change lora_tokenizers capacity (#10796)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-12-04 17:40:16 +00:00
Tyler Michael Smith d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-04 03:23:21 +00:00
Alexander Matveev 3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#10640)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-03 10:33:10 +00:00
Aaron Pham 9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-12-03 15:17:00 +08:00
youkaichao dc5ce861bf
[torch.compile] remove compilation_context and simplify code (#10838)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 06:19:02 +00:00
youkaichao 21fe7b481a
[core][distributed] add pynccl broadcast (#10843)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 04:53:23 +00:00
Jee Jee Li b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-02 17:53:36 +00:00
zhou fan ef31eabc68
[Model]: add some tests for aria model (#10770)
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-02 05:36:36 +00:00
Woosuk Kwon 073a4bd1c0
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-01 17:55:39 -08:00
Kuntai Du 0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
This PR provides initial support for single-node disaggregated prefill in the 1P1D (one prefill instance, one decode instance) scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
2024-12-01 19:01:00 -06:00
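In the 1P1D setup, one prefill instance produces KV blocks and one decode instance consumes them. A minimal launch sketch; the `--kv-transfer-config` flag and its JSON keys are assumptions based on the disaggregated-prefill example accompanying this work, and the model name is a placeholder:

```python
import os
import subprocess

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model

# One prefill (kv_producer) and one decode (kv_consumer) instance on a single
# node, each pinned to its own GPU.
prefill = subprocess.Popen(
    ["vllm", "serve", MODEL, "--port", "8100", "--kv-transfer-config",
     '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
     '"kv_rank":0,"kv_parallel_size":2}'],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)
decode = subprocess.Popen(
    ["vllm", "serve", MODEL, "--port", "8200", "--kv-transfer-config",
     '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer",'
     '"kv_rank":1,"kv_parallel_size":2}'],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"},
)

# A small proxy in front of the two servers then sends each request to the
# prefill instance first and streams tokens from the decode instance.
prefill.wait()
decode.wait()
```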
Cyrus Leung d2f058e76c
[Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 14:36:51 +08:00
Cyrus Leung 133707123e
[Model] Replace embedding models with pooling adapter (#10769)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 08:02:54 +08:00
Cyrus Leung fa6ecb9aa7
[Model] Clean up MiniCPMV (#10751)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-29 04:47:06 +00:00
sixgod 5fc5ce0fe4
[Model] Added GLM-4 series HF format model support (vllm==0.6.4) (#10561)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-28 14:53:31 +00:00
Woosuk Kwon a79b122400
[V1] Do not allocate beyond the max_model_len (#10730)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 00:13:15 -08:00
Ricky Xu d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-27 23:59:28 -08:00
tomeras91 395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-27 13:21:10 -08:00
Mor Zusman 197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-11-27 19:02:27 +00:00
youkaichao 308cc5e21e
[ci] fix slow tests (#10698)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 09:26:14 -08:00
shunxing12345 1209261e93
[Model] Support telechat2 (#10311)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-27 11:32:35 +00:00
jeongin601 1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
2024-11-27 05:07:30 +00:00
Chendi.Xue 0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-26 17:57:11 -08:00
Murali Andoorveedu db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
youkaichao 334d64d1e8
[ci] add vllm_test_utils (#10659)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-26 00:20:04 -08:00
Sage Moore 9a88f89799
custom allreduce + torch.compile (#10121)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-25 22:00:16 -08:00
Ricky Xu 519e8e4182
[v1] EngineArgs for better config handling for v1 (#10382)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-25 21:09:43 -08:00
Shane A 9db713a1dc
[Model] Add OLMo November 2024 model (#10503) 2024-11-25 17:26:40 -05:00
zhou fan b1d920531f
[Model]: Add support for Aria model (#10514)
Signed-off-by: xffxff <1247714429@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-25 18:10:55 +00:00
Wallas Henrique c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-11-25 12:23:32 -05:00
Chauncey d04b13a380
[Bug]: Authorization ignored when root_path is set (#10606)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-25 16:21:41 +00:00
Cyrus Leung ed46f14321
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 09:51:20 +00:00
youkaichao 05d1f8c9c6
[misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
youkaichao 25d806e953
[misc] add torch.compile compatibility check (#10618)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:40:08 -08:00
youkaichao 571841b7fc
[torch.compile] support encoder based models (#10613)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 05:24:33 +00:00
Maximilien de Bayser 214efc2c3c
Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-24 18:56:20 -08:00
Jee Jee Li 1700c543a5
[Bugfix] Fix LoRA weight sharding (#10450)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-23 17:23:17 -08:00
Cyrus Leung c8acd80548
[2/N] handling placeholders in merged multi-modal processor (#10485)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-22 21:25:09 -08:00
Ricky Xu 4634a89d18
Prefix Cache Aware Scheduling [1/n] (#10128)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-22 21:15:55 -08:00
Varun Vinayak Shenoy 7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
2024-11-22 21:13:29 -08:00
youkaichao 4aba6e3d1a
[core] gemma2 full context length support (#10584)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 20:13:54 -08:00
Tyler Michael Smith 978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432) 2024-11-22 22:14:03 -05:00
Travis Johnson 9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-23 10:17:38 +08:00
Ricky Xu 97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-22 23:27:25 +00:00
youkaichao eebad39f26
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
youkaichao db100c5cde
[bugfix] fix full graph tests (#10581)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 10:02:14 -08:00
youkaichao 33e0a2540a
[9/N] torch.compile LLM usage (#10552)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 19:13:31 -08:00
youkaichao 7560ae5caf
[8/N] enable cli flag without a space (#10529)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 12:30:42 -08:00
Jee Jee Li 2385b60d83
[Kernel] Register punica ops directly (#10522)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-21 09:18:11 -08:00
Chauncey da7e702c6f
[Bug]: When applying continue_final_message for the OpenAI server, "echo": false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
Isotr0py d5ec121f95
[Model] Expose `dynamic_image_size` as mm_processor_kwargs for InternVL2 models (#10518)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-21 14:20:08 +00:00
Luka Govedič 8b0fe06c89
[torch.compile] Inductor code caching fix (#10273)
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
Pavani Majety 6c1208d083
[Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
youkaichao 388ee3de66
[torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
Guillaume Calmettes c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
youkaichao 0cd3d9717e
[7/N] torch.compile, reduce compilation time (#10460)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 11:20:38 -08:00
Li, Jiang 63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 10:57:39 +00:00
Lucas Wilkinson d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
ElizaWszola b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
youkaichao 803f37eaaa
[6/N] torch.compile rollout to users (#10437)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:09:03 -08:00
Mengqing Cao 8c1fb50705
[Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` (#10358)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
Lucas Wilkinson 96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
Yan Ma 6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-18 11:18:05 -07:00
lkchen c7dec926f6
[VLM] Report multi_modal_placeholders in output (#10407)
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com>
2024-11-18 16:06:16 +08:00
youkaichao 4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 18:02:14 -08:00
电脑星人 361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled (#10388)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 02:10:00 +08:00
Cyrus Leung 32e46e000f
[Frontend] Automatic detection of chat content format from AST (#9919)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-16 13:35:40 +08:00
ElizaWszola 79ee45b428
[Misc] Bump up test_fused_moe tolerance (#10364)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
Cyrus Leung b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code (#10349)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 09:34:17 +00:00
Cyrus Leung 2ac6d0e75b
[Misc] Consolidate pooler config overrides (#10351)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 06:59:00 +00:00
Cyrus Leung b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests (#10184) 2024-11-14 20:23:09 -08:00
Luka Govedič bf2ddc6610
[bugfix] Fix static asymmetric quantization case (#10334)
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
Cyrus Leung 972112d82f
[Bugfix] Fix unable to load some models (#10312)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-14 16:55:54 -08:00
Patrick von Platen 11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing (#10333) 2024-11-15 00:42:49 +00:00
Maximilien de Bayser 4a18fd14ba
Support Roberta embedding models (#9387)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-14 21:23:29 +00:00
youkaichao 29f3ef26a3
[ci][distributed] disable hanging tests (#10317)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-14 00:23:39 -08:00
Mike Depinet f67ce05d0b
[Frontend] Pythonic tool parser (#9859)
Signed-off-by: Mike Depinet <mike@fixie.ai>
2024-11-14 04:14:34 +00:00
Isotr0py 15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-14 10:54:59 +08:00
HoangCongDuc ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>
2024-11-13 09:56:39 -07:00
Cyrus Leung 0b8bb86bf1
[1/N] Initial prototype for multi-modal processor (#10044)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-13 12:39:03 +00:00
Austin Veselka 1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 (#9944)
Signed-off-by: FurtherAI <austin.veselka@lighton.ai>
Co-authored-by: FurtherAI <austin.veselka@lighton.ai>
2024-11-13 08:28:13 +00:00
电脑星人 3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions (#10221)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-13 07:07:22 +00:00
youkaichao 0d4ea3fb5c
[core][distributed] use tcp store directly (#10275)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-12 17:36:08 -08:00
Woosuk Kwon 112fa0bbe5
[V1] Fix CI tests on V1 engine (#10272)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 16:17:20 -08:00
Umesh 8a06428c70
[LoRA] Adds support for bias in LoRA (#5733)
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com>
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com>
2024-11-12 11:08:40 -08:00
sroy745 b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers (#9982)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-12 10:53:57 -08:00
zifeitong 47db6ec831
[Frontend] Add per-request number of cached token stats (#10174) 2024-11-12 16:42:28 +00:00
Jee Jee Li 7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument (#10223)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-12 11:10:15 +08:00
youkaichao eea55cca5b
[1/N] torch.compile user interface design (#10237)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 18:01:06 -08:00
Robert Shaw 6ace6fba2c
[V1] `AsyncLLM` Implementation (#9826)
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-11 23:05:38 +00:00
youkaichao 8a7fe47d32
[misc][distributed] auto port selection and disable tests (#10226)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 11:54:59 -08:00
youkaichao 330e82d34a
[v1][torch.compile] support managing cudagraph buffer (#10203)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:10:27 -08:00
youkaichao e6de9784d2
[core][distributed] add stateless process group (#10216)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 09:02:14 -08:00
Jee Jee Li 36e4acd02a
[LoRA][Kernel] Remove the unused libentry module (#10214)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-11 09:43:23 +00:00
Isotr0py 58170d6503
[Hardware][CPU] Add embedding models support for CPU backend (#10193)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-11 08:54:28 +00:00
youkaichao 73b9083e99
[misc] improve cloudpickle registration and tests (#10202)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 00:10:53 +00:00
Cyrus Leung 51c2e1fcef
[CI/Build] Split up models tests (#10069)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:39:14 -08:00
Krishna Mandal b09895a618
[Frontend][Core] Override HF `config.json` via CLI (#5836)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 16:19:27 +00:00
bnellnm f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests (#10171)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
2024-11-09 08:01:27 +00:00
Isotr0py 47672f38b5
[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing (#10161)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-09 04:02:59 +00:00
Cyrus Leung e0191a95d8
[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` (#10040)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:31:02 +08:00
rasmith 127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-08 19:59:22 -05:00
Luka Govedič 4f93dfe952
[torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Florian Zimmermeister e1b5a82179
Rename vllm.logging to vllm.logging_utils (#10134) 2024-11-08 20:53:24 +00:00
sroy745 f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 (#10136)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-08 15:56:18 +00:00
Cyrus Leung b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests (#5481)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-08 23:30:04 +08:00
Isotr0py 1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs (#10146)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-08 09:56:58 +00:00
Cody Yu 201fc07730
[V1] Prefix caching (take 2) (#9972)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-07 17:34:44 -08:00
litianjian 28b2877d30
Online video support for VLMs (#10020)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 20:25:59 +00:00
Russell Bryant 3be5b26a76
[CI/Build] Add shell script linting using shellcheck (#7925)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 18:17:29 +00:00
Nicolò Lucchesi 9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291)
Signed-off-by: NickLucche <nlucches@redhat.com>
2024-11-07 08:15:14 -08:00
Maximilien de Bayser ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models (#9027)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 07:09:02 -08:00
Flávia Béo aa9078fa03
Adds a method to read the pooling types from a model's files (#9506)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 08:42:40 +00:00
Hanzhi Zhou 6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>
2024-11-06 23:50:47 -08:00
Cyrus Leung db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config (#10104)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 06:00:21 +00:00
Li, Jiang a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 (#9911)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 04:43:08 +00:00
youkaichao 719c1ca468
[core][distributed] add stateless_init_process_group (#10072)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-06 16:42:09 -08:00
Joe Runde d58268c56a
[V1] Make v1 more testable (#9888)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Jee Jee Li a5bba7d234
[Model] Add Idefics3 support (#9767)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
2024-11-06 11:41:17 +00:00
Isotr0py a5fda50a10
[CI/Build] Fix large_gpu_mark reason (#10070)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-06 08:50:37 +00:00
Aaron Pham 21063c11c7
[CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
youkaichao 4be3a45158
[distributed] add function to create ipc buffers directly (#10064)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 22:35:03 -08:00
Travis Johnson 2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer (#10051)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-06 04:28:29 +00:00
Sungjae Lee 0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (#9730)
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-11-06 01:45:45 +00:00
Wallas Henrique 966e31697b
[Bugfix] Fix pickle of input when async output processing is on (#9931)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-11-06 00:39:26 +00:00
youkaichao ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 14:49:20 -08:00
Michael Goin 235366fe2e
[CI] Prune back the number of tests in tests/kernels/* (#9932)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:32 -05:00
Michael Goin 02462465ea
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:23 -05:00
Cyrus Leung bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable (#9604)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-05 10:07:31 +08:00
tomeras91 ac04a97a9f
[Frontend] Add max_tokens prometheus metric (#9881)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-04 22:53:24 +00:00
hissu-hyvarinen 5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs (#9279)
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-11-04 11:37:46 -08:00
Robert Shaw 1c45f4c385
[CI] Basic Integration Test For TPU (#9968)
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
2024-11-04 11:34:26 -08:00
Chauncey ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-04 15:34:57 +00:00
shanshan wang 54597724f4
[Model] Add support for H2OVL-Mississippi models (#9747)
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-11-04 00:15:36 +00:00
youkaichao cea808f325
[3/N] model runner pass the whole config to model (#9958)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao e893795443
[2/N] executor pass the complete config to worker/modelrunner (#9938)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
sroy745 a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559) 2024-11-01 23:22:49 -07:00
Peter Salas 6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
Pavani Majety 598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861) 2024-11-01 12:15:05 -07:00
Travis Johnson 1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2024-11-01 10:33:15 -07:00
Cyrus Leung ba0d892074
[Frontend] Use a proper chat template for VLM2Vec (#9912) 2024-11-01 14:09:07 +00:00
Michael Goin 30a2e80742
[CI/Build] Add Model Tests for PixtralHF (#9813) 2024-11-01 07:55:29 -06:00
Cyrus Leung 06386a64dd
[Frontend] Chat-based Embeddings API (#9759) 2024-11-01 08:13:35 +00:00
Yongzao 2b5bf20988
[torch.compile] Adding torch compile annotations to some models (#9876)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-01 00:25:47 -07:00
youkaichao 566cd27797
[torch.compile] rework test plans (#9866)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 22:20:17 -07:00
youkaichao 96e0c9cbbd
[torch.compile] directly register custom op (#9896)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 21:56:09 -07:00
Joe Runde 031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-01 01:09:46 +00:00
Mor Zusman 9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
sasha0552 55650c83a0
[Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together (#9532)
Signed-off-by: sasha0552 <admin@sasha0552.org>
2024-10-31 11:46:36 -07:00
Alex Brooks 16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL (#9846)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-31 09:10:52 -07:00
Guillaume Calmettes abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint (#9837) 2024-10-30 18:15:56 -07:00
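#9837 tracks the upstream OpenAI API change: the chat completion endpoint now prefers `max_completion_tokens` over the deprecated `max_tokens`. A minimal sketch with a recent `openai` client against a local vLLM server (model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    # Preferred over the deprecated max_tokens on the chat endpoint.
    max_completion_tokens=64,
)
print(chat.choices[0].message.content)
```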
youkaichao 64384bbcdf
[torch.compile] upgrade tests (#9858)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-30 16:34:22 -07:00
Yongzao 00d91c8a2c
[CI/Build] Simplify exception trace in api server tests (#9787)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-30 14:52:05 -07:00
Joe Runde 3b3f1e7436
[Bugfix][core] replace heartbeat with pid check (#9818)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-30 09:34:07 -07:00
Elfie Guo 9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. (#9781) 2024-10-30 09:33:53 -07:00
Alex Brooks cc98f1e079
[CI/Build] VLM Test Consolidation (#9372)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-30 09:32:17 -07:00
youkaichao ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph (#9715)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-29 23:03:49 -07:00
Will Eaton 882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling (#8339)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
2024-10-29 15:07:37 -07:00
Joe Runde 67bdf8e523
[Bugfix][Frontend] Guard against bad token ids (#9634)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-29 14:13:20 -07:00
Michael Goin ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 (#9808) 2024-10-29 11:12:43 -07:00
wangshuai09 622b7ab955
[Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
Zhong Qishuai ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation (#9427)
Signed-off-by: Qishuai <Ferdinandzhong@gmail.com>
2024-10-29 11:49:47 +00:00
litianjian 5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-28 18:04:10 +00:00
youkaichao 32176fee73
[torch.compile] support moe models (#9632)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
wangshuai09 4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
madt2709 34a9941620
[Bugfix] Fix load config when using bools (#9533) 2024-10-27 13:46:41 -04:00
bnellnm 3cb07a36a2
[Misc] Upgrade to pytorch 2.5 (#9588)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-27 09:44:24 +00:00
kakao-kevin-us 6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification (#9704)
Signed-off-by: Kevin-Yang <ykcha9@gmail.com>
Co-authored-by: Kevin-Yang <ykcha9@gmail.com>
2024-10-26 17:53:35 +00:00
Vasiliy Alekseev 07e981fdf4
[Frontend] Bad words sampling parameter (#9717)
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru>
2024-10-26 16:29:38 +00:00
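#9717 adds a `bad_words` sampling parameter that bans the listed strings from being generated. A minimal offline sketch, assuming the parameter is passed through `SamplingParams` as the title suggests (model is a small placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small placeholder model

# bad_words bans the listed strings from appearing in the generation.
params = SamplingParams(max_tokens=32, bad_words=["Paris"])
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```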
Mengqing Cao 5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716) 2024-10-26 10:59:06 +00:00
Kevin H. Luu 9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 (#9676) 2024-10-24 20:59:00 -07:00
Charlie Fu 59449095ab
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Alex Brooks 722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#9650)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-24 10:42:24 -07:00
Cyrus Leung c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 (#9666) 2024-10-25 01:40:40 +08:00
Yongzao d27cfbf791
[torch.compile] Adding torch compile annotations to some models (#9641)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 09:31:42 -07:00
Jee Jee Li 295a061fb3
[Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
Yongzao 8a02cd045a
[torch.compile] Adding torch compile annotations to some models (#9639)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 00:54:57 -07:00
youkaichao 4fdc581f9e
[core] simplify seq group code (#9569)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-10-24 00:16:44 -07:00
Cyrus Leung 836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo (#9422) 2024-10-24 06:12:05 +00:00
Vinay R Damodaran 33bab41060
[Bugfix]: Make chat content text allow type content (#9358)
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2024-10-24 05:05:49 +00:00
Yunfei Chu fc6c274626
[Model] Add Qwen2-Audio model support (#9248)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-23 17:54:22 +00:00
Alex Brooks 150b779081
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-23 17:28:57 +00:00
Alex Brooks 31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs (#9612)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-23 14:05:18 +00:00
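#9612 exposes the Qwen2-VL processor's `min_pixels`/`max_pixels` knobs through `mm_processor_kwargs`, bounding how many vision tokens each image produces. A minimal sketch with illustrative values:

```python
from vllm import LLM

# Bound the image resolution the Qwen2-VL processor tokenizes; the values
# below are illustrative, expressed in the processor's 28x28-pixel patch
# units.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    mm_processor_kwargs={
        "min_pixels": 256 * 28 * 28,
        "max_pixels": 1280 * 28 * 28,
    },
)
# Image requests then go through llm.generate()/llm.chat() as usual.
```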
Isotr0py 3ff57ebfca
[Model] Initialize Florence-2 language backbone support (#9555) 2024-10-23 10:42:47 +00:00
Cyrus Leung 831540cf04
[Model] Support E5-V (#9576) 2024-10-23 11:35:29 +08:00
yulei b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234) 2024-10-22 15:43:03 -07:00
Ronen Schaffer cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017) 2024-10-22 11:11:53 -07:00
Isotr0py bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model (#9528) 2024-10-22 16:01:46 +00:00
Jee Jee Li a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue (#9579) 2024-10-22 11:32:51 +00:00
wangshuai09 3ddbe25502
[Hardware][CPU] using current_platform.is_cpu (#9536) 2024-10-22 00:50:43 -07:00
Wallas Henrique c0292211ce
[CI/Build] Replaced some models on tests for smaller ones (#9570)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-22 04:52:14 +00:00
Cyrus Leung f085995a7b
[CI/Build] Remove unnecessary `fork_new_process` (#9484) 2024-10-21 19:47:29 -07:00
Travis Johnson b729901139
[Bugfix]: serialize config by value for --trust-remote-code (#6751)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-21 19:46:24 -07:00
youkaichao 76a5e13270
[core] move parallel sampling out from vllm core (#9302) 2024-10-22 00:31:44 +00:00
Joe Runde ef7faad1b8
🐛 Fixup more test failures from memory profiling (#9563)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-21 17:10:56 -07:00
Wallas Henrique 711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch (#9023)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-21 14:49:41 -07:00
Dhia Eddine Rhaiem f6b97293aa
[Model] FalconMamba Support (#9325) 2024-10-21 12:50:16 -04:00
Cyrus Leung 696b01af8f
[CI/Build] Split up decoder-only LM tests (#9488)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-20 21:27:50 -07:00
Chen Zhang 4fa3e33349
[Kernel] Support sliding window in flash attention backend (#9403) 2024-10-20 10:57:52 -07:00
Chen Zhang 5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger (#9530) 2024-10-20 00:05:02 +00:00
Yue Zhang c5eea3c8ba
[Frontend] Support simpler image input format (#9478) 2024-10-18 23:17:07 -07:00
Joe Runde 380e18639f
🐛 fix torch memory profiling (#9516)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552 337ed76671
[Bugfix] Fix offline mode when using `mistral_common` (#9457) 2024-10-18 18:12:32 -07:00
Cody Yu d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Cyrus Leung 051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
tomeras91 d2b1bf55ec
[Frontend][Feature] Add jamba tool parser (#9154) 2024-10-18 10:27:48 +00:00
Joe Runde de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Robert Shaw 343f8e0905
Support `BERTModel` (first `encoder-only` embedding model) (#9056)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
bnellnm eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Luka Govedič 0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism (#9300) 2024-10-17 18:36:37 +00:00
Kuntai Du 81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removes block manager v1 as the initial piece of a prefix-caching-centric design: the code path is simplified so that only the v2 block manager (which has much higher performance on prefix caching) is used.
2024-10-17 11:38:15 -05:00
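With v1 deprecated, prefix caching runs entirely on the v2 block manager (#9149 further down added an explicit environment variable to temporarily re-enable v1). A minimal sketch of prefix caching via the offline `LLM` API, using a small placeholder model:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "You are a concise assistant. Answer in one sentence.\n"
prompts = [shared_prefix + q
           for q in ["What is a KV cache?", "What is chunked prefill?"]]

# The second prompt reuses the cached KV blocks of the shared prefix.
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```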
Mor Zusman fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) 2024-10-16 12:12:43 -04:00
Cyrus Leung cee711fdbb
[Core] Rename input data types (#8688) 2024-10-16 10:49:37 +00:00
Cyrus Leung 7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM (#9303) 2024-10-16 14:31:00 +08:00
Cyrus Leung 7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
Chang Su ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-15 15:40:43 -07:00
Michael Goin 22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions (#8909)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-15 15:40:25 -07:00
Nick Hill e9d517f276
[BugFix] Fix chat API continuous usage stats (#9357) 2024-10-14 23:19:48 -07:00
Xiang Xu f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images (#9095) 2024-10-14 15:24:26 -07:00
Lily Liu 89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) (#9298) 2024-10-12 05:13:37 +00:00
sixgod 6cf1167c1a
[Model] Add GLM-4v support (targets vllm==0.6.2) (#9242) 2024-10-11 17:36:13 +00:00
Tyler Michael Smith 7342a7d7f8
[Model] Support Mamba (#6484) 2024-10-11 15:40:06 +00:00
Jee Jee Li 36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format (#9275) 2024-10-11 12:31:21 +00:00
youkaichao cbc2ef5529
[misc] hide best_of from engine (#9261)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
youkaichao e4d652ea3e
[torch.compile] integration with compilation control (#9058) 2024-10-10 12:39:36 -07:00
sroy745 f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
Lucas Wilkinson a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) (#9212) 2024-10-10 14:16:17 +08:00
Michael Goin ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213) 2024-10-10 14:15:40 +08:00
Li, Jiang ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend (#7515) 2024-10-09 10:28:08 -06:00
youkaichao c8627cd41b
[ci][test] use load dummy for testing (#9165) 2024-10-09 00:38:40 -07:00
chenqianfzh 2f4117c38e
support bitsandbytes quantization with more models (#9148) 2024-10-08 19:52:19 -06:00
bnellnm bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch (#9086) 2024-10-08 14:28:12 -07:00
Daniele 9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (#8537) 2024-10-08 09:38:40 -07:00
Alex Brooks 069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser (#9151)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:31:26 +00:00
Alex Brooks a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:12:56 +00:00
Brendan Wong 8c746226c9
[Frontend] API support for beam search for MQLLMEngine (#9117) 2024-10-08 05:51:43 +00:00
youkaichao 04c12f8157
[misc] update utils to support comparing multiple settings (#9140) 2024-10-08 02:51:49 +00:00
Isotr0py f19da64871
[Core] Refactor GGUF parameters packing and forwarding (#8859) 2024-10-07 10:01:46 +00:00
Isotr0py 4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (#9089) 2024-10-07 06:50:35 +00:00
Cyrus Leung 8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models (#9108) 2024-10-07 06:10:35 +00:00
youkaichao 18b296fdb2
[core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
Varun Sundar Rabindranath cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-06 12:48:11 -07:00
Cyrus Leung b22b798471
[Model] PP support for embedding models and update docs (#9090)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-10-06 16:35:27 +08:00
Brendan Wong 168cab6bbf
[Frontend] API support for beam search (#9087)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Andy Dai 5df1834895
[Bugfix] Fix order of arguments matters in config.yaml (#8960) 2024-10-05 17:35:11 +00:00
Chen Zhang cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface (#9083) 2024-10-05 21:56:40 +08:00
Xin Yang 15986f598c
[Model] Support Gemma2 embedding model (#9004) 2024-10-05 06:57:05 +00:00
ElizaWszola 05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
Flávia Béo 0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation (#8999)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-10-04 18:31:40 +00:00
Roger Wang 26aa325f4f
[Core][VLM] Test registration for OOT multimodal models (#8717)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:38:25 -07:00
Prashant Gupta 9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral (#9008)
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
2024-10-04 16:24:40 +00:00
Cyrus Leung 0e36fd4909
[Misc] Move registry to its own file (#9064) 2024-10-04 10:01:37 +00:00
Murali Andoorveedu 0f6d7a9a34
[Models] Add remaining model PP support (#7168)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:56:58 +08:00
代君 3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (#8405) 2024-10-04 10:36:39 +08:00
sroy745 91add85ec4
Fix failing spec decode test (#9054) 2024-10-03 23:07:29 +00:00
youkaichao 9aaf14c62e
[misc] add forward context for attention (#9029) 2024-10-03 12:09:42 -07:00
xendo 63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config (#8959)
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>
2024-10-03 18:02:07 +00:00
Guillaume Calmettes 83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (#9020) 2024-10-03 16:44:52 +08:00
Shawn Tan 19f0d25796
[Model] Adding Granite MoE. (#8206)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-03 09:33:57 +08:00
afeldman-nm 563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
2024-10-02 07:52:20 +00:00
Lily Liu 1570203864
[Spec Decode] (1/2) Remove batch expansion (#8839) 2024-10-01 16:04:42 -07:00
Isotr0py bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference (#8986) 2024-10-01 17:51:41 +08:00
Joe Runde 062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params (#8252)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-01 09:34:25 +08:00
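After #8252, guided decoding is configured on `SamplingParams` rather than as a separate request-level field. A minimal offline sketch using the `GuidedDecodingParams` container this PR introduces (model name is a placeholder):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Guided decoding now rides along on SamplingParams.
params = SamplingParams(
    max_tokens=8,
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
outputs = llm.generate(["Sentiment of 'vLLM is fast': "], params)
print(outputs[0].outputs[0].text)
```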
Lily Liu bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (#8975) 2024-10-01 00:51:40 +00:00
Mor Zusman f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) 2024-09-29 17:35:58 -04:00
danieljannai21 6c9ba48fde
[Frontend] Added support for HF's new `continue_final_message` parameter (#8942) 2024-09-29 17:59:47 +00:00
Jee Jee Li 3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199) 2024-09-29 06:59:45 +00:00
Cyrus Leung 26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory (#8925) 2024-09-29 02:50:51 +00:00
ElizaWszola d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
sroy745 5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots (#8824) 2024-09-29 09:17:45 +08:00
Cyrus Leung e1a3f5e831
[CI/Build] Update models tests & examples (#8874)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-28 09:54:35 -07:00
Varun Sundar Rabindranath 19d02ff938
[Bugfix] Fix PP for Multi-Step (#8887) 2024-09-28 08:52:46 -07:00
Varun Sundar Rabindranath c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
Luka Govedič 172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) 2024-09-27 14:25:10 -04:00
youkaichao a9b15c606f
[torch.compile] use empty tensor instead of None for profiling (#8875) 2024-09-27 08:11:32 -07:00
Isotr0py 6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (#8892) 2024-09-27 01:15:58 -07:00
Cyrus Leung 3b00b9c26c
[Core] Rename `PromptInputs` and `inputs` (#8876) 2024-09-26 20:35:15 -07:00
Maximilien de Bayser 344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use (#8343)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-26 17:01:42 -07:00
Nick Hill 4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade (#8829) 2024-09-26 16:46:43 -07:00
Chirag Jain ee2da3e9ef
fix validation: Only set tool_choice `auto` if at least one tool is provided (#8568) 2024-09-26 16:23:17 -07:00
Chen Zhang 770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model (#8811)
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Simon Mo 4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) 2024-09-25 10:36:26 -07:00
Michael Goin 873edda6cf
[Misc] Support FP8 MoE for compressed-tensors (#8588) 2024-09-25 09:43:36 -07:00
Cyrus Leung 28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
bnellnm 300da09177
[Kernel] Fullgraph and opcheck tests (#8479) 2024-09-25 08:35:52 -06:00
sroy745 fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752) 2024-09-24 21:26:36 -07:00
sroy745 ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py (#8780) 2024-09-24 21:26:18 -07:00
Joe Runde 6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks (#8583) 2024-09-24 20:37:38 -07:00
Travis Johnson 01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Jee Jee Li 13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 (#8768) 2024-09-24 17:08:55 -07:00
Andy 2529d09b5a
[Frontend] Batch inference for llm.chat() API (#8648)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-09-24 09:44:11 -07:00
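#8648 lets `llm.chat()` take a batch: a list of conversations, each a list of OpenAI-style messages. A minimal sketch (model name is a placeholder):

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# A batch is a list of conversations; each conversation is a list of
# OpenAI-style messages.
conversations = [
    [{"role": "user", "content": "Name one prime number."}],
    [{"role": "user", "content": "Name one even number."}],
]
outputs = llm.chat(conversations)
for out in outputs:
    print(out.outputs[0].text)
```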
Alex Brooks 8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-24 07:36:46 +00:00
Simon Mo 3185fb0cca
Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (#8750) 2024-09-24 05:45:20 +00:00
youkaichao 0250dd68c5
re-implement beam search on top of vllm core (#8726)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-09-23 22:08:12 -07:00
sroy745 88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728) 2024-09-24 04:43:13 +00:00
Alexander Matveev 1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335) 2024-09-23 15:38:04 -07:00
jiqing-feng 5f7bb58427
Fix typical acceptance sampler with correct recovered token ids (#8562) 2024-09-23 12:32:27 -07:00
Jee Jee Li 9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels (#7585) 2024-09-23 18:57:42 +00:00
Lucas Wilkinson 86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Daniele ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ (#4738)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-23 09:44:26 -07:00
Yanyi Liu a79e522984
[Model] Support pp for qwen2-vl (#8696) 2024-09-23 13:46:59 +00:00
Alex Brooks 9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-23 07:44:48 +00:00
Lily Liu c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. (#8701) 2024-09-22 12:34:14 -07:00
litianjian 5b59532760
[Model][VLM] Add LLaVA-Onevision model support (#8486)
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-22 10:51:44 -07:00
youkaichao 0faab90eb0
[beam search] add output for manually checking the correctness (#8684) 2024-09-20 19:55:33 -07:00
Cyrus Leung 0455c46ed4
[Core] Factor out common code in `SequenceData` and `Sequence` (#8675) 2024-09-21 02:30:39 +00:00
Cyrus Leung 0057894ef7
[Core] Rename `PromptInputs` and `inputs` (#8673) 2024-09-20 19:00:54 -07:00
Patrick von Platen b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640) 2024-09-20 14:33:03 -07:00
Jiaxin Shan 260d40b5ea
[Core] Support Lora lineage and base model metadata management (#6315) 2024-09-20 06:20:56 +00:00
Charlie Fu 9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
sroy745 3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) 2024-09-19 02:24:15 +00:00
Tyler Michael Smith db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
afeldman-nm a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step (#8199) 2024-09-18 08:38:43 -07:00
Alexander Matveev 7c7714d856
[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham 9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung 6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Tyler Michael Smith 8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
youkaichao fa0c114fad
[doc] improve installation doc (#8550)
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
2024-09-17 16:24:06 -07:00
Patrick von Platen a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
chenqianfzh 9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
sroy745 1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) 2024-09-17 07:35:01 -07:00
youkaichao 99aa4eddaf
[torch.compile] register allreduce operations as custom ops (#8526) 2024-09-16 22:57:57 -07:00
Alex Brooks 1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Simon Mo 546034b466
[refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Luka Govedič 5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
Nick Hill acd5511b6d
[BugFix] Fix clean shutdown issues (#8492) 2024-09-16 09:33:46 -07:00
ElizaWszola a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py fc990f9795
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
youkaichao 47790f3e32
[torch.compile] add a flag to disable custom op (#8488) 2024-09-14 13:07:16 -07:00
youkaichao a36e070dad
[torch.compile] fix functionalization (#8480) 2024-09-14 09:46:04 -07:00
ywfang 8a0cf1ddc3
[Model] support minicpm3 (#8297)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-14 14:50:26 +00:00
Charlie Fu 1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Nick Hill 18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming (#8468) 2024-09-13 11:31:12 -07:00
Cyrus Leung a84e598e21
[CI/Build] Reorganize models tests (#7820) 2024-09-13 10:20:06 -07:00
youkaichao a2469127db
[misc][ci] fix quant test (#8449) 2024-09-13 17:20:14 +08:00
Isotr0py 9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node (#8437) 2024-09-13 06:35:20 +00:00
Alexander Matveev 6821020109
[Bugfix] Fix async log stats (#8417) 2024-09-12 20:48:59 -07:00
Cyrus Leung 8427550488
[CI/Build] Update pixtral tests to use JSON (#8436) 2024-09-13 03:47:52 +00:00
shangmingc 40c396533d
[Bugfix] Mapping physical device indices for e2e test utils (#8290) 2024-09-13 11:06:28 +08:00
Cyrus Leung 5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class (#7329) 2024-09-13 02:56:13 +00:00
Patrick von Platen d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs (#8415) 2024-09-12 15:21:51 -07:00
Roger Wang b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 (#8428) 2024-09-12 15:05:35 -07:00
Nick Hill 551ce01078
[Core] Add engine option to return only deltas or final output (#7381) 2024-09-12 12:02:00 -07:00
William Lin a6c0f3658d
[multi-step] add flashinfer backend (#7928) 2024-09-12 11:16:22 -07:00
Joe Runde f2e263b801
[Bugfix] Offline mode fix (#8376)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-12 11:11:57 -07:00
Alex Brooks c6202daeed
[Model] Support multiple images for qwen-vl (#8247)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:54 -07:00
Isotr0py e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:35 -07:00
youkaichao 7de49aa86c
[torch.compile] hide slicing under custom op for inductor (#8384) 2024-09-12 00:11:55 -07:00
youkaichao f842a7aff1
[misc] remove engine_use_ray (#8126) 2024-09-11 18:23:36 -07:00
Cody Yu a65cb16067
[MISC] Dump model runner inputs when crashing (#8305) 2024-09-12 01:12:25 +00:00
Patrick von Platen d394787e52
Pixtral (#8377)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-11 14:41:55 -07:00
Lily Liu 775f00f81e
[Speculative Decoding] Test refactor (#8317)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
bnellnm 73202dbe77
[Kernel][Misc] register ops to prevent graph breaks (#6917)
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Li, Jiang 0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) 2024-09-11 09:46:46 -07:00
Yang Fan 3b7fea770f
[Model][VLM] Add Qwen2-VL model support (#7905)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Pooya Davoodi cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) 2024-09-11 05:30:11 +00:00
Yangshen⚡Deng 6a512a00df
[model] Support for Llava-Next-Video model (#7559)
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) 2024-09-11 00:38:40 -04:00
Isotr0py 1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299) 2024-09-11 10:11:01 +08:00
Cyrus Leung 8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer (#8314) 2024-09-10 16:49:11 +00:00
Dipika Sikka 6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00
Kyle Sayers c7cb5c3335
[Misc] GPTQ Activation Ordering (#8135) 2024-09-09 16:27:26 -04:00
Kyle Mistele 08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272) 2024-09-09 10:45:11 -04:00
Joe Runde cfe712bf1a
[CI/Build] Use python 3.12 in cuda image (#8133)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-07 13:03:16 -07:00
Isotr0py e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201) 2024-09-07 16:38:23 +08:00
Cyrus Leung 9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test (#8258) 2024-09-07 08:02:39 +00:00
youkaichao ce2702a923
[tpu][misc] fix typo (#8260) 2024-09-06 22:40:46 -07:00
Cyrus Leung 2f707fcb35
[Model] Multi-input support for LLaVA (#8238) 2024-09-07 02:57:24 +00:00
Patrick von Platen 29f49cd6e3
[Model] Allow loading from original Mistral format (#8168)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Alexey Kondratiev(AMD) 1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests (#8203) 2024-09-06 11:51:03 -07:00
afeldman-nm e5cab71531
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191) 2024-09-06 09:01:14 -07:00
Jiaxin Shan db3bf7c991
[Core] Support load and unload LoRA in api server (#6566)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Alex Brooks 9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com 8685ba1a1e
Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) (#7860) 2024-09-05 11:33:37 +00:00
Elfie Guo e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) 2024-09-05 05:12:26 +00:00
Kyle Mistele e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649)
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
Woosuk Kwon 561d6f8077
[CI] Change test input in Gemma LoRA test (#8163) 2024-09-04 13:05:50 -07:00
alexeykondrat d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Cody Yu 2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests (#8131) 2024-09-04 18:53:25 +00:00
Cyrus Leung 855c262a6b
[Frontend] Multimodal support in offline chat (#8098) 2024-09-04 05:22:17 +00:00
Peter Salas 2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks (#7963) 2024-09-04 04:38:21 +00:00
Dipika Sikka 2188a60c7e
[Misc] Update `GPTQ` to use `vLLMParameters` (#7976) 2024-09-03 17:21:44 -04:00
Alexander Matveev 6d646d08a2
[Core] Optimize Async + Multi-step (#8050) 2024-09-03 18:50:29 +00:00
wang.yuqi 6e36f4fa6c
improve chunked prefill performance
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
2024-09-02 14:20:12 -07:00
Lily Liu e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) 2024-09-01 21:23:29 -07:00
Shawn Tan f8d60145b4
[Model] Add Granite model (#7436)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
Roger Wang 5b86b19954
[Misc] Optional installation of audio related packages (#8063) 2024-09-01 14:46:57 -07:00
Roger Wang 5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Pavani Majety 622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) 2024-08-30 22:18:50 -07:00
Wenxiang 1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
Kaunil Dhruv 058344f89a
[Frontend] config-cli-args (#7737)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
Jungho Christopher Cho f97be32d1d
[VLM][Model] TP support for ViTs (#7186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
afeldman-nm 428dd1445e
[Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
Cyrus Leung 4abed65c58
[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998) 2024-08-29 17:49:04 -07:00
chenqianfzh 4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
Pavani Majety 6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev 3f60f2244e
[Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
Jonas M. Kübler f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) 2024-08-28 22:18:13 -07:00
youkaichao ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00
Peter Salas 74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974) 2024-08-29 03:24:31 +00:00
youkaichao a7f65c2be9
[torch.compile] remove reset (#7975) 2024-08-28 17:32:26 -07:00
youkaichao ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
Mor Zusman fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
rasmith e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
Pavani Majety b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cody Yu e3580537a4
[Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
Cyrus Leung 51f86bf487
[mypy][CI/Build] Fix mypy errors (#7929) 2024-08-27 23:47:44 -07:00
Peter Salas fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) 2024-08-28 01:53:56 +00:00
zifeitong 5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference (#7230) 2024-08-28 07:09:02 +08:00
Dipika Sikka fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Isotr0py 9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897) 2024-08-27 15:28:30 +00:00
Patrick von Platen 6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
youkaichao 64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
Nick Hill 39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
Megha Agarwal 2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka 665304092d
[Misc] Update `qqq` to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
Cody Yu 2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cyrus Leung 029c71de11
[CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` (#7836) 2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill 1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851) 2024-08-25 15:45:14 -07:00
Isotr0py 2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test (#7835) 2024-08-25 15:53:09 +00:00
Isotr0py 8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) 2024-08-25 11:51:20 +00:00
zifeitong 80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) 2024-08-24 18:16:24 -07:00
youkaichao aab0fcdb63
[ci][test] fix RemoteOpenAIServer (#7838) 2024-08-24 17:31:28 +00:00
youkaichao ea9fa160e3
[ci][test] exclude model download time in server start time (#7834) 2024-08-24 01:03:27 -07:00
youkaichao 7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
Pooya Davoodi 8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Alexander Matveev 9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Dipika Sikka f1df5dbfd6
[Misc] Update `marlin` to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
Maximilien de Bayser e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
SangBin Cho c01a6cb231
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Travis Johnson cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka 955b5191c9
[Misc] update fp8 to use `vLLMParameter` (#7437) 2024-08-22 08:36:18 -04:00
Abhinav Goyal a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Michael Goin aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
Joe Runde cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) 2024-08-22 09:36:24 +08:00
Luka Govedič 7937009a7e
[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka 8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas 1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Robert Shaw 970dfdc01d
[Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
Robert Shaw f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU 53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
Nick Hill c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations (#7698) 2024-08-21 11:45:55 -04:00
Cyrus Leung baaedfdb2d
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py 12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model (#7187) 2024-08-20 23:18:57 -07:00
youkaichao 9e51b6a626
[ci][test] adjust max wait time for cpu offloading test (#7709) 2024-08-20 17:12:44 -07:00
Antoni Baum 3b682179dd
[Core] Add `AttentionState` abstraction (#7663) 2024-08-20 18:50:45 +00:00
Isotr0py aae6927be0
[VLM][Model] Add test for InternViT vision encoder (#7409) 2024-08-20 23:10:20 +08:00
Lucas Wilkinson 5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Abhinav Goyal 312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
Isotr0py 7601cb044d
[Core] Support tensor parallelism for GGUF quantization (#7520) 2024-08-19 17:30:14 -04:00
William Lin 47b65a5508
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu 3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
Peng Guanwen f710fb5265
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109) 2024-08-18 17:57:20 -07:00
Alex Brooks 40e1360bb6
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw e3b318216d
[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang bbf55c4805
[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
youkaichao 832163b875
[ci][test] allow longer wait time for api server (#7629) 2024-08-17 11:26:38 -07:00
youkaichao 5bf45db7df
[ci][test] fix engine/logger test (#7621) 2024-08-16 23:00:59 -07:00
SangBin Cho 4706eb628e
[aDAG] Unflake aDAG + PP tests (#7600) 2024-08-16 20:49:30 -07:00
Mahesh Keralapura 93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
Mor Zusman 7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Charlie Fu e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) 2024-08-16 10:06:30 -07:00
youkaichao 54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
jon-chuang 50b8d08dbd
[Misc/Testing] Use `torch.testing.assert_close` (#7324) 2024-08-16 04:24:04 +00:00
Michael Goin e165528778
[CI] Move quantization cpu offload tests out of fastcheck (#7574) 2024-08-15 21:16:20 -07:00
nunjunj 3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
youkaichao 4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail (#7572) 2024-08-15 19:39:04 -07:00
Grant Pinkert f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) 2024-08-16 02:38:08 +00:00
shangmingc b67ae00cdb
[Misc] Add quantization config support for speculative model. (#7343) 2024-08-15 19:34:28 -07:00
Kyle Sayers f55a9aea45
[Misc] Revert `compressed-tensors` code reuse (#7521) 2024-08-14 15:07:37 -07:00
Cyrus Leung 3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) 2024-08-14 17:55:42 +00:00
Wallas Henrique 70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao ea49e6a3c8
[misc][ci] fix cpu test with plugins (#7489) 2024-08-13 19:27:46 -07:00
Jee Jee Li 97992802f3
[CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
youkaichao 16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
Kyle Sayers 373538f973
[Misc] `compressed-tensors` code reuse (#7277) 2024-08-13 19:05:15 -04:00
youkaichao 33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Dipika Sikka b1e5afc3e7
[Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` (#7422) 2024-08-13 17:08:20 -04:00
Dipika Sikka fb377d7e74
[Misc] Update `gptq_marlin` to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Peter Salas 00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Cyrus Leung 7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Andrew Wang 97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Cyrus Leung 9ba85bc152
[mypy] Misc. typing improvements (#7417) 2024-08-13 09:20:20 +08:00
Rui Qiao 198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
jon-chuang a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Roger Wang e6e42e4b17
[Core][VLM] Support image embeddings as input (#6613) 2024-08-12 16:16:06 +08:00
Isotr0py 4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) 2024-08-10 16:19:33 +00:00
Cade Daniel baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Mahesh Keralapura 933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089) 2024-08-09 13:55:13 -07:00
Pooya Davoodi 249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Nick Hill b4e9528f95
[Core] Streamline stream termination in `AsyncLLMEngine` (#7336) 2024-08-09 07:06:36 +00:00
William Lin 57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) 2024-08-09 05:42:45 +00:00
Travis Johnson 99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Cyrus Leung 7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Zach Zheng 782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) 2024-08-08 10:43:30 -07:00
Joe Runde 21b9c49aa3
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič 5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests (#7305) 2024-08-08 12:16:13 -04:00
Michael Goin 5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) 2024-08-07 11:23:12 -07:00
Maximilien de Bayser fde47d3bc2
[BugFix] Fix frontend multiprocessing hang (#7217)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Isotr0py b764547616
[Bugfix] Fix input processor for InternVL2 model (#7164)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Dipika Sikka 0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874) 2024-08-07 09:17:58 -07:00
Cyrus Leung 66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure (#7238)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Nick Hill 9a3f49ae07
[BugFix] Overhaul async request cancellation (#7111) 2024-08-07 13:21:41 +08:00
Michael Goin f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225) 2024-08-06 18:34:26 -07:00
afeldman-nm fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič 8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu 5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests (#7183) 2024-08-06 10:40:32 -07:00
Cyrus Leung 1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li 9118217f58
[LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Isotr0py 360bd67cf0
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
youkaichao dfb1a15dcb
[ci][frontend] deduplicate tests (#7101) 2024-08-05 15:59:22 -07:00
Cade Daniel 82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) 2024-08-05 08:46:44 +00:00
Alphi 7b86e7c9cd
[Model] Add multi-image support for minicpmv (#7122)
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-05 09:23:17 +08:00
Yihuan Bu 654bc5ca49
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
youkaichao 44dcb52e39
[ci][test] finalize fork_new_process_for_each_test (#7114) 2024-08-03 10:44:53 -07:00
Jee Jee Li 99d7cabd7b
[LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Zach Zheng fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
youkaichao a0d164567c
[ci][distributed] disable ray dag tests (#7099) 2024-08-02 22:32:04 -07:00
youkaichao 04e5583425
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
youkaichao 69ea15e5cc
[ci][distributed] shorten wait time if server hangs (#7098) 2024-08-02 21:05:16 -07:00
Robert Shaw ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq` (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Rui Qiao 05308891e2
[Core] Pipeline parallel with Ray ADAG (#6837)
Support pipeline-parallelism with Ray accelerated DAG.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
youkaichao 806949514a
[ci] set timeout for test_oot_registration.py (#7082) 2024-08-02 10:03:24 -07:00
youkaichao 252357793d
[ci][distributed] try to fix pp test (#7054) 2024-08-01 22:03:12 -07:00
Woosuk Kwon 805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
Michael Goin fb3db61688
[CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
youkaichao c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
zifeitong 3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
Jee Jee Li 7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Michael Goin 460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization (#6960) 2024-07-31 12:47:46 -07:00
Cody Yu bd70013407
[MISC] Introduce pipeline parallelism partition strategies (#6920)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Cyrus Leung daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) 2024-07-31 23:46:17 +08:00
HandH1998 6512937de1
Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
Cyrus Leung f230cc2ca6
[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836) 2024-07-31 10:38:45 +08:00
Tyler Michael Smith d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) 2024-07-30 16:37:01 -04:00
Sanger Steel 052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing (#6881) 2024-07-30 11:48:50 -07:00
Nick Hill 5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
Varun Sundar Rabindranath af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Nick Hill 9f69d8245a
[Frontend] New `allowed_token_ids` decoding request parameter (#6753) 2024-07-29 23:37:27 +00:00
Thomas Parnell 9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Peng Guanwen db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None (#6532) 2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath 766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Isotr0py 7cbd9ec7a9
[Model] Initialize support for InternVL2 series models (#6514)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
Alexander Matveev 75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Cyrus Leung 1ad86acf17
[Model] Initial support for BLIP-2 (#5920)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Joe 14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Sanger Steel 969d032265
[Bugfix]: Fix Tensorizer test failures (#6835) 2024-07-26 20:02:25 -07:00
youkaichao 443c7cf4cf
[ci][distributed] fix flaky tests (#6806) 2024-07-25 17:44:09 -07:00
Michael Goin 65b1f121c8
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints (#6761) 2024-07-25 09:46:15 -07:00
Chang Su 316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) 2024-07-24 22:48:07 -07:00
Cody Yu 309aaef825
[Bugfix] Fix decode tokens w. CUDA graph (#6757) 2024-07-24 22:33:56 -07:00
Alphi 9e169a4c61
[Model] Adding support for MiniCPM-V (#4087) 2024-07-24 20:59:30 -07:00
Evan Z. Liu 5689e256ba
[Frontend] Represent tokens with identifiable strings (#6626) 2024-07-25 09:51:00 +08:00
Michael Goin 421e218b37
[Bugfix] Bump transformers to 4.43.2 (#6752) 2024-07-24 13:22:16 -07:00
Antoni Baum 0e63494cf3
Add fp8 support to `reshape_and_cache_flash` (#6667) 2024-07-24 18:36:52 +00:00
Nick Hill 2cf0df3381
[Bugfix] Fix speculative decode seeded test (#6743) 2024-07-24 08:58:31 -07:00
Nick Hill c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) 2024-07-24 07:34:22 +00:00
William Lin 5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP (#6708) 2024-07-24 01:49:44 +00:00
dongmao zhang 87525fab92
[bitsandbytes]: support reading bnb pre-quantized models (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Thomas Parnell 2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-23 23:05:05 +00:00
Michael Goin 01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) 2024-07-23 22:45:12 +00:00
Roger Wang 1bedf210e3
Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon (#6690) 2024-07-23 13:47:48 -07:00
Yehoshua Cohen 58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) 2024-07-23 11:41:55 -07:00
Roger Wang 22fa2e35cb
[VLM][Model] Support image input for Chameleon (#6633) 2024-07-22 23:50:48 -07:00
Cyrus Leung 97234be0ec
[Misc] Manage HTTP connections in one place (#6600) 2024-07-22 21:32:02 -07:00
Michael Goin 9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
Jiaxin Shan 42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Tyler Michael Smith fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) 2024-07-22 14:08:30 -06:00
Cyrus Leung 739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Alexander Matveev 396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
sroy745 14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
Cyrus Leung d7f4178dd9
[Frontend] Move chat utils (#6602)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-21 08:38:17 +08:00
Matt Wong 06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 2024-07-20 09:39:07 -07:00
Cyrus Leung 9042d68362
[Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
Antoni Baum 7bd82002ae
[Core] Allow specifying custom Executor (#6557) 2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath 2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593)
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Robert Shaw 4cc24f01b1
[ Kernel ] Enable Dynamic Per Token `fp8` (#6547) 2024-07-19 23:08:15 +00:00
Thomas Parnell f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) 2024-07-19 14:01:03 -07:00
Antoni Baum 9ed82e7074
[Misc] Small perf improvements (#6520) 2024-07-19 12:10:56 -07:00
Thomas Parnell a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply (#6327)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) 2024-07-19 06:01:09 -07:00
Cyrus Leung 6366efc67b
[Bugfix][Frontend] Fix missing `/metrics` endpoint (#6463) 2024-07-19 03:55:13 +00:00
Thomas Parnell d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. (#6034)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
Nick Hill b5672a112c
[Core] Multiprocessing Pipeline Parallel support (#6130)
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
youkaichao f53b8f0d05
[ci][test] add correctness test for cpu offloading (#6549) 2024-07-18 23:41:06 +00:00
Nick Hill e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-18 15:13:30 +08:00
Cody Yu b5af8c223c
[Model] Pipeline parallel support for Mixtral (#6516) 2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Antoni Baum 5f0b9933e6
[Bugfix] Fix Ray Metrics API usage (#6354) 2024-07-17 19:40:10 +00:00
Cody Yu 2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 (#6164) 2024-07-17 09:37:16 -07:00
Murali Andoorveedu 5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP (#6495)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-17 08:25:10 +00:00
Cyrus Leung 5bf35a91e4
[Doc][CI/Build] Update docs and tests to use `vllm serve` (#6431) 2024-07-17 07:43:21 +00:00
youkaichao 7f62077af5
[misc][distributed] improve tests (#6488) 2024-07-16 17:35:52 -07:00
youkaichao 09c2eb85dd
[ci][distributed] add pipeline parallel correctness test (#6410) 2024-07-16 15:44:22 -07:00
Michael Goin 978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) 2024-07-16 15:31:32 -07:00
Cody Yu 160e1d8c99
[Misc] Log spec decode metrics (#6454) 2024-07-16 20:37:10 +00:00
Cyrus Leung 38ef94888a
[CI/Build] Remove "boardwalk" image asset (#6460) 2024-07-16 08:59:36 -07:00
sasha0552 7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint (#5923) 2024-07-16 20:18:09 +08:00
Cyrus Leung d97011512e
[CI/Build] vLLM cache directory for images (#6444) 2024-07-15 23:12:25 -07:00
Joe d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) 2024-07-15 18:54:15 -07:00
Mor Zusman 9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425)
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-07-16 01:32:55 +00:00
Thomas Parnell 4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-15 13:14:49 -04:00
youkaichao 69672f116c
[core][distributed] simplify code to support pipeline parallel (#6406) 2024-07-14 21:20:51 -07:00
zifeitong b47008b4d2
[BugFix] BatchResponseData body should be optional (#6345)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-15 04:06:09 +00:00
Ethan Xu dbfe254eda
[Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
Isotr0py 540c0368b1
[Model] Initialize Fuyu-8B support (#3924)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-14 05:27:14 +00:00
youkaichao 41708e5034
[ci] try to add multi-node tests (#6280)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-12 21:51:48 -07:00
Michael Goin 111fc6e7ec
[Misc] Add generated git commit hash as `vllm.__commit__` (#6386) 2024-07-12 22:52:15 +00:00
Yihuan Bu b039cbbce3
[Misc] add fixture to guided processor tests (#6341) 2024-07-12 09:55:39 -07:00
Cyrus Leung 024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma (#6367) 2024-07-12 08:22:18 -07:00
Robert Shaw aea19f0989
[ Misc ] Support Models With Bias in `compressed-tensors` integration (#6356) 2024-07-12 11:11:29 -04:00
Hongxia Yang b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352) 2024-07-11 21:30:46 -07:00
Lily Liu d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351) 2024-07-12 01:32:06 +00:00
Robert Shaw 7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization (#6223)
Co-authored-by: Zifei Tong <zifeitong@gmail.com>
2024-07-11 22:02:29 +00:00
xwjiang2010 1df43de9bb
[bug fix] Fix llava next feature size calculation. (#6339)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-07-11 17:21:10 +00:00
Robert Shaw b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
sroy745 ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) 2024-07-10 16:02:47 -07:00
youkaichao da78caecfa
[core][distributed] zmq fallback for broadcasting large objects (#6183)
[core][distributed] add zmq fallback for broadcasting large objects (#6183)
2024-07-09 18:49:11 -07:00
Abhinav Goyal 2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) 2024-07-09 18:34:02 -07:00
Swapnil Parekh 4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
tomeras91 ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding (#6214) 2024-07-08 11:25:51 -07:00
afeldman-nm 543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Robert Shaw abfe705a02
[ Misc ] Support Fp8 via `llm-compressor` (#6110)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Roger Wang 6206dcb29e
[Model] Add PaliGemma (#5189)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
jvlunteren f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API (#5742) 2024-07-05 10:37:09 -07:00
Lily Liu 69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Cyrus Leung 3dd507083f
[CI/Build] Cleanup VLM tests (#6107) 2024-07-03 18:58:18 -07:00
Robert Shaw 62963d129e
[ Misc ] Clean Up `CompressedTensorsW8A8` (#6113) 2024-07-03 22:50:08 +00:00
xwjiang2010 d9e98f42e4
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
Michael Goin 47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) 2024-07-03 17:38:00 +00:00
SangBin Cho d18bab3587
[CI] Fix base url doesn't strip "/" (#6087) 2024-07-02 21:31:25 -07:00
Cyrus Leung 9831aec49f
[Core] Dynamic image size support for VLMs (#5276)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
youkaichao 482045ee77
[hardware][misc] introduce platform abstraction (#6080) 2024-07-02 20:12:22 -07:00
Mor Zusman 9d6a8daa87
[Model] Jamba support (#4115)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Qubitium-ModelCloud ee93f4f92a
[CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Robert Shaw 7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-02 21:54:35 +00:00
Robert Shaw 4d26d806e1
Update conftest.py (#6076) 2024-07-02 20:14:22 +00:00
Murali Andoorveedu c5832d2ae9
[Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Sirej Dua 15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010 98d6682cd1
[VLM] Remove `image_input_type` from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
Alexander Matveev 3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 2024-07-01 20:10:37 -07:00
Avshalom Manevich 12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) 2024-07-01 21:08:29 +00:00
sroy745 80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 2024-07-01 00:33:05 -07:00
youkaichao 614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) 2024-06-30 20:07:34 -07:00
Robert Shaw af9ad46fca
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
SangBin Cho f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909)
Co-authored-by: sang <sangcho@anyscale.com>
2024-06-30 17:11:15 +00:00
llmpros c6c240aa0a
[Frontend]: Support base64 embedding (#5935)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-30 23:53:00 +08:00
youkaichao 2be6955a3f
[ci][distributed] fix device count call
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991)
2024-06-30 08:06:13 +00:00
Cyrus Leung 9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests (#5966) 2024-06-30 12:58:49 +08:00
Cyrus Leung cff6a1fec1
[CI/Build] Reuse code for checking output consistency (#5988) 2024-06-30 11:44:25 +08:00
Matt Wong 9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949) 2024-06-29 12:47:58 -07:00
Cyrus Leung 99397da534
[CI/Build] Add TP test for vision models (#5892) 2024-06-29 15:45:54 +00:00
Robert Shaw 8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cyrus Leung 51e971d39e
[Bugfix] Support `eos_token_id` from `config.json` (#5954) 2024-06-29 11:19:02 +00:00
Woosuk Kwon 580353da93
[Bugfix] Fix precisions in Gemma 1 (#5913) 2024-06-29 03:10:21 +00:00
Joe Runde ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
William Lin 906a19cdb0
[Misc] Extend vLLM Metrics logging API (#5925)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-29 10:36:06 +08:00
Lily Liu 7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
Co-authored-by: bong-furiosa <bongwon.jang@furiosa.ai>
2024-06-28 15:28:49 -07:00
Tyler Michael Smith 6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) 2024-06-28 17:10:34 +00:00
Cody Yu b2c620230a
[Spec Decode] Introduce DraftModelRunner (#5799) 2024-06-28 09:17:51 -07:00
xwjiang2010 b90d8cd832
[Distributed] Make it clear that % should not be in tensor dict keys. (#5927)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-06-28 15:20:22 +00:00
Cyrus Leung 3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests (#5904) 2024-06-28 07:59:18 -07:00
Ilya Lavrenov 57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Cyrus Leung 5cbe8d155c
[Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Roger Wang 736ed38849
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922) 2024-06-27 11:43:04 -07:00
Cyrus Leung e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests (#5526) 2024-06-27 12:43:17 +00:00
xwjiang2010 d12af207d2
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (#5880)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-06-27 15:15:24 +08:00
sasha0552 c54269d967
[Frontend] Add tokenize/detokenize endpoints (#5054) 2024-06-26 16:54:22 +00:00
Luka Govedič 5bfd1bbc98
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Cyrus Leung 6984c02a27
[CI/Build] Refactor image test assets (#5821) 2024-06-26 01:02:34 -07:00
youkaichao 515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full (#5801) 2024-06-25 21:56:02 -07:00
Stephanie Wang dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408)
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
Thomas Parnell c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-26 00:04:08 +00:00
Matt Wong dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
Dipika Sikka dd248f7675
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794) 2024-06-25 19:23:35 +00:00
Michael Goin d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798) 2024-06-25 12:18:03 -07:00
Antoni Baum 67882dbb44
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748) 2024-06-25 10:15:10 -07:00
Woo-Yeon Lee 2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
Isotr0py edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772) 2024-06-24 12:11:53 +08:00
Murali Andoorveedu 5d4d90536f
[Distributed] Add send and recv helpers (#5719) 2024-06-23 14:42:28 -07:00
rohithkrn f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 2024-06-21 15:42:46 -07:00
youkaichao d9a252bc8e
[Core][Distributed] add shm broadcast (#5399)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
Jee Li 67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Chang Su c35e4a3dd7
[BugFix] Fix test_phi3v.py (#5725) 2024-06-21 04:45:34 +00:00
Jinzhen Lin 1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
Joshua Rosenkranz b12518d3cf
[Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin 8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
Cyrus Leung 3730a1c832
[Misc] Improve conftest (#5681) 2024-06-19 19:09:21 -07:00
Dipika Sikka 4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
zifeitong 78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
youkaichao d571ca0108
[ci][distributed] add tests for custom allreduce (#5689) 2024-06-19 20:16:04 +00:00
Thomas Parnell e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. (#5659)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-19 06:03:55 +00:00
sergey-tinkoff 07feecde1a
[Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Dipika Sikka 95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 2024-06-18 12:45:05 -04:00
Ronen Schaffer 7879f24dcc
[Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown doc with screenshots to guide users on how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Roger Wang 4ad7b53e59
[CI/Build][Misc] Update Pytest Marker for VLMs (#5623) 2024-06-18 13:10:04 +00:00
Joe Runde 5002175e80
[Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Isotr0py daef218b55
[Model] Initialize Phi-3-vision support (#4986) 2024-06-17 19:34:33 -07:00
sroy745 fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) 2024-06-17 21:29:09 -05:00
Dipika Sikka 890d8d960b
[Kernel] `compressed-tensors` marlin 24 support (#5435) 2024-06-17 12:32:48 -04:00
Michael Goin 4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition (#5577) 2024-06-16 14:07:34 +00:00
Alexander Matveev d919ecc771
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) 2024-06-15 13:38:16 -04:00
Cyrus Leung 81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) 2024-06-15 07:29:42 -04:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
leiwen83 1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-06-14 17:23:56 -07:00
youkaichao 48f589e18b
[mis] fix flaky test of test_cuda_device_count_stateless (#5546) 2024-06-14 10:02:23 -07:00
Antoni Baum 50eed24d25
Add `cuda_device_count_stateless` (#5473) 2024-06-13 16:06:49 -07:00
Tyler Michael Smith 33e3b37242
[CI/Build] Disable test_fp8.py (#5508) 2024-06-13 13:37:48 -07:00
Tyler Michael Smith 85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cyrus Leung 39873476f8
[CI/Build] Simplify OpenAI server setup in tests (#5100) 2024-06-13 11:21:53 -07:00
Michael Goin 23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) 2024-06-13 15:18:08 +00:00
Dipika Sikka c2637a613b
[Kernel] `w4a16` support for `compressed-tensors` (#5385)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
youkaichao ea3890a5f0
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
2024-06-12 17:27:08 -07:00
Travis Johnson 51602eefd3
[Frontend] [Core] Support for sharded tensorized models (#4990)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Cody Yu 5985e3427d
[Kernel] Vectorized FP8 quantize kernel (#5396)
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show that the improved kernel can achieve a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied 3 optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
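A minimal PyTorch sketch of the inverted-scale idea above (the real kernel is CUDA and additionally unrolls the loop and vectorizes memory accesses; the function and constant names here are mine):

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

def quantize_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Inverted scale: one division up front, only multiplications per element.
    inv_scale = 1.0 / scale
    y = (x * inv_scale).clamp(min=-FP8_MAX, max=FP8_MAX)
    return y.to(torch.float8_e4m3fn)
```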
SangBin Cho 847cdcca1c
[CI] Upgrade codespell version. (#5381) 2024-06-12 10:06:14 -07:00
Simon Mo e3c12bf6d2
Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463) 2024-06-12 10:03:24 -07:00
Michael Goin 3dd6853bc8
[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253) 2024-06-12 09:58:02 -07:00
Nick Hill 99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
youkaichao c4bd03c7c5
[Core][Distributed] add same-node detection (#5369) 2024-06-11 10:53:59 -07:00
sasha0552 dcbf4286af
[Frontend] Customizable RoPE theta (#5197) 2024-06-11 10:42:26 -07:00
Cyrus Leung 640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) 2024-06-10 22:36:46 -07:00
maor-ps 351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Itay Etelis 774d1035e4
[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319) 2024-06-10 14:22:09 +00:00
Cyrus Leung 6b29d6fe70
[Model] Initial support for LLaVA-NeXT (#4199)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Cyrus Leung 0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails (#5194) 2024-06-10 19:38:49 +08:00
Dipika Sikka 5884c2b454
[Misc] Update to comply with the new `compressed-tensors` config (#5350)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
bnellnm 5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
youkaichao 5d7e3d0176
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361)
2024-06-09 03:50:14 +00:00
youkaichao 8ea5e44a43
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)
2024-06-08 08:59:20 +00:00
youkaichao 9fb900f90c
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)
2024-06-07 22:31:32 -07:00
Roger Wang 7a9cb294ae
[Frontend] Add OpenAI Vision API Support (#5237)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
Dipika Sikka ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
youkaichao 388596c914
[Misc][Utils] allow get_open_port to be called for multiple times (#5333) 2024-06-06 22:15:11 -07:00
Itay Etelis baa15a9ec3
[Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) 2024-06-07 03:29:24 +00:00
Antoni Baum ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods (#5038) 2024-06-06 19:07:57 -07:00
Matthew Goldey 828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) 2024-06-06 15:48:13 -05:00
liuyhwangyh 4efff036f0
Bugfix: fix broken download of models from modelscope (#5233)
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
2024-06-06 09:28:10 -07:00
Cyrus Leung 89c920785f
[CI/Build] Update vision tests (#5307) 2024-06-06 05:17:18 -05:00
Breno Faria 7b0a0dfb22
[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-06-05 16:49:12 -07:00
Nick Hill faf71bcd4b
[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) 2024-06-05 14:53:05 -07:00
Woosuk Kwon 41ca62cf03
[Misc] Add CustomOp interface for device portability (#5255) 2024-06-05 09:18:19 -07:00
zifeitong 974fc9b845
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) 2024-06-04 19:37:28 -07:00
Cyrus Leung 9ba093b4f4
[CI/Build] Simplify model loading for `HfRunner` (#5251) 2024-06-04 10:09:19 -07:00
Cyrus Leung ec784b2526
[CI/Build] Add inputs tests (#5215) 2024-06-03 21:01:46 -07:00
afeldman-nm f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) 2024-06-03 20:32:57 -07:00
Toshiki Kataoka 06b2550cbb
[Bugfix] Support `prompt_logprobs==0` (#5217) 2024-06-03 17:59:30 -07:00
Breno Faria f775a07e30
[FRONTEND] OpenAI `tools` support named functions (#5032) 2024-06-03 18:25:29 -05:00
Kaiyang Chen 10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) 2024-06-03 13:37:11 -07:00
Yuan cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests (#4113) 2024-06-03 10:39:50 -07:00
Tyler Michael Smith cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) 2024-06-03 09:52:30 -07:00
Cyrus Leung 7a64d24aad
[Core] Support image processor (#4197) 2024-06-02 22:56:41 -07:00
Cyrus Leung dfbe60dc62
[Misc] Simplify code and fix type annotations in `conftest.py` (#5118) 2024-06-02 16:05:50 -07:00
Simon Mo ed59a7ed23
Update test_ignore_eos (#4898) 2024-06-02 02:21:53 +00:00
chenqianfzh b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) 2024-06-01 14:51:10 -06:00
Varun Sundar Rabindranath f081c3ce4b
[Kernel] Update Cutlass fp8 configs (#5144)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith 260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) 2024-06-01 06:45:32 +00:00
SnowDist a22dea54d3
[Model] Support MAP-NEO model (#5081)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Breno Faria 87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-05-30 02:52:14 -07:00
Cyrus Leung b1c255630d
[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) 2024-05-29 16:05:01 -07:00
Cyrus Leung eecd864388
[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096) 2024-05-29 16:02:25 -07:00
afeldman-nm 4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) 2024-05-29 16:09:13 +00:00
Cyrus Leung 18c1f16d86
[Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092) 2024-05-29 07:16:41 +00:00
youkaichao 5bd3c65072
[Core][Optimization] remove vllm-nccl (#5091) 2024-05-29 05:13:52 +00:00
Junichi Sato dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified (#5077) 2024-05-28 17:15:35 -07:00
Cyrus Leung 5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Michał Moskal d4f3985907
[Core] Sliding window for block manager v2 (#4545)
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
Zhuohan Li 1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-27 15:18:17 -07:00
Lily Liu d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) 2024-05-25 10:00:14 -07:00
Eric Xihui Lin 8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
leiwen83 e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Robert Shaw 919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
Dipika Sikka a1242324c9
[Kernel] Initial Activation Quantization Support (#4525)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Murali Andoorveedu 5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
Alexander Matveev 6066253296
Marlin 24 prefill performance improvement (about 25% better on average) (#4983) 2024-05-23 02:39:27 -04:00
Cody Yu ee3eea0a1b
[Misc] Take user preference in attention selector (#4960) 2024-05-23 07:55:56 +09:00
raywanb 97b030005c
[Model] LoRA gptbigcode implementation (#3949) 2024-05-22 13:58:59 -07:00
Cody Yu a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
2024-05-22 13:28:20 -07:00
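A minimal sketch of how a checkpoint-provided kv_scale might be applied around an FP8 KV cache (function names are hypothetical; vLLM applies the scale inside its cache kernels):

```python
import torch

def write_kv_fp8(key: torch.Tensor, kv_scale: float) -> torch.Tensor:
    # Scale KV activations into FP8 range before storing them in the cache.
    return (key / kv_scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

def read_kv_fp8(key_fp8: torch.Tensor, kv_scale: float) -> torch.Tensor:
    # Dequantize on read by multiplying the scale back in.
    return key_fp8.to(torch.float16) * kv_scale
```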
Tyler Michael Smith 8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
SangBin Cho c74c913bfb
[misc] remove comments that were supposed to be removed (#4977) 2024-05-22 09:02:58 -04:00
sasha0552 9b9a10d6cb
[Frontend] Dynamic RoPE scaling (#4638) 2024-05-22 01:32:35 -04:00
Isotr0py f12c3b5b3d
[Model] Add Phi-2 LoRA support (#4886) 2024-05-21 14:24:17 +09:00
Alexei-V-Ivanov-AMD 943e72ca56
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Woosuk Kwon b57e6c5949
[Kernel] Add flash-attn back (#4907) 2024-05-19 18:11:30 -07:00
Alexander Matveev 27ce85476e
[Kernel] Add marlin_24 unit tests (#4901) 2024-05-19 11:37:34 -04:00
Cyrus Leung f68470e803
[Bugfix][Model] Add base class for vision-language models (#4809) 2024-05-19 00:13:33 -07:00
SangBin Cho 2e9a2227ec
[Lora] Support long context lora (#4787)
Currently we need to call the rotary embedding kernel for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
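A minimal PyTorch sketch of the batched-RoPE idea, assuming a stacked cos-sin cache with one entry per scaling factor and a per-token index selecting the cache for that token's LoRA (shapes and names are illustrative):

```python
import torch

def batched_rope(q: torch.Tensor, positions: torch.Tensor,
                 cos_sin_caches: torch.Tensor,
                 cache_idx: torch.Tensor) -> torch.Tensor:
    # q:              [num_tokens, rot_dim]
    # positions:      [num_tokens]
    # cos_sin_caches: [num_scaling_factors, max_pos, rot_dim]
    #                 (first half of last dim is cos, second half is sin)
    # cache_idx:      [num_tokens], cos-sin cache index per token's LoRA
    cos_sin = cos_sin_caches[cache_idx, positions]  # [num_tokens, rot_dim]
    cos, sin = cos_sin.chunk(2, dim=-1)
    q1, q2 = q.chunk(2, dim=-1)
    return torch.cat((q1 * cos - q2 * sin, q1 * sin + q2 * cos), dim=-1)
```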
Jinzhen Lin 33e0823de5
[Bugfix] fix rope error when load models with different dtypes (#4835) 2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD 26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) 2024-05-16 20:58:25 -07:00
Tyler Michael Smith 2060e93659
[Kernel] Add w8a8 CUTLASS kernels (#4749) 2024-05-16 18:32:50 -04:00
Silencio 8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao e08188081b
[Core][Distributed] remove graph mode function (#4818) 2024-05-16 10:59:52 -07:00
Alexander Matveev 6979ade384
Add GPTQ Marlin 2:4 sparse structured support (#4790)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Jinzhen Lin 99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel (#4788) 2024-05-16 09:55:29 -04:00
alexm-nm 5c342570d7
Add marlin unit tests and marlin benchmark script (#4815) 2024-05-16 09:36:49 -04:00
Cody Yu 973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao 30e754390c
[Core] Implement sharded state loader (#4690)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu 52f8107cf2
[Frontend] Support OpenAI batch file format (#4794)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API (#4758) 2024-05-15 14:58:46 -07:00
Cyrus Leung e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166) 2024-05-14 23:38:40 -07:00
SangBin Cho 65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
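A hypothetical sketch of the "one metadata object, sliced per phase" idea, assuming prefill tokens are laid out before decode tokens (this is not vLLM's actual class, just the shape of the design):

```python
from dataclasses import dataclass

import torch

@dataclass
class AttnMetadata:
    # Prefill tokens come first, then decode tokens, so one object can be
    # sliced per phase inside the attention backend.
    num_prefill_tokens: int
    slot_mapping: torch.Tensor  # [num_prefill_tokens + num_decode_tokens]

    @property
    def prefill_slots(self) -> torch.Tensor:
        return self.slot_mapping[: self.num_prefill_tokens]

    @property
    def decode_slots(self) -> torch.Tensor:
        return self.slot_mapping[self.num_prefill_tokens:]
```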
SangBin Cho 8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
The LoRA 3 & 4 tests seem to hit an illegal memory access failure after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df5.

2024-05-15 11:52:45 +09:00
Nick Hill 676a99982f
[Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Stephen Krider 1356df53bd
[Kernel] Use flash-attn for decoding (#3648)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu ce532ff45c
[Speculative decoding] Improve n-gram efficiency (#4724) 2024-05-13 15:00:13 -07:00
Sanger Steel 8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) 2024-05-13 14:57:07 -07:00
Woosuk Kwon 0fca3cdcf2
[Misc] Enhance attention selector (#4751) 2024-05-13 10:47:25 -07:00
SangBin Cho e7c46b9527
[Scheduler] Warning upon preemption and Swapping (#4647)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung 350f9e107f
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relatively.
2024-05-13 23:50:09 +09:00
youkaichao 702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) 2024-05-12 17:47:59 -07:00
Robert Shaw a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713) 2024-05-12 17:46:31 -07:00
Chang Su e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) 2024-05-11 11:30:37 -07:00
youkaichao 4e12131089
[Core][Test] fix function name typo in custom allreduce (#4750) 2024-05-10 15:14:40 -07:00
Robert Shaw fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing (#4748) 2024-05-10 18:01:01 -04:00
heeju-kim2 2e7796f2cf
[Speculative decoding] CUDA graph support (#4295)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
SangBin Cho 6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env (#4737)
Storing an exception frame is extremely prone to circular references because the frame holds references to many objects.

When tensorizer is not installed, it leaks the llm instance because the error frame holds references to various modules, causing a circular reference problem.

I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
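A minimal illustration of the weakref.proxy pattern mentioned above (class and attribute names are hypothetical):

```python
import weakref

class Engine:
    def __init__(self):
        # Capturing `self` strongly in a long-lived callback would create a
        # cycle (self -> callback -> self). A weakref.proxy does not keep the
        # engine alive, so it can be collected as soon as the user drops it.
        proxy = weakref.proxy(self)
        self.step_fn = lambda: proxy.step()

    def step(self) -> str:
        return "stepped"

engine = Engine()
print(engine.step_fn())  # -> stepped
```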
Allen.Dou e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) 2024-05-09 21:42:38 -07:00
youkaichao 208b71bcc1
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
2024-05-09 19:48:43 -07:00
Cody Yu c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) 2024-05-09 18:04:17 -06:00
Woosuk Kwon 0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner (#4705) 2024-05-09 09:04:59 -07:00
Woosuk Kwon 190bc838e1
[Misc] Remove unnecessary ModelRunner imports (#4703) 2024-05-09 00:17:17 -07:00
Cyrus Leung f12b20decc
[Frontend] Move async logic outside of constructor (#4674) 2024-05-08 22:48:33 -07:00
Cody Yu f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
youkaichao 230c4b38c1
[CI/Test] fix swap test for multi gpu (#4689) 2024-05-08 13:14:02 -07:00
youkaichao 20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) 2024-05-08 12:07:05 -07:00
DefTruth 0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) 2024-05-08 09:19:58 -07:00
SangBin Cho f6a593093a
[CI] Make mistral tests pass (#4596) 2024-05-08 08:44:35 -07:00
youkaichao cc466a3290
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
2024-05-07 19:34:47 -07:00
leiwen83 8344f7742b
[Bug fix][Core] fixup ngram not setup correctly (#4551)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00
youkaichao 469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) 2024-05-07 11:06:32 -07:00
youkaichao 63575bc2e1
[Core][Optimization] change python dict to pytorch tensor (#4607) 2024-05-06 21:30:27 -07:00
DearPlanet 4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) 2024-05-04 15:39:34 -07:00
Michael Goin 2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half/full precision for now.
- If there are different weight scales between the QKV projections (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can use a single GEMM for performance.
2024-05-04 11:45:16 -07:00
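A minimal PyTorch sketch of re-quantizing separately scaled QKV weights onto a shared max scale so they can be fused into a single GEMM (names are mine; assumes float8_e4m3fn support and per-tensor float scales):

```python
import torch

def requantize_to_shared_scale(weights, scales):
    # weights: FP8 q/k/v weight tensors, each quantized with its own scale;
    # scales: the matching per-tensor scales as Python floats.
    shared = max(scales)
    fused = []
    for w, s in zip(weights, scales):
        deq = w.to(torch.float32) * s                         # dequantize
        fused.append((deq / shared).to(torch.float8_e4m3fn))  # requantize
    return fused, shared
```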
Cody Yu bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData (#4540) 2024-05-03 17:47:07 -07:00
Cade Daniel ab50275111
[Speculative decoding] Support target-model logprobs (#4378) 2024-05-03 15:52:01 -07:00
Lily Liu 43c413ec57
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck f8e7adda21
Fix/async chat serving (#2727) 2024-05-03 11:04:14 -07:00
SangBin Cho 3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
youkaichao 344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups (#4566) 2024-05-02 17:32:33 -07:00
SangBin Cho 0f8a91401c
[Core] Ignore infeasible swap requests. (#4557) 2024-05-02 14:31:20 -07:00
Michał Moskal 32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
alexm-nm 7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin (#4533) 2024-05-02 12:56:22 -04:00
youkaichao 2a85f93007
[Core][Distributed] enable multiple tp group (#4512)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
Ronen Schaffer 5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics (#4524) 2024-05-01 19:57:12 -07:00
SangBin Cho 0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451) 2024-05-01 19:24:13 -07:00
Danny Guinther b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
sasha0552 c47ba4aaa9
[Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
Nick Hill a657bfc48a
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
leiwen83 24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83 b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho 6f1df80436
[Test] Add ignore_eos test (#4519) 2024-05-01 08:45:42 -04:00
Jee Li d6f4bd7cdd
[Misc]Add customized information for models (#4132) 2024-04-30 21:18:14 -07:00
Robert Caulk c3845d82dc
Allow user to define whitespace pattern for outlines (#4305) 2024-04-30 20:48:39 -07:00
Florian Greinacher a494140433
[Frontend] Support complex message content for chat completions endpoint (#3467)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw 111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83 4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) 2024-04-29 13:52:22 -07:00
Robert Shaw 73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Prashant Gupta d6e520e170
[Core] Support offline use of local cache for models (#4374)
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill 81661da7b2
[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389)
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) 2024-04-27 09:48:37 -07:00
Roy 7134303cbb
[Bugfix][Core] Fix get decoding config from ray (#4335) 2024-04-27 11:30:08 +00:00
Austin Veselka eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Cyrus Leung 8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) 2024-04-27 05:08:24 +00:00
Cody Yu a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method (#4373) 2024-04-26 16:41:14 -04:00
SangBin Cho 603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) 2024-04-26 13:02:02 +00:00
Cyrus Leung a74dee9b62
[Bugfix] Fix parameter name in `get_tokenizer` (#4107) 2024-04-25 19:10:48 -07:00
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao 91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
Cyrus Leung 1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Cade Daniel 62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho 050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
SangBin Cho ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
GeauxEric a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj 91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Cody Yu a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.

Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
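A minimal PyTorch sketch of the per-tensor scaling described above, assuming the e4m3 format (448.0 is its largest finite value; names are mine):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn

def per_tensor_quantize(t: torch.Tensor):
    # Map the largest magnitude in the tensor onto the FP8 max value.
    scale = t.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (t / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

# Weights: quantized once after loading, with the scale stored for reuse.
# Activations: the scale is recomputed in every forward pass.
```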
Ayush Rautwar 138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Jee Li d17c8477f1
[Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
youkaichao 8a7a3e4436
[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
James Whedbee e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Michał Moskal e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) 2024-04-18 00:51:28 -07:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
youkaichao 6dc1fc9cfe
[Core] Add integrity check during initialization; add test for it (#4155)
2024-04-17 22:28:52 -07:00
Shoichi Uchinami a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
youkaichao 8438e0569e
[Core] replace narrow-usage RayWorkerVllm with general WorkerWrapper to reduce code duplication (#4024)
2024-04-17 08:34:33 +00:00
Cade Daniel e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Antoni Baum 69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Noam Gat 05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho 4e7ee664e2
[Core] Fix engine-use-ray broken (#4105) 2024-04-16 05:24:53 +00:00
Sanger Steel 711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) 2024-04-13 17:13:01 -07:00
Jee Li 989ae2538d
[Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
SangBin Cho 36729bac13
[Test] Test multiple attn backend for chunked prefill. (#4023) 2024-04-12 09:56:57 -07:00
Jee Li 1096717ae9
[Core] Support LoRA on quantized models (#4012) 2024-04-11 21:02:44 -07:00
Nick Hill e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids (#3672) 2024-04-11 15:34:12 -07:00
Antoni Baum 1e96c3341a
Add extra punica sizes to support bigger vocabs (#4015) 2024-04-11 22:18:57 +00:00
Dylan Hawk 95e7d4a97c
Fix echo/logprob OpenAI completion bug (#3441)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Antoni Baum a10d3056da
[Core] Set `linear_weights` directly on the layer (#3977) 2024-04-11 16:35:51 -04:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho e42df7227d
[Test] Add xformer and flash attn tests (#3961)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-11 03:09:50 +00:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
youkaichao 63e7176f26
[Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
Travis Johnson 0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文 b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) 2024-04-10 00:09:36 -07:00
Jee Li 11dd6ebb89
[Misc] Avoid loading incorrect LoRA config (#3777) 2024-04-09 19:47:15 -07:00
Cade Daniel e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
youkaichao 95baec828f
[Core] enable out-of-tree model register (#3871) 2024-04-06 17:11:41 -07:00
SangBin Cho 18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) 2024-04-05 10:17:58 -07:00
Cade Daniel e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup (#3863) 2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser aabe8f40f2
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
Michael Feil 537ee25f43
[Core] Enable hf_transfer by default if available (#3817) 2024-04-04 04:02:43 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho 3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling (#3550) 2024-04-03 14:13:49 -07:00
Cade Daniel 5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
Cade Daniel eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) 2024-04-02 00:49:51 +00:00
Qubitium 7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format (#3689)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel 93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) 2024-04-01 22:55:24 +00:00
Nick Hill 49782fcb76
[Misc] Some minor simplifications to detokenization logic (#3670)
Some simplifications made for clarity.

Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Robert Shaw 563c1d7ec5
[CI/Build] Make Marlin Tests Green (#3753) 2024-03-30 19:18:34 -07:00
mawong-amd b6d103542c
[Kernel] Layernorm performance optimization (#3662) 2024-03-30 14:26:38 -07:00
Roy f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error (#3731) 2024-03-29 09:38:21 -07:00
Roy 6110c39dc8
[BugFix] Fix tokenizer out of vocab size (#3685) 2024-03-29 08:18:59 -07:00
youkaichao 756b30a5f3
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)
2024-03-28 21:19:45 -07:00
SangBin Cho 26422e477b
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
Roy 515386ef3c
[Core] Support multi-node inference(eager and cuda graph) (#3686) 2024-03-28 15:01:55 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Cade Daniel 14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
youkaichao 8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
Jee Li 566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Jee Li 8af890a865
Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010 64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Simon Mo f408d05c52
hotfix isort on logprobs ranks pr (#3622) 2024-03-25 11:55:46 -07:00
Dylan Hawk 0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh 819924e749
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Woosuk Kwon 925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
youkaichao 837e185142
[CI/Build] fix flaky test (#3602) 2024-03-24 17:43:05 -07:00
youkaichao 8b268a46a7
[CI] typo fix: is_hip --> is_hip() (#3595) 2024-03-24 16:03:06 -07:00
Nick Hill 41deac4a3d
[BugFix] 1D query fix for MoE models (#3597) 2024-03-24 16:00:16 -07:00
Antoni Baum bfdb1ba5c3
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell cf2f084d56
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Zhuohan Li e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy ea5f14e6ff
[Bugfix][Model] Fix Qwen2 (#3554) 2024-03-22 00:18:58 +00:00
Roy f1c0fc3919
Migrate `logits` computation and gather to `model_runner` (#3233) 2024-03-20 23:25:01 +00:00
SangBin Cho 6e435de766
[1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum 426ec4ec67
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon 5ee14494e4
[Misc] Remove cache stream and cache events (#3461) 2024-03-20 00:38:53 -07:00
ElizaWszola 9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw 097aa0ea22
[CI/Build] Fix Bad Import In Test (#3473) 2024-03-18 20:28:00 +00:00
Simon Mo 120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) 2024-03-16 13:35:27 -07:00
simon-mo ad50bf4b25 fix lint 2024-03-15 22:23:38 -07:00
Tao He 3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
Antoni Baum fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
陈序 54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria 49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
Zhuohan Li 4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Roy 9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Terry 0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Cade Daniel 8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
ElizaWszola b35cc93420
Fix auto prefix bug (#3239) 2024-03-07 16:37:28 -08:00
jacobthebanana 8cbba4622c
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) 2024-03-07 23:03:22 +00:00
Woosuk Kwon 2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Cade Daniel a33ce60c66
[Testing] Fix core tests (#3224) 2024-03-06 01:04:23 -08:00
SangBin Cho 24aecf421a
[Tests] Add block manager and scheduler tests (#3108) 2024-03-05 18:23:34 -08:00
Nick Hill 8999ec3c16
Store `eos_token_id` in `Sequence` for easy access (#3166) 2024-03-05 15:35:43 -08:00
Antoni Baum ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum 22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555 703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Woosuk Kwon 929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen 3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He 71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore 70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor ef978fe411
Port metrics from `aioprometheus` to `prometheus_client` (#2730) 2024-02-25 11:54:00 -08:00
Ronen Schaffer 4caf7044e0
Include tokens from prompt phase in `counter_generation_tokens` (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Nick Hill 7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Antoni Baum 017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Zhuohan Li 63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
Ronen Schaffer e433c115bc
Fix `vllm:prompt_tokens_total` metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Isotr0py ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
jvmncs 8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
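A hypothetical client call against such a server (assuming the OpenAI-compatible endpoint is listening on localhost:8000); the LoRA is addressed by its served name like any other model:

```python
import json
import urllib.request

body = {"model": "sql-lora", "prompt": "SELECT ", "max_tokens": 16}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```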
Woosuk Kwon d7afab6d3a
[BugFix] Fix GC bug for `LLM` class (#2882) 2024-02-14 22:17:44 -08:00