vLLM/vllm

Commit Graph

Author	SHA1	Message	Date
Thien Tran	4f044b1d67	[Kernel][CPU] CPU MLA (#14744 ) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>	2025-03-25 09:34:59 +00:00
Russell Bryant	a09ad90a72	[V1] guidance backend for structured output + `auto` fallback mode (#14779 ) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Loc Huynh <jc1da.3011@gmail.com> Co-authored-by: Michal Moskal <michal@moskal.me>	2025-03-24 21:02:33 -07:00
Harry Mellor	97cfa65df7	Add pipeline parallel support to `TransformersModel` (#12832 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-03-25 10:41:45 +08:00
Woosuk Kwon	ebcebeeb6b	[V1][Spec Decode] Enable spec decode for top-p & top-k sampling (#15063 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-03-24 17:16:46 -07:00
Gregory Shtrasberg	f533b5837f	[ROCm][Kernel] MoE weights padding (#14454 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: charlifu <charlifu@amd.com>	2025-03-24 23:45:30 +00:00
Gregory Shtrasberg	8279201ce6	[Build] Cython compilation support fix (#14296 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2025-03-24 23:37:54 +00:00
Siyuan Liu	23fdab00a8	[Hardware][TPU] Skip failed compilation test (#15421 ) Signed-off-by: Siyuan Liu <lsiyuan@google.com>	2025-03-24 23:28:57 +00:00
Nick Hill	9d72daf4ce	[V1][Perf] Simpler request output queues (#15156 ) Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>	2025-03-24 22:44:08 +00:00
Manish Sethi	761702fd19	[Core] Integrate `fastsafetensors` loader for loading model weights (#10647 ) Signed-off-by: Manish Sethi <Manish.sethi1@ibm.com>	2025-03-24 08:08:02 -07:00
Cyrus Leung	cbcdf2c609	[Bugfix] Fix chat template loading (#15143 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-03-24 13:50:09 +00:00
Jinzhen Lin	6b3cc75be0	[Kernel] allow non-contiguous input for marlin kernel (#14658 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-03-24 09:21:33 -04:00
Luka Govedič	f622dbcf39	[Fix] [torch.compile] Improve UUID system for custom passes (#15249 ) Signed-off-by: luka <luka@neuralmagic.com>	2025-03-24 01:54:07 +00:00
Robin	d6cd59f122	[Frontend] Support tool calling and reasoning parser (#14511 ) Signed-off-by: WangErXiao <863579016@qq.com>	2025-03-23 14:00:07 -07:00
Woosuk Kwon	b9bd76ca14	[V1][Spec Decode] Respect prompt_lookup_max (#15348 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-03-23 10:41:44 -07:00
youkaichao	f68cce8e64	[ci/build] fix broken tests in LLM.collective_rpc (#15350 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-03-23 14:49:48 +08:00
shangmingc	50c9636d87	[V1][Usage] Refactor speculative decoding configuration and tests (#14434 ) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>	2025-03-22 19:28:10 -10:00
Russell Bryant	b877031d80	Remove openvino support in favor of external plugin (#15339 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-03-22 14:06:39 -07:00
Russell Bryant	eb63ea1e18	[V1] Add `disable-any-whitespace` option support for xgrammar (#15316 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-03-22 15:56:17 +00:00
Naitong Yu	2f4bd358f1	[Model] Support Tele-FLM Model (#15023 ) Signed-off-by: Naitong Yu <ntyu@baai.ac.cn> Signed-off-by: jiangxin <horizon94@outlook.com> Co-authored-by: Jason Fang <jasonfang3900@gmail.com> Co-authored-by: jiangxin <horizon94@outlook.com>	2025-03-22 02:04:44 -07:00
Varun Sundar Rabindranath	8a8b30eac1	[Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph capture sizes (#15308 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-22 02:03:32 -07:00
TJian	ec870fba9a	[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature (#14959 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-03-21 22:36:14 -07:00
Nicolò Lucchesi	cfbb8c930f	[TPU][V1] MHA Pallas backend (#15288 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-03-21 08:50:39 -07:00
Cyrus Leung	baec0d4de9	Revert "[Feature] specify model in config.yaml (#14855 )" (#15293 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-21 08:30:23 -07:00
Chen Zhang	93a00d7dde	[v1] Refactor KVCacheConfig (#14079 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-03-21 04:56:27 -07:00
Isotr0py	8afcd0f633	[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend (#15282 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-03-21 11:42:06 +00:00
Wei Zeng	0fa3970deb	[Feature] specify model in config.yaml (#14855 ) Signed-off-by: weizeng <weizeng@roblox.com>	2025-03-21 00:26:03 -07:00
Nick Hill	da6ea29f7a	[V1] Avoid redundant input processing in n>1 case (#14985 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-03-20 22:24:10 -07:00
Hyesoo Yang	47195057e9	[V1][TPU] Speed up top-k on TPU by using torch.topk (#15242 ) Signed-off-by: Hyesoo Yang <hyeygit@gmail.com>	2025-03-20 19:19:40 -07:00
Isotr0py	1e508343e1	[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation (#15200 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-03-20 19:18:04 -07:00
Varun Sundar Rabindranath	0cfe7d386d	[CI/Build] LoRA : make add_lora_test safer (#15181 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-21 09:28:53 +08:00
Woosuk Kwon	0c6f5023c3	[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-20 17:50:43 -07:00
Jason	d8e82bc06d	[Bugfix] fix V1 Engine crash while handling requests with duplicate request id (#15043 ) Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>	2025-03-20 10:01:02 -07:00
Chi Zhang	086b56824c	[ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 (#15172 ) Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-03-21 00:30:04 +08:00
Matt Ritter	a8652f4f0f	Enable CUDA graph support for llama 3.2 vision (#14917 ) Signed-off-by: Matt Ritter <100659061+mritterfigma@users.noreply.github.com>	2025-03-19 23:29:16 -07:00
Mickaël Seznec	a597a57595	[Attention] Flash Attention 3 - fp8 (#14570 ) Signed-off-by: Mickael Seznec <mickael@mistral.ai>	2025-03-20 01:14:20 -04:00
Russell Bryant	1f16b7fe74	[Core][V0] Add guidance backend for structured output (#14589 ) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Loc Huynh <lohuynh@microsoft.com> Co-authored-by: Michal Moskal <michal@moskal.me> Co-authored-by: Aaron Pham <contact@aarnphm.xyz>	2025-03-19 21:33:51 -07:00
Nicolò Lucchesi	d8c6d7d6b5	[V1][TPU] Support V1 Sampler for ragged attention (#14227 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-03-19 21:00:39 -07:00
Jovan Sardinha	70e500cad9	Fix broken tests (#14713 ) Signed-off-by: JovanSardinha <jovan.sardinha@gmail.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2025-03-20 02:06:49 +00:00
Murali Andoorveedu	61c7a1b856	[V1] Minor V1 async engine test refactor (#15075 ) Signed-off-by: andoorve <murali.andoorveedu@mail.utoronto.ca> Co-authored-by: andoorve <murali.andoorveedu@mail.utoronto.ca>	2025-03-19 10:37:17 -07:00
Cyrus Leung	3d446433ec	[Bugfix] Fix size calculation of processing cache (#15114 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-19 05:53:19 -07:00
Cyrus Leung	61f412187d	[Bugfix] Re-enable Gemma3 for V1 (#14980 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-18 23:58:22 -07:00
Cyrus Leung	f690372b68	[Core] Update dtype detection and defaults (#14858 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-19 13:49:33 +08:00
Alexander Matveev	72a8639b68	[V1] TPU - CI/CD use smaller model (#15054 ) Signed-off-by: Alexander Matveev <amatveev@redhat.com>	2025-03-18 21:39:21 +00:00
Woosuk Kwon	99abb8b650	[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-03-18 14:31:54 -07:00
Jee Jee Li	46c759c165	[Bugfix] Fix LoRA extra vocab size (#15047 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-18 09:40:29 -07:00
yury-tokpanov	452e8fd968	[MODEL] Add support for Zamba2 models (#13185 ) Signed-off-by: Yury Tokpanov <yury@zyphra.com> Signed-off-by: Quentin Anthony <qganthony@yahoo.com> Co-authored-by: Quentin Anthony <qganthony@yahoo.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-03-18 08:56:21 -07:00
Patrick von Platen	f863ffc965	[Mistral-Small 3.1] Update docs and tests (#14977 ) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-03-18 03:29:42 -07:00
Varun Sundar Rabindranath	400d483e87	[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-18 09:47:53 +00:00
Liangfu Chen	53a0cf8b95	[Neuron] trim attention kernel tests to fit trn1.2x instance (#14988 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com>	2025-03-18 15:05:52 +08:00
Tristan Leclercq	5eeabc2a44	[Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950 )	2025-03-17 23:27:26 +00:00
Alexander Matveev	18551e820c	[V1] TPU - Fix CI/CD runner (#14974 )	2025-03-17 21:07:07 +00:00
Cyrus Leung	b89fb2a4a1	[CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#14945 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-17 18:35:17 +00:00
Aaron Pham	c0efdd655b	[Fix][Structured Output] using vocab_size to construct matcher (#14868 ) Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2025-03-17 11:42:45 -04:00
vllmellm	2bb0e1a799	[Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-03-17 11:33:35 +00:00
Lily Liu	8d6cf89526	[V1] [Spec Decode] Support random sampling for spec decode (#13933 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-03-16 22:00:20 -07:00
Sibi	a73e183e36	[Misc] Replace os environ to monkeypatch in test suite (#14516 ) Signed-off-by: sibi <85477603+t-sibiraj@users.noreply.github.com> Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Aaron Pham <contact@aarnphm.xyz>	2025-03-16 20:35:57 -07:00
Cyrus Leung	8a5a9b70d7	[CI/Build] Update defaults for test reproducibility (#14893 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-17 10:38:15 +08:00
Robert Shaw	bb3aeddfaf	[CI] Nightly Tests (#14898 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2025-03-17 02:06:43 +00:00
Robert Shaw	aecc780dba	[V1] Enable Entrypoints Tests (#14903 )	2025-03-16 17:56:16 -07:00
Rui Qiao	b9b5bdfc7d	[Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847 )	2025-03-16 15:46:42 -07:00
Nick Hill	fc1f67715d	[BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-03-16 14:53:34 -07:00
Lily Liu	d1ad2a57af	[V1] [Spec Decode] Fix ngram tests (#14878 )	2025-03-16 00:29:22 -07:00
Isotr0py	def232e122	[VLM] Clean up Phi-4-MM ViT implementation (#14812 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2025-03-15 18:53:52 -07:00
Rémi Delacourt	61c6a5a796	[VLM] Merged multi-modal processor for Pixtral (#12211 ) Signed-off-by: remi <remi@mistral.ai> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-15 06:28:27 -07:00
Jun Duan	74bc397b0a	[Core] Expose API endpoint `/is_sleeping` (#14312 ) Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>	2025-03-15 06:28:14 -07:00
Cyrus Leung	3556a41434	[VLM] Limit multimodal input cache by memory (#14805 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-15 02:52:05 -07:00
Jee Jee Li	e0fdfa1608	[CI/Build] Delete LoRA bias test (#14849 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-14 22:09:25 -07:00
Lucas Wilkinson	5952d8ab61	[Attention] Get rid of mla cache alignment (#14842 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-03-15 05:08:25 +00:00
Li, Jiang	a2ae496589	[CPU] Support FP8 KV cache (#14741 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2025-03-14 22:07:36 -07:00
Robert Shaw	d4d93db2c5	[V1] V1 Enablement Oracle (#13726 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-03-14 22:02:20 -07:00
Michael Goin	14f301b541	Update to torch==2.6.0 (#12721 ) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: luka <luka@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-03-14 16:58:30 -04:00
Russell Bryant	46f98893dd	[V1] Fix model parameterization for structured output tests (#14833 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-03-14 20:55:18 +00:00
daniel-salib	73deea2fdb	[Frontend] track server_load (#13950 )	2025-03-14 09:53:17 -07:00
Cyrus Leung	613c5bb945	[Bugfix] Fix Aria test loading (#14823 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-14 09:11:23 -07:00
Cyrus Leung	601bd3268e	[Misc] Clean up type annotation for `SupportsMultiModal` (#14794 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-14 00:59:56 -07:00
Roger Wang	0c2af17c76	[CI] Fix missing example model id in processor test (#14787 ) Signed-off-by: Roger Wang <ywang@roblox.com>	2025-03-14 13:52:15 +08:00
Liangfu Chen	d3d4956261	[Neuron] flatten test parameterization for neuron attention kernels (#14712 )	2025-03-13 20:46:56 -07:00
Varun Sundar Rabindranath	0b1cfa6180	[Kernel] LoRA - Enable CUDAGraphs for V1 (#14626 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-13 20:42:04 -07:00
afeldman-nm	02fcaa3d0a	[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output (#14624 ) Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>	2025-03-13 19:07:34 +00:00
Cyrus Leung	8e9ffd37d6	[Misc] Clean up processor tests (#14771 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-13 18:25:37 +00:00
Cyrus Leung	f53a0586b9	[Bugfix] Fix prompt format of GLM4V (#14539 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-13 11:37:17 +00:00
Cyrus Leung	382403921f	[VLM] Support pan-and-scan for Gemma3 multi-modal processor (#14672 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-03-13 02:23:12 -07:00
Jee Jee Li	bd44b812cb	[CI/Build] Delete ultravox LoRA test (#14730 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-13 07:57:39 +00:00
Nick Hill	f5d3acd474	[BugFix][V1] Fix parallel sampling finishing/aborts (#14512 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-03-12 10:29:48 -07:00
TJian	916836bbfb	[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. (#14664 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-03-12 09:31:19 -07:00
Woosuk Kwon	c0c25e25fa	[Model] Add support for Gemma 3 (#14660 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <ywang@roblox.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-12 08:36:33 -07:00
Li, Jiang	ff47aab056	[CPU] Upgrade CPU backend to torch-2.6 (#13381 ) Signed-off-by: jiang1.li <jiang1.li@intel.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-03-12 10:41:13 +00:00
Pavani Majety	debd6bbf09	[Kernel] Add ModelOpt FP4 Checkpoint Support (#12520 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2025-03-12 05:13:11 +00:00
Benjamin Chislett	5c538c37b2	[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645 ) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>	2025-03-11 22:12:41 -07:00
Szymon Ożóg	e22ee1e7a2	[Kernel] GGUF MoE kernel (#14613 ) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>	2025-03-12 03:33:27 +00:00
Isotr0py	e392d85831	[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization (#14545 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-03-11 20:12:52 -07:00
Aaron Pham	77a318bd01	[V1][Core] Support MistralTokenizer for Structured Output (#14625 ) Signed-off-by: Aaron Pham <contact@aarnphm.xyz>	2025-03-12 10:40:09 +08:00
Farzad Abdolhosseini	80e78d02ac	[Model] Extend Ultravox to accept audio longer than 30s (#13631 ) Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>	2025-03-12 10:27:10 +08:00
Joe Runde	47532cd9f4	[core][V1] pluggable scheduler (#14466 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2025-03-12 01:15:15 +00:00
Russell Bryant	4bf82d4b90	[V1] Add regex structured output support with xgrammar (#14590 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-03-11 23:03:44 +08:00
Cyrus Leung	af295e9b01	[Bugfix] Update `--hf-overrides` for `Alibaba-NLP/gte-Qwen2` (#14609 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-11 07:59:43 -07:00
Jeff Daily	a1c8f3796c	dynamic distpatch of fp8 kernels (#14245 ) Signed-off-by: Jeff Daily <jeff.daily@amd.com>	2025-03-11 10:54:56 -04:00
Roger Wang	1fc973c0b5	[V1][Core] Fix memory issue with logits & sampling (#14508 ) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>	2025-03-11 04:03:41 +00:00
Liangfu Chen	c91b64f749	[neuron] add reshape_and_cache (#14391 )	2025-03-10 18:37:29 -07:00
gnovack	d6123170d5	[Neuron] Add Neuron device communicator for vLLM v1 (#14085 )	2025-03-10 18:37:04 -07:00
Varun Sundar Rabindranath	5ff0d32580	[V1] LoRA - Add triton kernels for V1 (#13096 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-10 17:27:53 -04:00
Harry Mellor	3b352a2f92	Correct capitalisation: `VLLM` -> `vLLM` (#14562 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-03-10 16:36:21 +00:00
Szymon Ożóg	89cdaa83e7	[Kernel] Add more dtype support for GGUF kernels (#14043 ) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>	2025-03-10 07:30:04 -07:00
Robert Shaw	5f0b53c6ea	Revert "[V1][Core] Fix memory issue with logits & sampling" (#14504 ) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-03-08 17:43:37 -08:00
22quinn	eb8b5eb183	[V1] Support bad_words in sampler (#13376 ) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-08 14:50:26 -08:00
Isotr0py	609ef61fea	[Bugfix] Fix profiling OOM and decouple encoder multimodal profiling (#14361 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-03-08 16:52:34 +00:00
Roger Wang	8d5aa466fb	[V1][Core] Fix memory issue with logits & sampling (#13776 ) Signed-off-by: Roger Wang <ywang@roblox.com>	2025-03-08 06:11:04 -08:00
Alexander Matveev	cb8bdfade2	[V1] TPU - Add tensor parallel support via Ray (#13618 ) Signed-off-by: Alexander Matveev <amatveev@redhat.com>	2025-03-08 08:19:38 -05:00
Cyrus Leung	33f227e16b	[CI/Build] Use a fixed seed to avoid flaky tests (#14480 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-03-08 11:30:09 +00:00
Harry Mellor	47512b3200	Default to `generation_config` from model (#12622 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-03-08 14:46:15 +08:00
afeldman-nm	ef64044079	[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (#13949 )	2025-03-08 01:48:12 +00:00
Nick Hill	8ed5421aaa	[V1] Eagerly remove finished requests from the batch (#14388 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-03-07 10:56:00 -08:00
Jinzhen Lin	d0feea31c7	[Kernel] optimize performance of gptq marlin kernel when n is small (#14138 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-03-07 11:53:38 -05:00
Aaron Pham	80e9afb5bc	[V1][Core] Support for Structured Outputs (#12388 ) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-07 07:19:11 -08:00
மனோஜ்குமார் பழனிச்சாமி	cc10281498	[Misc] Set default value of seed to None (#14274 ) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>	2025-03-07 10:40:01 +00:00
Jee Jee Li	12c29a881f	[Bugfix] Further clean up LoRA test (#14422 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-07 10:30:55 +00:00
Ilya Lavrenov	8ca7a71df7	OpenVINO: added CPU-like conditions (#14338 ) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>	2025-03-06 22:24:49 -08:00
Jee Jee Li	ddd1ef66ec	[Bugfix] Fix JambaForCausalLM LoRA (#14370 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-06 22:05:47 -08:00
Luka Govedič	e1744502c2	[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (#14390 ) Signed-off-by: luka <luka@neuralmagic.com>	2025-03-07 05:20:16 +00:00
Himanshu Jaju	cd579352bf	[V1] Do not detokenize if sampling param detokenize is False (#14224 ) Signed-off-by: Himanshu Jaju <hj@mistral.ai> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-06 10:40:24 -08:00
Harry Mellor	bf0560bda9	Reinstate `best_of` for V0 (#14356 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-03-06 08:34:22 -08:00
Thomas Parnell	6bd1dd9d26	[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152 )	2025-03-06 07:39:16 -08:00
Nicolò Lucchesi	fa82b93853	[Frontend][Docs] Transcription API streaming (#13301 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-03-06 10:39:35 +00:00
kYLe	1769928079	[Model] Update Paligemma multimodal processing with PromptUpdate (#14015 ) Signed-off-by: Kyle Huang <kylhuang@nvidia.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2025-03-06 08:31:38 +00:00
Nicolò Lucchesi	5ee10e990d	[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301 )	2025-03-05 20:00:53 -08:00
Varun Sundar Rabindranath	3dbd2d813a	[V1] LoRA - Enable more V1 tests (#14315 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-03-06 11:55:42 +08:00
Lucas Wilkinson	f6bb18fd9a	[BugFix] MLA + V1, illegal memory access and accuracy issues (#14253 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-03-05 17:10:13 -08:00
Lu Fang	53ea6ad830	[V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308 ) Signed-off-by: Lu Fang <lufang@fb.com>	2025-03-05 21:41:18 +00:00
Vincent	a4f1ee35d6	Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 (#13997 ) Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-03-05 20:22:43 +00:00
Robert Shaw	257e200a25	[V1][Frontend] Add Testing For V1 Runtime Parameters (#14159 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2025-03-05 14:18:55 +00:00
Benjamin Chislett	32985bed7c	[Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066 ) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>	2025-03-05 06:30:40 +00:00
Michael Goin	dae9ec464c	Temporarily disable test_awq_gemm_opcheck (#14251 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-03-05 06:10:35 +00:00
Tyler Michael Smith	72c62eae5f	[V1] EP/TP MoE + DP Attention (#13931 )	2025-03-04 21:27:26 -08:00
Congcong Chen	0a995d5434	[Model] New model support for Phi-4-multimodal-instruct (#14119 )	2025-03-04 20:57:01 -08:00
Nick Hill	5db6b2c961	[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-03-04 15:06:47 +00:00
Travis Johnson	c060b71408	[Model] Add support for GraniteMoeShared models (#13313 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-03-04 08:04:52 +08:00
Mark McLoughlin	ae122b1cbd	[WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-03-03 19:04:45 +00:00
TJian	848a6438ae	[ROCm] Faster Custom Paged Attention kernels (#12348 )	2025-03-03 09:24:45 -08:00
Cody Yu	f35f8e2242	[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-03-03 16:43:14 +08:00
Harry Mellor	cf069aa8aa	Update deprecated Python 3.8 typing (#13971 )	2025-03-02 17:34:51 -08:00
Ce Gao	bf33700ecd	[v0][structured output] Support reasoning output (#12955 ) Signed-off-by: Ce Gao <cegao@tensorchord.ai>	2025-03-02 14:49:42 -05:00
Jee Jee Li	cc5e8f6db8	[Model] Add LoRA support for TransformersModel (#13770 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-03-02 09:17:34 +08:00
YajieWang	6a92ff93e1	[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931 )	2025-02-28 22:30:59 -08:00
Luka Govedič	bd56c983d6	[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902 ) Signed-off-by: luka <luka@neuralmagic.com>	2025-02-28 16:20:11 -07:00
Chen Zhang	28943d36ce	[v1] Move block pool operations to a separate class (#13973 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-28 20:53:31 +00:00
Chen Zhang	e7bd944e08	[v1] Cleanup the BlockTable in InputBatch (#13977 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-02-28 19:03:16 +00:00
Harry Mellor	4be4b26cb7	Fix entrypoint tests for embedding models (#14052 )	2025-02-28 08:56:44 -08:00
Cyrus Leung	f7bee5c815	[VLM][Bugfix] Enable specifying prompt target via index (#14038 )	2025-02-28 07:35:55 -08:00
Harry Mellor	76c89fcadd	Use smaller embedding model when not testing model specifically (#13891 )	2025-02-28 00:50:43 -08:00
Travis Johnson	73e0225ee9	[Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2025-02-28 04:00:45 +00:00
Sage Moore	38acae6e97	[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970 ) Signed-off-by: Sage Moore <sage@neuralmagic.com>	2025-02-27 20:31:47 +00:00
Cyrus Leung	f1579b229d	[VLM] Generalized prompt updates for multi-modal processor (#13964 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-27 17:44:25 +00:00
Isotr0py	edf309ebbe	[VLM] Support multimodal inputs for Florence-2 models (#13320 )	2025-02-27 02:06:41 -08:00
Michael Goin	788f284b53	Fix test_block_fp8.py test for MoE (#13915 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-27 18:00:00 +08:00
Mark McLoughlin	cd711c48b2	[V1][Metrics] Handle preemptions (#13169 )	2025-02-26 20:04:59 -08:00
Rui Qiao	c9944acbf9	[misc] Rename Ray ADAG to Compiled Graph (#13928 )	2025-02-26 20:03:28 -08:00
Lucas Wilkinson	f95903909f	[Kernel] FlashMLA integration (#13747 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-02-27 10:35:08 +08:00
Wallas Henrique	4cb6fa0a9c	[Bugfix] Backend option to disable xgrammar any_whitespace (#12744 ) Signed-off-by: Wallas Santos <wallashss@ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>	2025-02-26 10:52:34 -08:00
Cyrus Leung	934bb99c71	[Bugfix] Update expected token counts for Ultravox tests (#13895 )	2025-02-26 04:56:50 -08:00
Joe Runde	3f808cc044	[Bugfix] Do not crash V0 engine on input errors (#13101 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2025-02-26 19:07:29 +08:00
Florian Greinacher	215bf150a6	[Bugfix] Handle None parameters in Mistral function calls. (#13786 )	2025-02-26 03:06:21 -08:00
Cyrus Leung	7b700ec8c8	[Bugfix] Add test example for Ultravox v0.5 (#13890 )	2025-02-26 02:31:43 -08:00
Roger Wang	7ca1da020f	[Misc] Fix input processing for Ultravox (#13871 )	2025-02-25 23:56:34 -08:00
Jee Jee Li	5157338ed9	[Misc] Improve LoRA spelling (#13831 )	2025-02-25 23:43:01 -08:00
Harry Mellor	145944cb94	Improve pipeline partitioning (#13839 )	2025-02-25 18:53:56 -08:00
Lily Liu	5629f26df7	[V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729 )	2025-02-25 18:14:48 -08:00
Michael Goin	07c4353057	[Model] Support Grok1 (#13795 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-26 01:07:12 +00:00
Harry Mellor	34e3494e70	Fix failing `MyGemma2Embedding` test (#13820 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-02-25 12:33:03 -08:00
Liangfu Chen	f75aa72732	[Neuron] Add custom_ops for neuron backend (#13246 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: George Novack <gnovack@amazon.com> Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com>	2025-02-25 11:47:49 -08:00
Jee Jee Li	37b6cb4985	[CI/Build] Fix V1 LoRA failure (#13767 )	2025-02-25 02:01:15 -08:00
Gregory Shtrasberg	aabeb2688f	[ROCm][Quantization][Kernel] Using HIP FP8 header (#12593 )	2025-02-25 00:39:59 -08:00
Varun Sundar Rabindranath	03f48b3db6	[Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705 )	2025-02-25 00:18:02 -08:00
Harry Mellor	cdc1fa12eb	Remove unused kwargs from model definitions (#13555 )	2025-02-24 17:13:52 -08:00
afeldman-nm	befc402d34	[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) (#10980 ) Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-02-24 08:29:41 -08:00
Jongseok Park	781096e385	Expert Parallelism (EP) Support for DeepSeek V2 (#12583 )	2025-02-24 07:33:20 -08:00
Kevin H. Luu	f90a375593	[ci] Add logic to change model to S3 path only when S3 CI env var is on (#13727 ) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal>	2025-02-24 06:32:11 +00:00
Nick Hill	cbae7af552	[V1][BugFix] Fix engine core client shutdown hangs (#13298 ) Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to do this explicitly or else it can hang in the context.term() method. Close zmq sockets explicitly before terminating context, make shutdown of client resource more robust, shut down engine core process prior to terminating zmq context. Signed-off-by: Nick Hill <nhill@redhat.com>	2025-02-23 13:07:43 -08:00
youkaichao	eb24dc4a45	[v1] torchrun compatibility (#13642 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-23 22:47:24 +08:00
Isotr0py	ba5106e519	[LMM] Implement merged multimodal processor for whisper (#13278 )	2025-02-23 01:46:03 -08:00
Kevin H. Luu	2c5e637b57	[ci] Use env var to control whether to use S3 bucket in CI (#13634 )	2025-02-22 19:19:45 -08:00
Sage Moore	558db8083c	[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095 )	2025-02-22 05:25:41 -08:00
Kaixi Hou	e109e598c7	[NVIDIA] Support nvfp4 cutlass gemm (#13571 )	2025-02-22 05:24:05 -08:00
Keyun Tong	8db1b9d0a1	Support SSL Key Rotation in HTTP Server (#13495 )	2025-02-22 05:17:44 -08:00
Jee Jee Li	105b8ce4c0	[Misc] Reduce LoRA-related static variable (#13166 )	2025-02-22 00:21:30 -08:00
Mark McLoughlin	2cb8c1540e	[Metrics] Add `--show-hidden-metrics-for-version` CLI arg (#13295 )	2025-02-22 00:20:45 -08:00
Mark McLoughlin	1cd981da4f	[V1][Metrics] Support `vllm:cache_config_info` (#13299 )	2025-02-22 00:20:00 -08:00
Lu Fang	bb78fb318e	[v1] Support allowed_token_ids in v1 Sampler (#13210 ) Signed-off-by: Lu Fang <lufang@fb.com>	2025-02-22 14:13:05 +08:00
Keyun Tong	0ffdf8ce0c	[HTTP Server] Make model param optional in request (#13568 )	2025-02-21 21:55:50 -08:00
Lucas Wilkinson	288cc6c234	[Attention] MLA with chunked prefill (#12639 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Patrick Horn <patrick.horn@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-21 15:30:12 -08:00
Kevin H. Luu	34ad27fe83	[ci] Fix metrics test model path (#13635 )	2025-02-20 22:12:10 -08:00
Gabriel Marinho	1c3c975766	[FEATURE] Enables /score endpoint for embedding models (#12846 )	2025-02-20 22:09:47 -08:00
Lingfan Yu	33170081f1	[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245 ) Signed-off-by: Lingfan Yu <lingfany@amazon.com>	2025-02-20 17:45:45 -08:00
Joe Runde	bfbc0b32c6	[Frontend] Add backend-specific options for guided decoding (#13505 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2025-02-20 15:07:58 -05:00
Harry Mellor	992e5c3d34	Merge similar examples in `offline_inference` into single `basic` example (#12737 )	2025-02-20 04:53:51 -08:00
Kevin H. Luu	a64a84433d	[2/n][ci] S3: Use full model path (#13564 ) Signed-off-by: <>	2025-02-20 01:20:15 -08:00
Kevin H. Luu	aa1e62d0db	[ci] Fix spec decode test (#13600 )	2025-02-20 16:56:00 +08:00
youkaichao	ba81163997	[core] add sleep and wake up endpoint and v1 support (#12987 ) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: cennn <2523403608@qq.com> Co-authored-by: cennn <2523403608@qq.com>	2025-02-20 12:41:17 +08:00
Jee Jee Li	512368e34a	[Misc] Qwen2.5 VL support LoRA (#13261 )	2025-02-19 18:37:55 -08:00
Kevin H. Luu	473f51cfd9	[3/n][CI] Load Quantization test models with S3 (#13570 ) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>	2025-02-20 10:12:30 +08:00
Cyrus Leung	377d10bd14	[VLM][Bugfix] Pass processor kwargs properly on init (#13516 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-19 13:13:50 +00:00
Yannick Schnider	423330263b	[Feature] Pluggable platform-specific scheduler (#13161 ) Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com> Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>	2025-02-19 17:16:38 +08:00
Nick Hill	caf7ff4456	[V1][Core] Generic mechanism for handling engine utility (#13060 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-02-19 17:09:22 +08:00
Lucia Fang	f525c0be8b	[Model][Speculative Decoding] DeepSeek MTP spec decode (#12755 ) Signed-off-by: Lu Fang <fanglu@fb.com> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2025-02-19 17:06:23 +08:00
Alex Brooks	983a40a8bb	[Bugfix] Fix Positive Feature Layers in Llava Models (#13514 ) Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>	2025-02-19 08:50:07 +00:00
Kevin H. Luu	d5d214ac7f	[1/n][CI] Load models in CI from S3 instead of HF (#13205 ) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>	2025-02-19 07:34:59 +00:00
Nick Hill	30172b4947	[V1] Optimize handling of sampling metadata and req_ids list (#13244 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-02-18 12:15:33 -08:00
Murali Andoorveedu	a4d577b379	[V1][Tests] Adding additional testing for multimodal models to V1 (#13308 ) Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>	2025-02-18 09:53:14 -08:00
Liangfu Chen	3809458456	[Bugfix] Fix invalid rotary embedding unit test (#13431 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com>	2025-02-18 11:52:03 +00:00
Michael Goin	b53d79983c	Add outlines fallback when JSON schema has enum (#13449 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-18 06:49:41 +00:00
Isotr0py	67ef8f666a	[Model] Enable quantization support for `transformers` backend (#12960 )	2025-02-17 19:52:47 -08:00
Woosuk Kwon	cd4a72a28d	[V1][Spec decode] Move drafter to model runner (#13363 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-17 15:40:12 -08:00
Woosuk Kwon	4c21ce9eba	[V1] Get input tokens from scheduler (#13339 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-17 11:01:07 -08:00
Tyler Michael Smith	1f69c4a892	[Model] Support Mamba2 (Codestral Mamba) (#9292 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>	2025-02-17 20:17:50 +08:00
shangmingc	46cdd59577	[Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304 ) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>	2025-02-16 19:32:26 -08:00
Cyrus Leung	5d2965b7d7	[Bugfix] Fix 2 Node and Spec Decode tests (#13341 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-16 22:20:22 +08:00
youkaichao	124776ebd5	[ci] skip failed tests for flashinfer (#13352 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-16 22:09:15 +08:00
wchen61	dc0f7ccf8b	[BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187 ) Signed-off-by: wchen61 <wchen61@foxmail.com>	2025-02-16 08:59:49 +00:00
Lily Liu	80f63a3966	[V1][Spec Decode] Ngram Spec Decode (#12193 ) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2025-02-15 18:05:11 -08:00
Cody Yu	9206b3d7ec	[V1][PP] Run engine busy loop with batch queue (#13064 )	2025-02-15 03:59:01 -08:00
Mark McLoughlin	2ad1bc7afe	[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288 )	2025-02-15 03:56:19 -08:00
Woosuk Kwon	e7eea5a520	[V1][CI] Fix failed v1-test because of min_p (#13316 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-14 17:29:51 -08:00
Aoyu	a12934d3ec	[V1][Core] min_p sampling support (#13191 ) Signed-off-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com>	2025-02-14 15:50:05 -08:00
Joe Runde	3bcb8c75da	[Core] Reduce TTFT with concurrent partial prefills (#10235 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-14 15:36:07 -08:00
Michael Goin	5e5c8e091e	[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-14 12:53:42 -08:00
Lu Fang	6224a9f620	Support logit_bias in v1 Sampler (#13079 )	2025-02-14 04:34:59 -08:00
Alexander Matveev	45f90bcbba	[WIP] TPU V1 Support Refactored (#13049 )	2025-02-14 00:21:53 -08:00
Kero Liang	b0ccfc565a	[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126 )	2025-02-13 22:39:20 -08:00
Varun Sundar Rabindranath	cbc40128eb	[V1] LoRA - Enable Serving Usecase (#12883 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-02-14 14:21:12 +08:00
Harry Mellor	f2b20fe491	Consolidate Llama model usage in tests (#13094 )	2025-02-13 22:18:03 -08:00
Tyler Michael Smith	09545c0a94	[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250 )	2025-02-13 20:19:25 -08:00
Tyler Michael Smith	c1e37bf71b	[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-14 00:01:14 +00:00
Nicolò Lucchesi	d84cef76eb	[Frontend] Add `/v1/audio/transcriptions` OpenAI API endpoint (#12909 )	2025-02-13 07:23:45 -08:00
Vaibhav Jain	37dfa60037	[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193 )	2025-02-13 06:52:22 -08:00
Cyrus Leung	1bc3b5e71b	[VLM] Separate text-only and vision variants of the same model architecture (#13157 )	2025-02-13 06:19:15 -08:00
Cyrus Leung	c9d3ecf016	[VLM] Merged multi-modal processor for Molmo (#12966 )	2025-02-13 04:34:00 -08:00
Rui Qiao	9605c1256e	[V1][core] Implement pipeline parallel on Ray (#12996 )	2025-02-13 08:02:46 +00:00
LikeSundayLikeRain	04f50ad9d1	[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (#13097 )	2025-02-12 23:11:26 -08:00
Isotr0py	bc55d13070	[VLM] Implement merged multimodal processor for Mllama (#11427 )	2025-02-12 20:26:21 -08:00
Kaixi Hou	4fc5c23bb6	[NVIDIA] Support nvfp4 quantization (#12784 )	2025-02-12 19:51:51 -08:00
Michael Goin	14b7899d10	[CI] Fix failing FP8 cpu offload test (#13170 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-12 19:16:06 +00:00
Qubitium-ModelCloud	36a08630e8	[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control (#7086 )	2025-02-12 09:19:43 -08:00
Jee Jee Li	82cabf53a3	[Misc] Delete unused LoRA modules (#13151 )	2025-02-12 08:58:24 -08:00
Rafael Vasquez	314cfade02	[Frontend] Generate valid tool call IDs when using `tokenizer-mode=mistral` (#12332 )	2025-02-12 08:29:56 -08:00
Lingfan Yu	e92694b6fe	[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921 ) Signed-off-by: Lingfan Yu <lingfany@amazon.com>	2025-02-11 21:12:37 -08:00
Christian Pinto	974dfd4971	[Model] IBM/NASA Prithvi Geospatial model (#12830 )	2025-02-11 20:34:30 -08:00
Keyun Tong	3ee696a63d	[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518 ) Signed-off-by: Keyun Tong <tongkeyun@gmail.com>	2025-02-12 12:25:58 +08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟	6c4dbe23eb	[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962 ) Signed-off-by: Hollow Man <hollowman@opensuse.org>	2025-02-12 00:21:50 +08:00
Mark McLoughlin	75e6e14516	[V1][Metrics] Add several request timing histograms (#12644 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-02-11 10:14:00 -05:00
மனோஜ்குமார் பழனிச்சாமி	110f59a33e	[Bugfix] fix flaky test (#13089 ) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>	2025-02-11 14:41:20 +00:00
Cody Yu	41c5dd45b9	[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592 )	2025-02-11 08:27:25 +00:00
Ce Gao	fc6485d277	[Bugfix]: Reasoning output bug according to the chat template change (#13025 ) Signed-off-by: Ce Gao <cegao@tensorchord.ai>	2025-02-11 15:49:03 +08:00
Varun Sundar Rabindranath	78a141d768	[Misc] LoRA - Refactor Punica ops tests (#12970 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-02-11 07:26:03 +00:00
Florian Greinacher	cb080f32e3	[Bugfix] Support missing tool parameters in mistral tokenizer (#12884 ) Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com>	2025-02-11 03:33:33 +00:00
Farzad Abdolhosseini	08b2d845d6	[Model] Ultravox Model: Support v0.5 Release (#12912 ) Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>	2025-02-10 22:02:48 +00:00
மனோஜ்குமார் பழனிச்சாமி	2ae889052c	Fix seed parameter behavior in vLLM (#13007 ) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>	2025-02-10 23:26:50 +08:00
Cyrus Leung	51f0b5f7f6	[Bugfix] Clean up and fix multi-modal processors (#13012 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-10 10:45:21 +00:00
youkaichao	b2496bb07f	[core] fix sleep mode and pytorch checkpoint compatibility (#13001 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-10 13:03:43 +08:00
youkaichao	59fff4a01a	[core] improve error handling when wake up from sleep mode (#12981 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-10 09:38:57 +08:00
youkaichao	cf797aa856	[core] port pynvml into vllm codebase (#12963 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-09 15:00:00 +08:00
Jee Jee Li	86222a3dab	[VLM] Merged multi-modal processor for GLM4V (#12449 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-02-08 20:32:16 +00:00
youkaichao	91dd8f7aa6	[bugfix] respect distributed_executor_backend in world_size=1 (#12934 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-08 16:17:08 +08:00
Woosuk Kwon	3243158336	[V1] Move KV block hashes from Request to KVCacheManager (#12922 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-07 19:14:10 -08:00
TJian	eaa92d4437	[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501 )	2025-02-07 08:13:43 -08:00
afeldman-nm	0630d4537a	[V1] Logprobs and prompt logprobs support (#9880 ) This PR is adding support for sample logprobs & prompt logprobs to vLLM v1. New behavior: - During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order. - In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized. - During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.) - Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer. Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-02-07 07:26:20 -08:00
Cyrus Leung	ce26b16268	[Misc] Remove unnecessary detokenization in multimodal processing (#12868 )	2025-02-07 06:21:17 -08:00
Maximilien de Bayser	6e1fc61f0f	Prevent unecessary requests to huggingface hub (#12837 )	2025-02-06 21:37:41 -08:00
Yu Chin Fabian Lim	aff404571b	Add Bamba Model (#10909 ) Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-06 15:22:42 -08:00
Varun Sundar Rabindranath	467a96a541	[V1] LoRA Support (#10957 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-02-06 09:32:51 -08:00
youkaichao	09b95e36ab	[torch.compile] PyTorch 2.6 and nightly compatibility (#12393 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-07 01:09:07 +08:00
Isotr0py	85ac82d228	[Kernel] Make rotary_embedding ops more flexible with input shape (#12777 )	2025-02-06 08:46:13 -08:00
Dipika Sikka	7ca9934fe7	[Misc] Update w2 scale loading for GPTQMarlinMoE (#12757 )	2025-02-06 01:02:14 -08:00
Lu Fang	56534cd577	[Bugfix] Fix the test_ultravox.py's license (#12806 ) Signed-off-by: Lu Fang <lufang@fb.com>	2025-02-06 13:25:54 +08:00
Sumit Vij	d88506dda4	[Model] LoRA Support for Ultravox model (#11253 )	2025-02-05 19:54:13 -08:00
Cyrus Leung	75404d041b	[VLM] Update compatibility with transformers 4.49	2025-02-05 19:09:45 -08:00
Roger Wang	bf3b79efb8	[VLM] Qwen2.5-VL	2025-02-05 13:31:38 -08:00
Rahul Tuli	3b2005e1db	Add: Support for Sparse24Bitmask Compressed Models	2025-02-05 13:30:43 -08:00
Lucas Wilkinson	75e94309e8	[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-02-04 18:22:24 -08:00
Mark McLoughlin	233df6f5c4	[V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-02-04 19:46:54 -05:00
Cyrus Leung	18016a5e62	[Bugfix] Fix CI failures for InternVL and Mantis models (#12728 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-04 23:54:23 +08:00
Isotr0py	815079de8e	[VLM] merged multimodal processor and V1 support for idefics3 (#12660 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-02-04 20:00:51 +08:00
Woosuk Kwon	18a88fcccc	[V1] Remove scheduling constraint on partial requests (#12674 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-04 02:43:58 -08:00
Cyrus Leung	d1ca7df84d	[VLM] Merged multi-modal processor for InternVL-based models (#12553 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-02-04 16:44:52 +08:00
Hongxia Yang	c36ac98d01	[AMD][ROCm] Enable DeepSeek model on ROCm (#12662 ) Signed-off-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2025-02-04 08:24:11 +00:00
Thomas Parnell	bb392af434	[Doc] Replace ibm-fms with ibm-ai-platform (#12709 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2025-02-04 07:05:04 +00:00
Russell Bryant	73b35cca7f	[Core] Improve hash collision avoidance in prefix caching (#12621 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-03 16:28:20 -08:00
Cody Yu	5095e96606	[V1] Revert `uncache_blocks` and support recaching full blocks (#12415 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-03 15:04:53 -08:00
Cody Yu	cf58b9c4ca	[MISC] Remove model input dumping when exception (#12582 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-03 13:34:16 -08:00
Russell Bryant	33e0602e59	[Misc] Fix improper placement of SPDX header in scripts (#12694 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-03 11:16:59 -08:00
Arthur	a1a2aaadb9	[Model]: Add `transformers` backend support (#11330 ) # Adds support for `transformers` as a backend Following https://github.com/huggingface/transformers/pull/35235, a bunch of models should already be supported, we are ramping up support for more models. Thanks @Isotr0py for the TP support, and @hmellor for his help as well! This includes: - `trust_remote_code=True` support: any model on the hub, if it implements attention the correct way can be natively supported!! - tensor parallel support --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2025-02-03 21:30:38 +08:00
youkaichao	20579c0fae	make sure mistral_common not imported for non-mistral models (#12669 ) When people use deepseek models, they find that they need to solve cv2 version conflict, see https://zhuanlan.zhihu.com/p/21064432691 . I added the check, and make all imports of `cv2` lazy. --------- Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-03 13:40:25 +08:00
Russell Bryant	e489ad7a21	[Misc] Add SPDX-License-Identifier headers to python source files (#12628 ) - Add SPDX license headers to python source files - Check for SPDX headers using pre-commit commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745 Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:18:24 2025 -0500 Add SPDX license headers to python source files This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance. The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job. More information can be found on the SPDX site: - https://spdx.dev/learn/handling-license-info/ Signed-off-by: Russell Bryant <rbryant@redhat.com> commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:36:32 2025 -0500 Check for SPDX headers using pre-commit Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 11:58:18 -08:00
Shawn Du	f8ece6e17f	[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608 ) As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>	2025-02-02 16:40:58 +08:00
Chen Zhang	89003c4082	[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603 ) This pr adds extra key to block hash, to generate different hash value for two blocks with the same token string but different extra_keys in their parent blocks. For example, it can generate different hash value for the second block of the following two requests: ```python request1 = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash1", "hash2"], ) request2 = make_request( request_id=1, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash3", "hash2"], ) ``` --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-31 13:13:04 -08:00
Lucas Wilkinson	cabaf4eff3	[Attention] MLA decode optimizations (#12528 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-30 23:49:37 -08:00
Lucas Wilkinson	9798b2fb00	[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868 )	2025-01-30 18:33:00 -08:00
Mark McLoughlin	f17f1d4608	[V1][Metrics] Add GPU cache usage % gauge (#12561 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-29 18:31:01 -08:00
Jinzhen Lin	27b78c73ca	[Kernel] add triton fused moe kernel for gptq/awq (#12185 )	2025-01-29 09:07:09 -05:00
Yanyi Liu	ff7424f491	[Frontend] Support override generation config in args (#12409 ) Signed-off-by: liuyanyi <wolfsonliu@163.com>	2025-01-29 01:41:01 -08:00
Alphi	d93bf4da85	[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069 ) Signed-off-by: hzh <hezhihui_thu@163.com> Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Signed-off-by: Akshat Tripathi <akshat@krai.ai> Signed-off-by: Oleg Mosalov <oleg@krai.ai> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu> Signed-off-by: Chenguang Li <757486878@qq.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Shanshan Shen <467638484@qq.com> Signed-off-by: elijah <f1renze.142857@gmail.com> Signed-off-by: Yikun <yikunkero@gmail.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Konrad Zawora <kzawora@habana.ai> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com> Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: sixgod <evethwillbeok@outlook.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com> Co-authored-by: Oleg Mosalov <oleg@krai.ai> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com> Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com> Co-authored-by: Concurrensee <yida.wu@amd.com> Co-authored-by: Chenguang Li <757486878@qq.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com> Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-01-29 09:24:59 +00:00
Travis Johnson	036ca94c25	[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Wallas Santos <wallashss@ibm.com> Co-authored-by: Wallas Santos <wallashss@ibm.com>	2025-01-29 08:54:35 +00:00
Mark McLoughlin	46fb056749	[V1][Metrics] Add TTFT and TPOT histograms (#12530 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-29 04:11:16 +00:00
Ce Gao	a7e3eba66f	[Frontend] Support reasoning content for deepseek r1 (#12473 ) Signed-off-by: Ce Gao <cegao@tensorchord.ai> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Michael Goin <mgoin@redhat.com>	2025-01-29 11:38:08 +08:00
Mark McLoughlin	c386c43ca3	[V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-28 22:07:22 +00:00
Mark McLoughlin	3fd1fb63ef	[V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-28 16:38:38 +00:00
Cyrus Leung	8f58a51358	[VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-28 16:25:05 +00:00
Gabriel Marinho	0f465ab533	[FEATURE] Enables offline /score for embedding models (#12021 ) Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>	2025-01-28 11:30:13 +08:00
Liangfu Chen	ddee88d0ff	[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: Jiangfei Duan <jfduan@outlook.com>	2025-01-27 17:31:16 -08:00
Harry Mellor	823ab79633	Update `pre-commit` hooks (#12475 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-27 17:23:08 -07:00
Nicolò Lucchesi	6116ca8cd7	[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill (#10132 ) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: wallashss <wallashss@ibm.com> Co-authored-by: wallashss <wallashss@ibm.com>	2025-01-27 13:38:35 -08:00
Bowen Wang	2bc3fbba0c	[FlashInfer] Upgrade to 0.2.0 (#11194 ) Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-01-27 18:19:24 +00:00
Woosuk Kwon	3f1fc7425a	[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-27 09:40:04 -08:00
Mark McLoughlin	01ba927040	[V1][Metrics] Add initial Prometheus logger (#12416 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-27 12:26:28 -05:00
Pooya Davoodi	0cc6b383d7	[Frontend] Support scores endpoint in run_batch (#12430 ) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>	2025-01-27 04:30:17 +00:00
Kyle Mistele	0034b09ceb	[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376 ) Signed-off-by: Kyle Mistele <kyle@mistele.com>	2025-01-26 19:58:45 -07:00
Tyler Michael Smith	72f4880425	[Bugfix/CI] Fix broken kernels/test_mha.py (#12450 )	2025-01-26 10:39:03 -08:00
Tyler Michael Smith	aa2cd2c43d	[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2025-01-26 19:59:58 +08:00
Matthew Hendrey	9ddc35220b	[Frontend] generation_config.json for maximum tokens(#12242 ) Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-01-26 19:59:25 +08:00
Isotr0py	f1fc0510df	[Misc] Add FA2 support to ViT MHA layer (#12355 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-25 15:07:35 +08:00
Cyrus Leung	df5dafaa5b	[Misc] Remove deprecated code (#12383 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-24 14:45:20 -05:00
Lucas Wilkinson	ab5bbf5ae3	[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-24 15:27:59 +00:00
Nick Hill	24b0205f58	[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298 ) Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2025-01-23 17:17:41 -08:00
Gregory Shtrasberg	e97f802b2d	[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com>	2025-01-23 18:04:03 +00:00
Lucas Wilkinson	978b45f399	[Kernel] Flash Attention 3 Support (#12093 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-23 06:45:48 -08:00
Cody Yu	f0ef37233e	[V1] Add `uncache_blocks` (#12333 )	2025-01-23 04:19:21 +00:00
rasmith	68c4421b6d	[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2025-01-23 00:10:37 +00:00
Cody Yu	7206ce4ce1	[Core] Support `reset_prefix_cache` (#12284 )	2025-01-22 18:52:27 +00:00
youkaichao	68ad4e3a8d	[Core] Support fully transparent sleep mode (#11743 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-22 14:39:32 +08:00
Kevin H. Luu	64ea24d0b3	[ci/lint] Add back default arg for pre-commit (#12279 ) Signed-off-by: kevin <kevin@anyscale.com>	2025-01-22 01:15:27 +00:00
Cyrus Leung	df76e5af26	[VLM] Simplify post-processing of replacement info (#12269 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-21 16:48:13 -08:00
Adrian Cole	347eeebe3b	[Misc] Remove experimental dep from tracing.py (#12007 ) Signed-off-by: Adrian Cole <adrian.cole@elastic.co>	2025-01-21 11:51:55 -08:00
Andy Lo	18fd4a8331	[Bugfix] Multi-sequence broken (#11898 ) Signed-off-by: Andy Lo <andy@mistral.ai>	2025-01-21 11:51:35 -08:00
Ricky Xu	132a132100	[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907 ) Signed-off-by: rickyx <rickyx@anyscale.com>	2025-01-21 11:51:13 -08:00
Nicolò Lucchesi	5fe6bf29d6	[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-01-21 12:23:14 +08:00
Cyrus Leung	18572e3384	[Bugfix] Fix `HfExampleModels.find_hf_info` (#12223 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-20 15:35:36 +00:00
Cyrus Leung	b37d82791e	[Model] Upgrade Aria to transformers 4.48 (#12203 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-20 17:58:48 +08:00
Cyrus Leung	59a0192fb9	[Core] Interface for accessing model from `VllmRunner` (#10353 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-20 15:00:59 +08:00
Isotr0py	83609791d2	[Model] Add Qwen2 PRM model support (#12202 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-20 14:59:46 +08:00
Martin Gleize	bbe5f9de7d	[Model] Support for fairseq2 Llama (#11442 ) Signed-off-by: Martin Gleize <mgleize@meta.com> Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>	2025-01-19 10:40:40 -08:00
Roger Wang	81763c58a0	[V1] Add V1 support of Qwen2-VL (#12128 ) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: imkero <kerorek@outlook.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-19 19:52:13 +08:00
yancong	32eb0da808	[Misc] Support register quantization method out-of-tree (#11969 )	2025-01-18 16:13:16 -08:00
Isotr0py	02798ecabe	[Model] Port deepseek-vl2 processor, remove dependency (#12169 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-18 13:59:39 +08:00
youkaichao	da02cb4b27	[core] further polish memory profiling (#12126 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-18 12:25:08 +08:00
Wallas Henrique	58fd57ff1d	[Bugfix] Fix score api for missing max_model_len validation (#12119 ) Signed-off-by: Wallas Santos <wallashss@ibm.com>	2025-01-17 16:24:22 +00:00
youkaichao	87a0c076af	[core] allow callable in collective_rpc (#12151 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-17 20:47:01 +08:00
Jee Jee Li	07934cc237	[Misc][LoRA] Improve the readability of LoRA error messages (#12102 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-17 19:32:28 +08:00
Chen Zhang	69d765f5a5	[V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2025-01-17 07:39:35 +00:00
Isotr0py	d75ab55f10	[Misc] Add deepseek_vl2 chat template (#12143 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-17 06:34:48 +00:00
Isotr0py	62b06ba23d	[Model] Add support for deepseek-vl2-tiny model (#12068 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-16 17:14:48 +00:00
Roger Wang	874f7c292a	[Bugfix] Fix max image feature size for Llava-one-vision (#12104 ) Signed-off-by: Roger Wang <ywang@roblox.com>	2025-01-16 14:54:06 +00:00
youkaichao	bf53e0c70b	Support torchrun and SPMD-style offline inference (#12071 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-16 19:58:53 +08:00
Isotr0py	dd7c9ad870	[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12067 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-16 10:11:54 +00:00
Joe Runde	edce722eaa	[Bugfix] use right truncation for non-generative tasks (#12050 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2025-01-16 00:31:01 +08:00
kewang-xlnx	de0526f668	[Misc][Quark] Upstream Quark format to VLLM (#10765 ) Signed-off-by: kewang-xlnx <kewang@xilinx.com> Signed-off-by: kewang2 <kewang2@amd.com> Co-authored-by: kewang2 <kewang2@amd.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-01-15 11:05:15 -05:00
RunningLeon	97eb97b5a4	[Model]: Support internlm3 (#12037 )	2025-01-15 11:35:17 +00:00
wangxiyuan	3adf0ffda8	[Platform] Do not raise error if _Backend is not found (#12023 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-15 10:14:15 +00:00
Chen Zhang	994fc655b7	[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003 )	2025-01-15 07:55:30 +00:00
youkaichao	ad34c0df0f	[core] platform agnostic executor via collective_rpc (#11256 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-15 13:45:21 +08:00
Elfie Guo	0794e7446e	[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467 )	2025-01-15 12:47:49 +08:00
Jee Jee Li	42f5e7c52a	[Kernel] Support MulAndSilu (#11624 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-15 02:29:53 +00:00
Cyrus Leung	bb354e6b2d	[Bugfix] Fix various bugs in multi-modal processor (#12031 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-14 12:16:11 +00:00
Yangcheng Li	f7b3ba82c3	[MISC] fix typo in kv transfer send recv test (#11983 )	2025-01-13 05:07:48 +00:00
Robert Shaw	619ae268c3	[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2025-01-13 04:54:10 +00:00
Isotr0py	d14e98d924	[Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-01-13 00:13:44 +00:00
Robert Shaw	9597a095f2	[V1][Core][1/n] Logging and Metrics (#11962 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2025-01-12 21:02:02 +00:00
Avshalom Manevich	263a870ee1	[Hardware][TPU] workaround fix for MoE on TPU (#11764 )	2025-01-12 10:53:51 -05:00
Akshat Tripathi	8bddb73512	[Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100 ) Signed-off-by: Akshat Tripathi <akshat@krai.ai> Signed-off-by: Oleg Mosalov <oleg@krai.ai> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Oleg Mosalov <oleg@krai.ai> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-01-12 13:01:52 +00:00
Isotr0py	f967e51f38	[Model] Initialize support for Deepseek-VL2 models (#11578 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-01-12 00:17:24 -08:00
Nicolò Lucchesi	d697dc01b4	[Bugfix] Fix RobertaModel loading (#11940 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2025-01-11 14:05:09 +00:00
Cyrus Leung	a991f7d508	[Doc] Basic guide for writing unit tests for new models (#11951 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-11 21:27:24 +08:00
Cyrus Leung	7a3a83e3b8	[CI/Build] Move model-specific multi-modal processing tests (#11934 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-11 13:50:05 +08:00
youkaichao	899136b857	[ci] fix broken distributed-tests-4-gpus (#11937 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-11 09:07:24 +08:00
Li, Jiang	aa1e77a19c	[Hardware][CPU] Support MOE models on x86 CPU (#11831 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2025-01-10 11:07:58 -05:00
Harry Mellor	482cdc494e	[Doc] Rename offline inference examples (#11927 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-10 23:50:29 +08:00
youkaichao	241ad7b301	[ci] Fix sampler tests (#11922 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-10 20:45:33 +08:00
Harry Mellor	d85c47d6ad	Replace "online inference" with "online serving" (#11923 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-10 12:05:56 +00:00
Joe Runde	ac2f3f7fee	[Bugfix] Validate lora adapters to avoid crashing server (#11727 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-10 15:56:36 +08:00
Chen Zhang	cf5f000d21	[torch.compile] Hide KV cache behind torch.compile boundary (#11677 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-10 13:14:42 +08:00
Cyrus Leung	b844b99ad3	[VLM] Enable tokenized inputs for merged multi-modal processor (#11900 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-10 03:24:00 +00:00
Cyrus Leung	9a228348d2	[Misc] Provide correct Pixtral-HF chat template (#11891 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-09 10:19:37 -07:00
youkaichao	bd82872211	[ci]try to fix flaky multi-step tests (#11894 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-01-09 14:47:29 +00:00
wangxiyuan	405eb8e396	[platform] Allow platform specify attention backend (#11609 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-09 21:46:50 +08:00
Cyrus Leung	0bd1ff4346	[Bugfix] Override dunder methods of placeholder modules (#11882 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-09 09:02:53 +00:00
Maximilien de Bayser	1fe554bac3	treat do_lower_case in the same way as the sentence-transformers library (#11815 ) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>	2025-01-09 11:05:43 +08:00
Tyler Michael Smith	615e4a5401	[CI] Turn on basic correctness tests for V1 (#10864 )	2025-01-08 21:20:44 -05:00
Robert Shaw	56fe4c297c	[TPU][Quantization] TPU `W8A8` (#11785 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-08 19:33:29 +00:00
Harry Mellor	aba8d6ee00	[Doc] Move examples into categories (#11840 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-08 13:09:53 +00:00
Cyrus Leung	2a0596bc48	[VLM] Reorganize profiling/processing-related code (#11812 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-08 18:59:58 +08:00
youkaichao	889e662eae	[misc] improve memory profiling (#11809 ) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-01-08 06:36:03 +00:00
Cyrus Leung	8f37be38eb	[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (#11800 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-07 18:25:02 +08:00
Jee Jee Li	b278557935	[Kernel][LoRA]Punica prefill kernels fusion (#11234 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Abatom <abzhonghua@gmail.com> Co-authored-by: Zhonghua Deng <abatom@163.com>	2025-01-07 04:01:39 +00:00
Cyrus Leung	08fb75c72e	[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-07 01:10:54 +00:00
Roger Wang	91b361ae89	[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (#11685 ) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-06 19:58:16 +00:00
Chen Zhang	e20c92bb61	[Kernel] Move attn_type to Attention.__init__() (#11690 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-07 00:11:28 +08:00
Jee Jee Li	32c9eff2ff	[Bugfix][V1] Fix molmo text-only inputs (#11676 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-06 15:22:25 +00:00
Cyrus Leung	996357e480	[VLM] Separate out profiling-related logic (#11746 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-06 16:02:21 +08:00
Rui Qiao	022c5c6944	[V1] Refactor get_executor_cls (#11754 )	2025-01-06 07:59:16 +00:00
cennn	9e764e7b10	[distributed] remove pynccl's redundant change_state (#11749 )	2025-01-06 09:05:48 +08:00
cennn	635b897246	[distributed] remove pynccl's redundant stream (#11744 )	2025-01-05 23:09:11 +08:00
Jee Jee Li	47831430cc	[Bugfix][V1] Fix test_kv_cache_utils.py (#11738 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-04 16:07:59 +00:00
Cyrus Leung	ba214dffbe	[Bugfix] Fix precision error in LLaVA-NeXT (#11735 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-04 23:45:57 +08:00
Cyrus Leung	eed11ebee9	[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (#11717 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-04 11:40:53 +00:00
Yan Burman	300acb8347	[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (#11233 ) Signed-off-by: Yan Burman <yanburman@users.noreply.github.com> Signed-off-by: Ido Asraff <idoa@atero.ai>	2025-01-04 14:50:16 +08:00
xcnick	d91457d529	[V1] Add kv cache utils tests. (#11513 ) Signed-off-by: xcnick <xcnick0412@gmail.com>	2025-01-04 14:49:46 +08:00
Robert Shaw	80c751e7f6	[V1] Simplify Shutdown (#11659 )	2025-01-03 17:25:38 +00:00
Aurick Qiao	e1a5c2f0a1	[Model] Whisper model implementation (#11280 ) Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>	2025-01-03 16:39:19 +08:00
Cyrus Leung	8c38ee7007	[VLM] Merged multi-modal processor for LLaVA-NeXT (#11682 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-02 16:39:27 +00:00
Cyrus Leung	a115ac46b5	[VLM] Move supported limits and max tokens to merged multi-modal processor (#11669 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-01-01 15:44:42 +00:00
Woosuk Kwon	73001445fb	[V1] Implement Cascade Attention (#11635 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-01 21:56:46 +09:00
Jee Jee Li	11d8a091c6	[Misc] Optimize Qwen2-VL LoRA test (#11663 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-01 14:42:23 +08:00
Cyrus Leung	365801fedd	[VLM] Add max-count checking in data parser for single image models (#11661 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-12-31 22:15:21 -08:00
Joe Runde	4db72e57f6	[Bugfix][Refactor] Unify model management in frontend (#11660 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2025-01-01 02:21:51 +00:00
Roger Wang	e7c7c5e822	[V1][VLM] V1 support for selected single-image models. (#11632 ) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Isotr0py <2037008807@qq.com>	2024-12-31 21:17:22 +00:00
Chen Zhang	8c3230d8c1	[V1] Simpify vision block hash for prefix caching by removing offset from hash (#11646 )	2024-12-31 08:56:01 +00:00
sakunkun	2c5718809b	[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (#11565 )	2024-12-31 06:29:04 +00:00
John Giorgi	82c49d3260	[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-30 22:15:58 -08:00
Michael Goin	74fa1d123c	[Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-31 03:43:54 +00:00
youkaichao	b12e87f942	[platforms] enable platform plugins (#11602 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-30 20:24:45 +08:00
youkaichao	3682e33f9f	[v1] fix compilation cache (#11598 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-30 04:24:12 +00:00
Robert Shaw	4fb8e329fd	[V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input (#11545 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2024-12-28 20:51:57 +00:00
youkaichao	328841d002	[bugfix] interleaving sliding window for cohere2 model (#11583 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-28 16:55:42 +00:00
Isotr0py	d34be24bb1	[Model] Support InternLM2 Reward models (#11571 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-12-28 06:14:10 +00:00
Robert Shaw	df04dffade	[V1] [4/N] API Server: ZMQ/MP Utilities (#11541 )	2024-12-28 01:45:08 +00:00
ErezSC42	55509c2114	[MODEL] LoRA support for Jamba model (#11209 ) Signed-off-by: Erez Schwartz <erezs@ai21.com>	2024-12-27 17:58:21 +00:00
Cyrus Leung	101418096f	[VLM] Support caching in merged multi-modal processor (#11396 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-27 17:22:48 +00:00
Cyrus Leung	7af553ea30	[Misc] Abstract the logic for reading and writing media content (#11527 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-27 19:21:23 +08:00
Robert Shaw	46d4359450	[CI] Fix broken CI (#11543 )	2024-12-26 18:49:16 -08:00
Woosuk Kwon	371d04d39b	[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-12-27 09:32:38 +09:00
Michael Goin	2072924d14	[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523 ) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: HandH1998 <1335248067@qq.com>	2024-12-26 15:33:30 -08:00
Cyrus Leung	eec906d811	[Misc] Add placeholder module (#11501 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-26 13:12:51 +00:00
sroy745	dcb1a944d4	[V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (#10681 ) Signed-off-by: Sourashis Roy <sroy@roblox.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-12-26 19:02:58 +09:00
Jee Jee Li	aa25985bd1	[Misc][LoRA] Fix LoRA weight mapper (#11495 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-26 15:52:48 +08:00
Cyrus Leung	51a624bf02	[Misc] Move some multimodal utils to modality-specific modules (#11494 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-26 04:23:20 +00:00
Jiaxin Shan	fc601665eb	[Misc] Update disaggregation benchmark scripts and test logs (#11456 ) Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>	2024-12-25 06:58:48 +00:00
Rui Qiao	9832e5572a	[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#11472 )	2024-12-24 19:49:46 -08:00
Cyrus Leung	3f3e92e1f2	[Model] Automatic conversion of classification and reward models (#11469 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-24 18:22:22 +00:00
Jee Jee Li	196c34b0ac	[Misc] Move weights mapper (#11443 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-24 13:05:25 +00:00
Jee Jee Li	b1b1038fbd	[Bugfix] Fix Qwen2-VL LoRA weight loading (#11430 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-24 09:56:10 +00:00
Cyrus Leung	9edca6bf8f	[Frontend] Online Pooling API (#11457 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-24 17:54:30 +08:00
Rui Qiao	a491d6f535	[V1] TP Ray executor (#11107 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2024-12-23 23:00:12 +00:00
Michael Goin	63afbe9215	[CI] Expand OpenAI test_chat.py guided decoding tests (#11048 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-23 18:35:38 +00:00
Dipika Sikka	8cef6e02dc	[Misc] add w8a8 asym models (#11075 )	2024-12-23 13:33:20 -05:00
Michael Goin	5bfb30a529	[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF (#11389 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-23 23:06:20 +08:00
Jason T. Greene	f1d1bf6288	[Bugfix] Fix fully sharded LoRAs with Mixtral (#11390 ) Signed-off-by: Jason Greene <jason.greene@redhat.com>	2024-12-22 23:25:10 +08:00
Roger Wang	29c748930e	[CI] Fix flaky entrypoint tests (#11403 ) Signed-off-by: Roger Wang <ywang@roblox.com>	2024-12-21 21:08:44 -08:00
omer-dayan	995f56236b	[Core] Loading model from S3 using RunAI Model Streamer as optional loader (#10192 ) Signed-off-by: OmerD <omer@run.ai>	2024-12-20 16:46:24 +00:00
Wallas Henrique	86c2d8fd1c	[Bugfix] Fix spec decoding when seed is none in a batch (#10863 ) Signed-off-by: Wallas Santos <wallashss@ibm.com>	2024-12-20 05:15:31 +00:00
Isotr0py	e24113a8fe	[Model] Refactor Qwen2-VL to use merged multimodal processor (#11258 ) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-19 16:28:00 +00:00
Yehoshua Cohen	6c7f881541	[Model] Add JambaForSequenceClassification model (#10860 ) Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-19 22:48:06 +08:00
Yanyi Liu	5aef49806d	[Feature] Add load generation config from model (#11164 ) Signed-off-by: liuyanyi <wolfsonliu@163.com> Signed-off-by: Yanyi Liu <wolfsonliu@163.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-12-19 10:50:38 +00:00
Cyrus Leung	6142ef0ada	[VLM] Merged multimodal processor for Qwen2-Audio (#11303 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-19 06:14:17 +00:00
Michael Goin	a30482f054	[CI] Expand test_guided_generate to test all backends (#11313 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-19 04:00:38 +00:00
Travis Johnson	17ca964273	[Model] IBM Granite 3.1 (#11307 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-12-19 11:27:24 +08:00
Tyler Michael Smith	5a9da2e6e9	[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-12-19 02:43:30 +00:00
Joe Runde	ca5f54a9b9	[Bugfix] fix minicpmv test (#11304 ) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-12-18 10:34:26 -08:00
Isotr0py	996aa70f00	[Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2024-12-18 10:16:40 -08:00
Dipika Sikka	60508ffda9	[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995 ) Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com> Co-authored-by: ilmarkov <markovilya197@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2024-12-18 09:57:16 -05:00
Wallas Henrique	8b79f9e107	[Bugfix] Fix guided decoding with tokenizer mode mistral (#11046 )	2024-12-17 22:34:08 -08:00
Cody Yu	bf8717ebae	[V1] Prefix caching for vision language models (#11187 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2024-12-17 16:37:59 -08:00
Michael Goin	c77eb8a33c	[Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264 )	2024-12-17 16:34:06 -08:00
Joe Runde	2d1b9baa8f	[Bugfix] Fix request cancellation without polling (#11190 )	2024-12-17 12:26:32 -08:00
kYLe	66d4b16724	[Frontend] Add OpenAI API support for input_audio (#11027 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-16 22:09:58 -08:00
Michael Goin	0064f697d3	[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse (#10935 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-17 11:39:58 +08:00
youkaichao	551603feff	[core] overhaul memory profiling and fix backward compatibility (#10511 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-16 13:32:25 -08:00
Isotr0py	d927dbcd88	[Model] Refactor Ultravox to use merged input processor (#11198 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-12-16 10:09:53 +00:00
Jani Monoses	bddbbcb132	[Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203 )	2024-12-16 09:56:19 +00:00
Cyrus Leung	b10609e6a1	[Misc] Clean up multi-modal processor (#11207 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-15 06:30:28 +00:00
Cyrus Leung	93abf23a64	[VLM] Fully dynamic prompt replacement in merged input processor (#11199 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-14 17:52:18 +00:00
Brad Hilton	9c3dadd1c9	[Frontend] Add `logits_processors` as an extra completion argument (#11150 ) Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>	2024-12-14 16:46:42 +00:00
Cyrus Leung	0920ab9131	[Doc] Reorganize online pooling APIs (#11172 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-14 00:22:22 +08:00
Sungjae Lee	c31d4a57a6	[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240 )	2024-12-13 07:51:25 -08:00
Cyrus Leung	eeec9e3390	[Frontend] Separate pooling APIs in offline inference (#11129 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-13 10:40:07 +00:00
youkaichao	be39e3cd18	[core] clean up cudagraph batchsize padding logic (#10996 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-13 06:57:50 +00:00
Pooya Davoodi	1efce68605	[Bugfix] Use runner_type instead of task in GritLM (#11144 ) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>	2024-12-13 04:09:53 +00:00
Luka Govedič	30870b4f66	[torch.compile] Dynamic fp8 + rms_norm fusion (#10906 ) Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-12-13 03:19:23 +00:00
Cody Yu	78ed8f57d8	[Misc][V1] Fix type in v1 prefix caching (#11151 )	2024-12-13 00:57:40 +00:00
Jiaxin Shan	85362f028c	[Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094 ) Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-12 09:25:16 +00:00
youkaichao	62de37a38e	[core][distributed] initialization from StatelessProcessGroup (#10986 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-12 09:04:19 +00:00
Pooya Davoodi	1da8f0e1dd	[Model] Add support for embedding model GritLM (#10816 ) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>	2024-12-12 06:39:16 +00:00
Alexander Matveev	4e11683368	[V1] VLM preprocessor hashing (#11020 ) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Alexander Matveev <alexm@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-12-12 00:55:30 +00:00
Cyrus Leung	d1e21a979b	[CI/Build] Split up VLM tests (#11083 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-12 06:18:16 +08:00
Cyrus Leung	8f10d5e393	[Misc] Split up pooling tasks (#10820 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-11 01:28:00 -08:00
Cyrus Leung	2e33fe4191	[CI/Build] Check transformers v4.47 (#10991 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-11 05:02:02 +00:00
Mor Zusman	ffa48c9146	[Model] PP support for Mamba-like models (#10992 ) Signed-off-by: mzusman <mor.zusmann@gmail.com>	2024-12-10 21:53:37 -05:00
Aurick Qiao	d5c5154fcf	[Misc] LoRA + Chunked Prefill (#9057 )	2024-12-11 10:09:20 +08:00
Jee Jee Li	d05f88679b	[Misc][LoRA] Add PEFTHelper for LoRA (#11003 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-10 11:12:01 +00:00
youkaichao	ebf778061d	monitor metrics of tokens per step using cudagraph batchsizes (#11031 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-09 22:35:36 -08:00
Tyler Michael Smith	28b3a1c7e5	[V1] Multiprocessing Tensor Parallel Support for v1 (#9856 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-12-10 06:28:14 +00:00
Isotr0py	a811dd6608	[Model] merged input processor for Phi-3-Vision models (#10977 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-12-09 12:55:10 -08:00
Jee Jee Li	ca871491ed	[Misc][LoRA] Abstract PunicaWrapper (#10955 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-09 12:54:44 -08:00
youkaichao	fd57d2b534	[torch.compile] allow candidate compile sizes (#10984 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-08 11:05:21 +00:00
zhou fan	78029b34ed	[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928 ) Signed-off-by: xffxff <1247714429@qq.com>	2024-12-08 01:21:18 +08:00
Cyrus Leung	c889d5888b	[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-07 17:20:49 +00:00
Cyrus Leung	39e227c7ae	[Model] Update multi-modal processor to support Mantis(LLaVA) model (#10711 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-07 17:10:05 +00:00
Cyrus Leung	955fa9533a	[3/N] Support and implement merged input processor for LLaVA model (#10676 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-12-07 00:50:58 -08:00
Cyrus Leung	222f5b082a	[CI/Build] Fix broken multimodal test (#10950 )	2024-12-06 10:41:23 +00:00
youkaichao	9743d64e4e	[ci][build] add tests for python only compilation (#10915 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-05 08:54:47 -08:00
Isotr0py	998eeafe58	[CI/Build] Bump test transformers version (#10106 ) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-05 16:05:52 +00:00
Jee Jee Li	571da8fc43	[Misc][LoRA] Clean up the function interface of Punica (#10917 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-05 13:22:28 +00:00
Michael Goin	8d370e91cb	[Bugfix] Fallback to outlines for complex json schemas (#10899 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2024-12-05 11:14:06 +08:00
Xin Yang	01d079fd8e	[LoRA] Change lora_tokenizers capacity (#10796 ) Signed-off-by: Xin Yang <xyang19@gmail.com>	2024-12-04 17:40:16 +00:00
Tyler Michael Smith	d2bd88b122	[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-12-04 03:23:21 +00:00
Alexander Matveev	3bc94cab69	[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#10640 ) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-12-03 10:33:10 +00:00
Aaron Pham	9323a3153b	[Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785 ) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-12-03 15:17:00 +08:00
youkaichao	dc5ce861bf	[torch.compile] remove compilation_context and simplify code (#10838 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-03 06:19:02 +00:00
youkaichao	21fe7b481a	[core][distributed] add pynccl broadcast (#10843 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-03 04:53:23 +00:00
Jee Jee Li	b45f0d7946	[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-12-02 17:53:36 +00:00
zhou fan	ef31eabc68	[Model]: add some tests for aria model (#10770 ) Signed-off-by: xffxff <1247714429@qq.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2024-12-02 05:36:36 +00:00
Woosuk Kwon	073a4bd1c0	[Kernel] Use `out` arg in flash_attn_varlen_func (#10811 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-12-01 17:55:39 -08:00
Kuntai Du	0590ec3fd9	[Core] Implement disagg prefill by StatelessProcessGroup (#10502 ) This PR provides initial support for single-node disaggregated prefill in 1P1D scenario. Signed-off-by: KuntaiDu <kuntai@uchicago.edu> Co-authored-by: ApostaC <yihua98@uchicago.edu> Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>	2024-12-01 19:01:00 -06:00
Cyrus Leung	d2f058e76c	[Misc] Rename embedding classes to pooling (#10801 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-01 14:36:51 +08:00
Cyrus Leung	133707123e	[Model] Replace embedding models with pooling adapter (#10769 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-12-01 08:02:54 +08:00
Cyrus Leung	fa6ecb9aa7	[Model] Clean up MiniCPMV (#10751 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-29 04:47:06 +00:00
sixgod	5fc5ce0fe4	[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-11-28 14:53:31 +00:00
Woosuk Kwon	a79b122400	[V1] Do not allocate beyond the max_model_len (#10730 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-11-28 00:13:15 -08:00
Ricky Xu	d9b4b3f069	[Bug][CLI] Allow users to disable prefix caching explicitly (#10724 ) Signed-off-by: rickyx <rickyx@anyscale.com>	2024-11-27 23:59:28 -08:00
tomeras91	395b1c7454	[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635 ) Signed-off-by: Tomer Asida <tomera@ai21.com>	2024-11-27 13:21:10 -08:00
Mor Zusman	197b4484a3	[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705 ) Signed-off-by: mzusman <mor.zusmann@gmail.com>	2024-11-27 19:02:27 +00:00
youkaichao	308cc5e21e	[ci] fix slow tests (#10698 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-27 09:26:14 -08:00
shunxing12345	1209261e93	[Model] Support telechat2 (#10311 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn> Co-authored-by: Isotr0py <2037008807@qq.com>	2024-11-27 11:32:35 +00:00
jeongin601	1bf905ddaa	[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198 ) Signed-off-by: jeongin601 <0200angela@gmail.com> Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>	2024-11-27 05:07:30 +00:00
Chendi.Xue	0a71900bc9	Remove hard-dependencies of Speculative decode to CUDA workers (#10587 ) Signed-off-by: Chendi Xue <chendi.xue@intel.com>	2024-11-26 17:57:11 -08:00
Murali Andoorveedu	db66e018ea	[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232 ) Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com> Signed-off-by: Sourashis Roy <sroy@roblox.com> Co-authored-by: Sourashis Roy <sroy@roblox.com>	2024-11-26 09:11:16 -08:00
youkaichao	334d64d1e8	[ci] add vllm_test_utils (#10659 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-26 00:20:04 -08:00
Sage Moore	9a88f89799	custom allreduce + torch.compile (#10121 ) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2024-11-25 22:00:16 -08:00
Ricky Xu	519e8e4182	[v1] EngineArgs for better config handling for v1 (#10382 ) Signed-off-by: rickyx <rickyx@anyscale.com>	2024-11-25 21:09:43 -08:00
Shane A	9db713a1dc	[Model] Add OLMo November 2024 model (#10503 )	2024-11-25 17:26:40 -05:00
zhou fan	b1d920531f	[Model]: Add support for Aria model (#10514 ) Signed-off-by: xffxff <1247714429@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2024-11-25 18:10:55 +00:00
Wallas Henrique	c27df94e1f	[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850 ) Signed-off-by: Wallas Santos <wallashss@ibm.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-11-25 12:23:32 -05:00
Chauncey	d04b13a380	[Bug]: Authorization ignored when root_path is set (#10606 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2024-11-25 16:21:41 +00:00
Cyrus Leung	ed46f14321	[Model] Support `is_causal` HF config field for Qwen2 model (#10621 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-25 09:51:20 +00:00
youkaichao	05d1f8c9c6	[misc] move functions to config.py (#10624 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-25 09:27:30 +00:00
youkaichao	25d806e953	[misc] add torch.compile compatibility check (#10618 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-24 23:40:08 -08:00
youkaichao	571841b7fc	[torch.compile] support encoder based models (#10613 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-25 05:24:33 +00:00
Maximilien de Bayser	214efc2c3c	Support Cross encoder models (#10400 ) Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Flavia Beo <flavia.beo@ibm.com> Co-authored-by: Flavia Beo <flavia.beo@ibm.com>	2024-11-24 18:56:20 -08:00
Jee Jee Li	1700c543a5	[Bugfix] Fix LoRA weight sharding (#10450 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-11-23 17:23:17 -08:00
Cyrus Leung	c8acd80548	[2/N] handling placeholders in merged multi-modal processor (#10485 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-22 21:25:09 -08:00
Ricky Xu	4634a89d18	Prefix Cache Aware Scheduling [1/n] (#10128 ) Signed-off-by: rickyx <rickyx@anyscale.com>	2024-11-22 21:15:55 -08:00
Varun Vinayak Shenoy	7d8ffb344f	[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567 ) Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>	2024-11-22 21:13:29 -08:00
youkaichao	4aba6e3d1a	[core] gemma2 full context length support (#10584 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-22 20:13:54 -08:00
Tyler Michael Smith	978b39744b	[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432 )	2024-11-22 22:14:03 -05:00
Travis Johnson	9195dbdbca	[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-11-23 10:17:38 +08:00
Ricky Xu	97814fbf0f	[v1] Refactor KVCacheManager for more hash input than token ids (#10507 ) Signed-off-by: rickyx <rickyx@anyscale.com> Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-11-22 23:27:25 +00:00
youkaichao	eebad39f26	[torch.compile] support all attention backends (#10558 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-22 14:04:42 -08:00
youkaichao	db100c5cde	[bugfix] fix full graph tests (#10581 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-22 10:02:14 -08:00
youkaichao	33e0a2540a	[9/N] torch.compile LLM usage (#10552 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-21 19:13:31 -08:00
youkaichao	7560ae5caf	[8/N] enable cli flag without a space (#10529 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-21 12:30:42 -08:00
Jee Jee Li	2385b60d83	[Kernel] Register punica ops directly (#10522 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-11-21 09:18:11 -08:00
Chauncey	da7e702c6f	[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2024-11-21 16:24:32 +00:00
Isotr0py	d5ec121f95	[Model] Expose `dynamic_image_size` as mm_processor_kwargs for InternVL2 models (#10518 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2024-11-21 14:20:08 +00:00
Luka Govedič	8b0fe06c89	[torch.compile] Inductor code caching fix (#10273 ) Signed-off-by: luka <luka@neuralmagic.com> Signed-off-by: Luka Govedic <luka.govedic@gmail.com>	2024-11-20 21:44:57 -08:00
Pavani Majety	6c1208d083	[Core] Add Sliding Window Support with Flashinfer (#10462 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2024-11-20 19:56:47 -08:00
youkaichao	388ee3de66	[torch.compile] limit inductor threads and lazy import quant (#10482 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-20 18:36:33 -08:00
Guillaume Calmettes	c68f7ede6a	[Bugfix]: allow extra fields in requests to openai compatible server (#10463 ) Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>	2024-11-20 16:42:21 -05:00
youkaichao	0cd3d9717e	[7/N] torch.compile, reduce compilation time (#10460 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-20 11:20:38 -08:00
Li, Jiang	63f1fde277	[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2024-11-20 10:57:39 +00:00
Lucas Wilkinson	d200972e7f	[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-11-19 19:40:33 -08:00
ElizaWszola	b00b33d77e	[Model][Quantization] HQQ support through Marlin kernel expansion (#9766 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com>	2024-11-19 13:31:12 -08:00
youkaichao	803f37eaaa	[6/N] torch.compile rollout to users (#10437 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-19 10:09:03 -08:00
Mengqing Cao	8c1fb50705	[Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` (#10358 ) Signed-off-by: Mengqing Cao <cmq0113@163.com>	2024-11-19 11:22:26 +08:00
Lucas Wilkinson	96d999fbe8	[Kernel] Initial Machete W4A8 support + Refactors (#9855 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-11-18 12:59:29 -07:00
Yan Ma	6b2d25efc7	[Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107 ) Signed-off-by: yan ma <yan.ma@intel.com>	2024-11-18 11:18:05 -07:00
lkchen	c7dec926f6	[VLM] Report multi_modal_placeholders in output (#10407 ) Signed-off-by: Linkun Chen <lkchen+anyscale@github.com>	2024-11-18 16:06:16 +08:00
youkaichao	4fd9375028	[2/N][torch.compile] make compilation cfg part of vllm cfg (#10383 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-16 18:02:14 -08:00
电脑星人	361c29e174	[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled (#10388 ) Signed-off-by: imkero <kerorek@outlook.com>	2024-11-17 02:10:00 +08:00
Cyrus Leung	32e46e000f	[Frontend] Automatic detection of chat content format from AST (#9919 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-16 13:35:40 +08:00
ElizaWszola	79ee45b428	[Misc] Bump up test_fused_moe tolerance (#10364 ) Signed-off-by: ElizaWszola <eliza@neuralmagic.com>	2024-11-15 16:31:18 +00:00
Cyrus Leung	b311efd0bd	[Misc] Fix import error in tensorizer tests and cleanup some code (#10349 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-15 09:34:17 +00:00
Cyrus Leung	2ac6d0e75b	[Misc] Consolidate pooler config overrides (#10351 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-15 06:59:00 +00:00
Cyrus Leung	b40cf6402e	[Model] Support Qwen2 embeddings and use tags to select model tests (#10184 )	2024-11-14 20:23:09 -08:00
Luka Govedič	bf2ddc6610	[bugfix] Fix static asymmetric quantization case (#10334 ) Signed-off-by: Daniël de Kok <me@danieldk.eu> Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-11-15 09:35:11 +08:00
Cyrus Leung	972112d82f	[Bugfix] Fix unable to load some models (#10312 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-14 16:55:54 -08:00
Patrick von Platen	11cd1ae6ad	[Tool parsing] Improve / correct mistral tool parsing (#10333 )	2024-11-15 00:42:49 +00:00
Maximilien de Bayser	4a18fd14ba	Support Roberta embedding models (#9387 ) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Flavia Beo <flavia.beo@ibm.com> Co-authored-by: Flavia Beo <flavia.beo@ibm.com>	2024-11-14 21:23:29 +00:00
youkaichao	29f3ef26a3	[ci][distributed] disable hanging tests (#10317 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-14 00:23:39 -08:00
Mike Depinet	f67ce05d0b	[Frontend] Pythonic tool parser (#9859 ) Signed-off-by: Mike Depinet <mike@fixie.ai>	2024-11-14 04:14:34 +00:00
Isotr0py	15bb8330aa	[Bugfix] Fix tensor parallel for qwen2 classification model (#10297 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2024-11-14 10:54:59 +08:00
HoangCongDuc	ac49b59d8b	[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200 ) Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>	2024-11-13 09:56:39 -07:00
Cyrus Leung	0b8bb86bf1	[1/N] Initial prototype for multi-modal processor (#10044 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-13 12:39:03 +00:00
Austin Veselka	1b886aa104	[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 (#9944 ) Signed-off-by: FurtherAI <austin.veselka@lighton.ai> Co-authored-by: FurtherAI <austin.veselka@lighton.ai>	2024-11-13 08:28:13 +00:00
电脑星人	3945c82346	[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions (#10221 ) Signed-off-by: imkero <kerorek@outlook.com>	2024-11-13 07:07:22 +00:00
youkaichao	0d4ea3fb5c	[core][distributed] use tcp store directly (#10275 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-12 17:36:08 -08:00
Woosuk Kwon	112fa0bbe5	[V1] Fix CI tests on V1 engine (#10272 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-11-12 16:17:20 -08:00
Umesh	8a06428c70	[LoRA] Adds support for bias in LoRA (#5733 ) Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com> Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com>	2024-11-12 11:08:40 -08:00
sroy745	b41fb9d3b1	[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers (#9982 ) Signed-off-by: Sourashis Roy <sroy@roblox.com>	2024-11-12 10:53:57 -08:00
zifeitong	47db6ec831	[Frontend] Add per-request number of cached token stats (#10174 )	2024-11-12 16:42:28 +00:00
Jee Jee Li	7f5edb5900	[Misc][LoRA] Replace hardcoded cuda device with configurable argument (#10223 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-11-12 11:10:15 +08:00
youkaichao	eea55cca5b	[1/N] torch.compile user interface design (#10237 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-11 18:01:06 -08:00
Robert Shaw	6ace6fba2c	[V1] `AsyncLLM` Implementation (#9826 ) Signed-off-by: Nick Hill <nickhill@us.ibm.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-11-11 23:05:38 +00:00
youkaichao	8a7fe47d32	[misc][distributed] auto port selection and disable tests (#10226 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-11 11:54:59 -08:00
youkaichao	330e82d34a	[v1][torch.compile] support managing cudagraph buffer (#10203 ) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-11-11 11:10:27 -08:00
youkaichao	e6de9784d2	[core][distributed] add stateless process group (#10216 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-11 09:02:14 -08:00
Jee Jee Li	36e4acd02a	[LoRA][Kernel] Remove the unused libentry module (#10214 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-11-11 09:43:23 +00:00
Isotr0py	58170d6503	[Hardware][CPU] Add embedding models support for CPU backend (#10193 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2024-11-11 08:54:28 +00:00
youkaichao	73b9083e99	[misc] improve cloudpickle registration and tests (#10202 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-11-11 00:10:53 +00:00
Cyrus Leung	51c2e1fcef	[CI/Build] Split up models tests (#10069 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-09 11:39:14 -08:00
Krishna Mandal	b09895a618	[Frontend][Core] Override HF `config.json` via CLI (#5836 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-09 16:19:27 +00:00
bnellnm	f192aeba74	[Bugfix] Enable some fp8 and quantized fullgraph tests (#10171 ) Signed-off-by: Bill Nell <bill@neuralmagic.com>	2024-11-09 08:01:27 +00:00
Isotr0py	47672f38b5	[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing (#10161 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2024-11-09 04:02:59 +00:00
Cyrus Leung	e0191a95d8	[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` (#10040 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-09 11:31:02 +08:00
rasmith	127c07480e	[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2024-11-08 19:59:22 -05:00

2143 Commits