Mark McLoughlin
46fb056749
[V1][Metrics] Add TTFT and TPOT histograms ( #12530 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-29 04:11:16 +00:00
Ce Gao
a7e3eba66f
[Frontend] Support reasoning content for deepseek r1 ( #12473 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-01-29 11:38:08 +08:00
Mark McLoughlin
c386c43ca3
[V1][Metrics] Add per-request prompt/generation_tokens histograms ( #12516 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 22:07:22 +00:00
Mark McLoughlin
3fd1fb63ef
[V1][Metrics] Hook up IterationStats for Prometheus metrics ( #12478 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 16:38:38 +00:00
Cyrus Leung
8f58a51358
[VLM] Merged multi-modal processor and V1 support for Qwen-VL ( #12504 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-28 16:25:05 +00:00
Gabriel Marinho
0f465ab533
[FEATURE] Enables offline /score for embedding models ( #12021 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
2025-01-28 11:30:13 +08:00
Liangfu Chen
ddee88d0ff
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache ( #11277 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
2025-01-27 17:31:16 -08:00
Harry Mellor
823ab79633
Update `pre-commit` hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-27 17:23:08 -07:00
Nicolò Lucchesi
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
Bowen Wang
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 ( #11194 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-27 18:19:24 +00:00
Woosuk Kwon
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling ( #12469 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 09:40:04 -08:00
Mark McLoughlin
01ba927040
[V1][Metrics] Add initial Prometheus logger ( #12416 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-27 12:26:28 -05:00
Pooya Davoodi
0cc6b383d7
[Frontend] Support scores endpoint in run_batch ( #12430 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-01-27 04:30:17 +00:00
Kyle Mistele
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) ( #12376 )
...
Signed-off-by: Kyle Mistele <kyle@mistele.com>
2025-01-26 19:58:45 -07:00
Tyler Michael Smith
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py ( #12450 )
2025-01-26 10:39:03 -08:00
Tyler Michael Smith
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 ( #12417 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-26 19:59:58 +08:00
Matthew Hendrey
9ddc35220b
[Frontend] generation_config.json for maximum tokens ( #12242 )
...
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-26 19:59:25 +08:00
Isotr0py
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer ( #12355 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-25 15:07:35 +08:00
Cyrus Leung
df5dafaa5b
[Misc] Remove deprecated code ( #12383 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-24 14:45:20 -05:00
Lucas Wilkinson
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build ( #12375 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 15:27:59 +00:00
Nick Hill
24b0205f58
[V1][Frontend] Coalesce bunched `RequestOutput`s ( #12298 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2025-01-23 17:17:41 -08:00
Gregory Shtrasberg
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
Lucas Wilkinson
978b45f399
[Kernel] Flash Attention 3 Support ( #12093 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-23 06:45:48 -08:00
Cody Yu
f0ef37233e
[V1] Add `uncache_blocks` ( #12333 )
2025-01-23 04:19:21 +00:00
rasmith
68c4421b6d
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD ( #12282 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-01-23 00:10:37 +00:00
Cody Yu
7206ce4ce1
[Core] Support `reset_prefix_cache` ( #12284 )
2025-01-22 18:52:27 +00:00
youkaichao
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:39:32 +08:00
Kevin H. Luu
64ea24d0b3
[ci/lint] Add back default arg for pre-commit ( #12279 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2025-01-22 01:15:27 +00:00
Cyrus Leung
df76e5af26
[VLM] Simplify post-processing of replacement info ( #12269 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 16:48:13 -08:00
Adrian Cole
347eeebe3b
[Misc] Remove experimental dep from tracing.py ( #12007 )
...
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
2025-01-21 11:51:55 -08:00
Andy Lo
18fd4a8331
[Bugfix] Multi-sequence broken ( #11898 )
...
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-01-21 11:51:35 -08:00
Ricky Xu
132a132100
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types ( #10907 )
...
Signed-off-by: rickyx <rickyx@anyscale.com>
2025-01-21 11:51:13 -08:00
Nicolò Lucchesi
5fe6bf29d6
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 ( #12230 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-21 12:23:14 +08:00
Cyrus Leung
18572e3384
[Bugfix] Fix `HfExampleModels.find_hf_info` ( #12223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:35:36 +00:00
Cyrus Leung
b37d82791e
[Model] Upgrade Aria to transformers 4.48 ( #12203 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 17:58:48 +08:00
Cyrus Leung
59a0192fb9
[Core] Interface for accessing model from `VllmRunner` ( #10353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:00:59 +08:00
Isotr0py
83609791d2
[Model] Add Qwen2 PRM model support ( #12202 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-20 14:59:46 +08:00
Martin Gleize
bbe5f9de7d
[Model] Support for fairseq2 Llama ( #11442 )
...
Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
2025-01-19 10:40:40 -08:00
Roger Wang
81763c58a0
[V1] Add V1 support of Qwen2-VL ( #12128 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-19 19:52:13 +08:00
yancong
32eb0da808
[Misc] Support register quantization method out-of-tree ( #11969 )
2025-01-18 16:13:16 -08:00
Isotr0py
02798ecabe
[Model] Port deepseek-vl2 processor, remove dependency ( #12169 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-18 13:59:39 +08:00
youkaichao
da02cb4b27
[core] further polish memory profiling ( #12126 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 12:25:08 +08:00
Wallas Henrique
58fd57ff1d
[Bugfix] Fix score api for missing max_model_len validation ( #12119 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2025-01-17 16:24:22 +00:00
youkaichao
87a0c076af
[core] allow callable in collective_rpc ( #12151 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:47:01 +08:00
Jee Jee Li
07934cc237
[Misc][LoRA] Improve the readability of LoRA error messages ( #12102 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-17 19:32:28 +08:00
Chen Zhang
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore ( #11960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-17 07:39:35 +00:00
Isotr0py
d75ab55f10
[Misc] Add deepseek_vl2 chat template ( #12143 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-17 06:34:48 +00:00
Isotr0py
62b06ba23d
[Model] Add support for deepseek-vl2-tiny model ( #12068 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 17:14:48 +00:00
Roger Wang
874f7c292a
[Bugfix] Fix max image feature size for Llava-one-vision ( #12104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-16 14:54:06 +00:00
youkaichao
bf53e0c70b
Support torchrun and SPMD-style offline inference ( #12071 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 19:58:53 +08:00
Isotr0py
dd7c9ad870
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 ( #12067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 10:11:54 +00:00
Joe Runde
edce722eaa
[Bugfix] use right truncation for non-generative tasks ( #12050 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-16 00:31:01 +08:00
kewang-xlnx
de0526f668
[Misc][Quark] Upstream Quark format to VLLM ( #10765 )
...
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-15 11:05:15 -05:00
RunningLeon
97eb97b5a4
[Model]: Support internlm3 ( #12037 )
2025-01-15 11:35:17 +00:00
wangxiyuan
3adf0ffda8
[Platform] Do not raise error if _Backend is not found ( #12023 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-15 10:14:15 +00:00
Chen Zhang
994fc655b7
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager ( #12003 )
2025-01-15 07:55:30 +00:00
youkaichao
ad34c0df0f
[core] platform agnostic executor via collective_rpc ( #11256 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-15 13:45:21 +08:00
Elfie Guo
0794e7446e
[Misc] Add multistep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
Jee Jee Li
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-15 02:29:53 +00:00
Cyrus Leung
bb354e6b2d
[Bugfix] Fix various bugs in multi-modal processor ( #12031 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-14 12:16:11 +00:00
Yangcheng Li
f7b3ba82c3
[MISC] fix typo in kv transfer send recv test ( #11983 )
2025-01-13 05:07:48 +00:00
Robert Shaw
619ae268c3
[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction ( #11973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-13 04:54:10 +00:00
Isotr0py
d14e98d924
[Model] Support GGUF models newly added in `transformers` 4.46.0 ( #9685 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-13 00:13:44 +00:00
Robert Shaw
9597a095f2
[V1][Core][1/n] Logging and Metrics ( #11962 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-12 21:02:02 +00:00
Avshalom Manevich
263a870ee1
[Hardware][TPU] workaround fix for MoE on TPU ( #11764 )
2025-01-12 10:53:51 -05:00
Akshat Tripathi
8bddb73512
[Hardware][CPU] Multi-LoRA implementation for the CPU backend ( #11100 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-12 13:01:52 +00:00
Isotr0py
f967e51f38
[Model] Initialize support for Deepseek-VL2 models ( #11578 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-12 00:17:24 -08:00
Nicolò Lucchesi
d697dc01b4
[Bugfix] Fix RobertaModel loading ( #11940 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-11 14:05:09 +00:00
Cyrus Leung
a991f7d508
[Doc] Basic guide for writing unit tests for new models ( #11951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 21:27:24 +08:00
Cyrus Leung
7a3a83e3b8
[CI/Build] Move model-specific multi-modal processing tests ( #11934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 13:50:05 +08:00
youkaichao
899136b857
[ci] fix broken distributed-tests-4-gpus ( #11937 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-11 09:07:24 +08:00
Li, Jiang
aa1e77a19c
[Hardware][CPU] Support MOE models on x86 CPU ( #11831 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-10 11:07:58 -05:00
Harry Mellor
482cdc494e
[Doc] Rename offline inference examples ( #11927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 23:50:29 +08:00
youkaichao
241ad7b301
[ci] Fix sampler tests ( #11922 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-10 20:45:33 +08:00
Harry Mellor
d85c47d6ad
Replace "online inference" with "online serving" ( #11923 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 12:05:56 +00:00
Joe Runde
ac2f3f7fee
[Bugfix] Validate lora adapters to avoid crashing server ( #11727 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-10 15:56:36 +08:00
Chen Zhang
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary ( #11677 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-10 13:14:42 +08:00
Cyrus Leung
b844b99ad3
[VLM] Enable tokenized inputs for merged multi-modal processor ( #11900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 03:24:00 +00:00
Cyrus Leung
9a228348d2
[Misc] Provide correct Pixtral-HF chat template ( #11891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 10:19:37 -07:00
youkaichao
bd82872211
[ci]try to fix flaky multi-step tests ( #11894 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-09 14:47:29 +00:00
wangxiyuan
405eb8e396
[platform] Allow platform specify attention backend ( #11609 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-09 21:46:50 +08:00
Cyrus Leung
0bd1ff4346
[Bugfix] Override dunder methods of placeholder modules ( #11882 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 09:02:53 +00:00
Maximilien de Bayser
1fe554bac3
treat do_lower_case in the same way as the sentence-transformers library ( #11815 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-01-09 11:05:43 +08:00
Tyler Michael Smith
615e4a5401
[CI] Turn on basic correctness tests for V1 ( #10864 )
2025-01-08 21:20:44 -05:00
Robert Shaw
56fe4c297c
[TPU][Quantization] TPU `W8A8` ( #11785 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-08 19:33:29 +00:00
Harry Mellor
aba8d6ee00
[Doc] Move examples into categories ( #11840 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-08 13:09:53 +00:00
Cyrus Leung
2a0596bc48
[VLM] Reorganize profiling/processing-related code ( #11812 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 18:59:58 +08:00
youkaichao
889e662eae
[misc] improve memory profiling ( #11809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-08 06:36:03 +00:00
Cyrus Leung
8f37be38eb
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation ( #11800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 18:25:02 +08:00
Jee Jee Li
b278557935
[Kernel][LoRA]Punica prefill kernels fusion ( #11234 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
2025-01-07 04:01:39 +00:00
Cyrus Leung
08fb75c72e
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) ( #11772 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 01:10:54 +00:00
Roger Wang
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision ( #11685 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 19:58:16 +00:00
Chen Zhang
e20c92bb61
[Kernel] Move attn_type to Attention.__init__() ( #11690 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-07 00:11:28 +08:00
Jee Jee Li
32c9eff2ff
[Bugfix][V1] Fix molmo text-only inputs ( #11676 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-06 15:22:25 +00:00
Cyrus Leung
996357e480
[VLM] Separate out profiling-related logic ( #11746 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 16:02:21 +08:00
Rui Qiao
022c5c6944
[V1] Refactor get_executor_cls ( #11754 )
2025-01-06 07:59:16 +00:00
cennn
9e764e7b10
[distributed] remove pynccl's redundant change_state ( #11749 )
2025-01-06 09:05:48 +08:00
cennn
635b897246
[distributed] remove pynccl's redundant stream ( #11744 )
2025-01-05 23:09:11 +08:00
Jee Jee Li
47831430cc
[Bugfix][V1] Fix test_kv_cache_utils.py ( #11738 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-04 16:07:59 +00:00
Cyrus Leung
ba214dffbe
[Bugfix] Fix precision error in LLaVA-NeXT ( #11735 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 23:45:57 +08:00
Cyrus Leung
eed11ebee9
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision ( #11717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 11:40:53 +00:00
Yan Burman
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture ( #11233 )
...
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com>
Signed-off-by: Ido Asraff <idoa@atero.ai>
2025-01-04 14:50:16 +08:00
xcnick
d91457d529
[V1] Add kv cache utils tests. ( #11513 )
...
Signed-off-by: xcnick <xcnick0412@gmail.com>
2025-01-04 14:49:46 +08:00
Robert Shaw
80c751e7f6
[V1] Simplify Shutdown ( #11659 )
2025-01-03 17:25:38 +00:00
Aurick Qiao
e1a5c2f0a1
[Model] Whisper model implementation ( #11280 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-01-03 16:39:19 +08:00
Cyrus Leung
8c38ee7007
[VLM] Merged multi-modal processor for LLaVA-NeXT ( #11682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 16:39:27 +00:00
Cyrus Leung
a115ac46b5
[VLM] Move supported limits and max tokens to merged multi-modal processor ( #11669 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-01 15:44:42 +00:00
Woosuk Kwon
73001445fb
[V1] Implement Cascade Attention ( #11635 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
Jee Jee Li
11d8a091c6
[Misc] Optimize Qwen2-VL LoRA test ( #11663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-01 14:42:23 +08:00
Cyrus Leung
365801fedd
[VLM] Add max-count checking in data parser for single image models ( #11661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-31 22:15:21 -08:00
Joe Runde
4db72e57f6
[Bugfix][Refactor] Unify model management in frontend ( #11660 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-01 02:21:51 +00:00
Roger Wang
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-31 21:17:22 +00:00
Chen Zhang
8c3230d8c1
[V1] Simplify vision block hash for prefix caching by removing offset from hash ( #11646 )
2024-12-31 08:56:01 +00:00
sakunkun
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. ( #11565 )
2024-12-31 06:29:04 +00:00
John Giorgi
82c49d3260
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) ( #6909 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-30 22:15:58 -08:00
Michael Goin
74fa1d123c
[Bugfix] Fix OpenAI parallel sampling when using xgrammar ( #11637 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-31 03:43:54 +00:00
youkaichao
b12e87f942
[platforms] enable platform plugins ( #11602 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 20:24:45 +08:00
youkaichao
3682e33f9f
[v1] fix compilation cache ( #11598 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 04:24:12 +00:00
Robert Shaw
4fb8e329fd
[V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input ( #11545 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-28 20:51:57 +00:00
youkaichao
328841d002
[bugfix] interleaving sliding window for cohere2 model ( #11583 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-28 16:55:42 +00:00
Isotr0py
d34be24bb1
[Model] Support InternLM2 Reward models ( #11571 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-28 06:14:10 +00:00
Robert Shaw
df04dffade
[V1] [4/N] API Server: ZMQ/MP Utilities ( #11541 )
2024-12-28 01:45:08 +00:00
ErezSC42
55509c2114
[MODEL] LoRA support for Jamba model ( #11209 )
...
Signed-off-by: Erez Schwartz <erezs@ai21.com>
2024-12-27 17:58:21 +00:00
Cyrus Leung
101418096f
[VLM] Support caching in merged multi-modal processor ( #11396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 17:22:48 +00:00
Cyrus Leung
7af553ea30
[Misc] Abstract the logic for reading and writing media content ( #11527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 19:21:23 +08:00
Robert Shaw
46d4359450
[CI] Fix broken CI ( #11543 )
2024-12-26 18:49:16 -08:00
Woosuk Kwon
371d04d39b
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling ( #11394 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-27 09:32:38 +09:00
Michael Goin
2072924d14
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization ( #11523 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-26 15:33:30 -08:00
Cyrus Leung
eec906d811
[Misc] Add placeholder module ( #11501 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 13:12:51 +00:00
sroy745
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler ( #10681 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-26 19:02:58 +09:00
Jee Jee Li
aa25985bd1
[Misc][LoRA] Fix LoRA weight mapper ( #11495 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-26 15:52:48 +08:00
Cyrus Leung
51a624bf02
[Misc] Move some multimodal utils to modality-specific modules ( #11494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 04:23:20 +00:00
Jiaxin Shan
fc601665eb
[Misc] Update disaggregation benchmark scripts and test logs ( #11456 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-25 06:58:48 +00:00
Rui Qiao
9832e5572a
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor ( #11472 )
2024-12-24 19:49:46 -08:00
Cyrus Leung
3f3e92e1f2
[Model] Automatic conversion of classification and reward models ( #11469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 18:22:22 +00:00
Jee Jee Li
196c34b0ac
[Misc] Move weights mapper ( #11443 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 13:05:25 +00:00
Jee Jee Li
b1b1038fbd
[Bugfix] Fix Qwen2-VL LoRA weight loading ( #11430 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 09:56:10 +00:00
Cyrus Leung
9edca6bf8f
[Frontend] Online Pooling API ( #11457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 17:54:30 +08:00
Rui Qiao
a491d6f535
[V1] TP Ray executor ( #11107 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-12-23 23:00:12 +00:00
Michael Goin
63afbe9215
[CI] Expand OpenAI test_chat.py guided decoding tests ( #11048 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 18:35:38 +00:00
Dipika Sikka
8cef6e02dc
[Misc] add w8a8 asym models ( #11075 )
2024-12-23 13:33:20 -05:00
Michael Goin
5bfb30a529
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF ( #11389 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 23:06:20 +08:00
Jason T. Greene
f1d1bf6288
[Bugfix] Fix fully sharded LoRAs with Mixtral ( #11390 )
...
Signed-off-by: Jason Greene <jason.greene@redhat.com>
2024-12-22 23:25:10 +08:00
Roger Wang
29c748930e
[CI] Fix flaky entrypoint tests ( #11403 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-21 21:08:44 -08:00
omer-dayan
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as optional loader ( #10192 )
...
Signed-off-by: OmerD <omer@run.ai>
2024-12-20 16:46:24 +00:00
Wallas Henrique
86c2d8fd1c
[Bugfix] Fix spec decoding when seed is none in a batch ( #10863 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-12-20 05:15:31 +00:00
Isotr0py
e24113a8fe
[Model] Refactor Qwen2-VL to use merged multimodal processor ( #11258 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 16:28:00 +00:00
Yehoshua Cohen
6c7f881541
[Model] Add JambaForSequenceClassification model ( #10860 )
...
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 22:48:06 +08:00
Yanyi Liu
5aef49806d
[Feature] Add load generation config from model ( #11164 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-19 10:50:38 +00:00
Cyrus Leung
6142ef0ada
[VLM] Merged multimodal processor for Qwen2-Audio ( #11303 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 06:14:17 +00:00
Michael Goin
a30482f054
[CI] Expand test_guided_generate to test all backends ( #11313 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-19 04:00:38 +00:00
Travis Johnson
17ca964273
[Model] IBM Granite 3.1 ( #11307 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-19 11:27:24 +08:00
Tyler Michael Smith
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Joe Runde
ca5f54a9b9
[Bugfix] fix minicpmv test ( #11304 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-18 10:34:26 -08:00
Isotr0py
996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests ( #11263 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-18 10:16:40 -08:00
Dipika Sikka
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Wallas Henrique
8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral ( #11046 )
2024-12-17 22:34:08 -08:00
Cody Yu
bf8717ebae
[V1] Prefix caching for vision language models ( #11187 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-17 16:37:59 -08:00
Michael Goin
c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat ( #11264 )
2024-12-17 16:34:06 -08:00
Joe Runde
2d1b9baa8f
[Bugfix] Fix request cancellation without polling ( #11190 )
2024-12-17 12:26:32 -08:00
kYLe
66d4b16724
[Frontend] Add OpenAI API support for input_audio ( #11027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-16 22:09:58 -08:00
Michael Goin
0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse ( #10935 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-17 11:39:58 +08:00
youkaichao
551603feff
[core] overhaul memory profiling and fix backward compatibility ( #10511 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 13:32:25 -08:00
Isotr0py
d927dbcd88
[Model] Refactor Ultravox to use merged input processor ( #11198 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-16 10:09:53 +00:00
Jani Monoses
bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) ( #11203 )
2024-12-16 09:56:19 +00:00
Cyrus Leung
b10609e6a1
[Misc] Clean up multi-modal processor ( #11207 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 06:30:28 +00:00
Cyrus Leung
93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor ( #11199 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 17:52:18 +00:00
Brad Hilton
9c3dadd1c9
[Frontend] Add `logits_processors` as an extra completion argument ( #11150 )
...
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
2024-12-14 16:46:42 +00:00
Cyrus Leung
0920ab9131
[Doc] Reorganize online pooling APIs ( #11172 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 00:22:22 +08:00
Sungjae Lee
c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching ( #8240 )
2024-12-13 07:51:25 -08:00
Cyrus Leung
eeec9e3390
[Frontend] Separate pooling APIs in offline inference ( #11129 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-13 10:40:07 +00:00
youkaichao
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-13 06:57:50 +00:00
Pooya Davoodi
1efce68605
[Bugfix] Use runner_type instead of task in GritLM ( #11144 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-13 04:09:53 +00:00
Luka Govedič
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
Cody Yu
78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching ( #11151 )
2024-12-13 00:57:40 +00:00
Jiaxin Shan
85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name ( #11094 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-12 09:25:16 +00:00
youkaichao
62de37a38e
[core][distributed] initialization from StatelessProcessGroup ( #10986 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-12 09:04:19 +00:00
Pooya Davoodi
1da8f0e1dd
[Model] Add support for embedding model GritLM ( #10816 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-12 06:39:16 +00:00
Alexander Matveev
4e11683368
[V1] VLM preprocessor hashing ( #11020 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-12 00:55:30 +00:00
Cyrus Leung
d1e21a979b
[CI/Build] Split up VLM tests ( #11083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-12 06:18:16 +08:00
Cyrus Leung
8f10d5e393
[Misc] Split up pooling tasks ( #10820 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
Cyrus Leung
2e33fe4191
[CI/Build] Check transformers v4.47 ( #10991 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 05:02:02 +00:00
Mor Zusman
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-12-10 21:53:37 -05:00
Aurick Qiao
d5c5154fcf
[Misc] LoRA + Chunked Prefill ( #9057 )
2024-12-11 10:09:20 +08:00
Jee Jee Li
d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA ( #11003 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-10 11:12:01 +00:00
youkaichao
ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes ( #11031 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 22:35:36 -08:00
Tyler Michael Smith
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-10 06:28:14 +00:00
Isotr0py
a811dd6608
[Model] merged input processor for Phi-3-Vision models ( #10977 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-09 12:55:10 -08:00
Jee Jee Li
ca871491ed
[Misc][LoRA] Abstract PunicaWrapper ( #10955 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-09 12:54:44 -08:00
youkaichao
fd57d2b534
[torch.compile] allow candidate compile sizes ( #10984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:05:21 +00:00
zhou fan
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
Cyrus Leung
c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet ( #10975 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:20:49 +00:00
Cyrus Leung
39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model ( #10711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:10:05 +00:00
Cyrus Leung
955fa9533a
[3/N] Support and implement merged input processor for LLaVA model ( #10676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-07 00:50:58 -08:00
Cyrus Leung
222f5b082a
[CI/Build] Fix broken multimodal test ( #10950 )
2024-12-06 10:41:23 +00:00
youkaichao
9743d64e4e
[ci][build] add tests for python only compilation ( #10915 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 08:54:47 -08:00
Isotr0py
998eeafe58
[CI/Build] Bump test transformers version ( #10106 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 16:05:52 +00:00
Jee Jee Li
571da8fc43
[Misc][LoRA] Clean up the function interface of Punica ( #10917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:22:28 +00:00
Michael Goin
8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas ( #10899 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-05 11:14:06 +08:00
Xin Yang
01d079fd8e
[LoRA] Change lora_tokenizers capacity ( #10796 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-12-04 17:40:16 +00:00
Tyler Michael Smith
d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py ( #10876 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-04 03:23:21 +00:00
Alexander Matveev
3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process ( #10640 )
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-03 10:33:10 +00:00
Aaron Pham
9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default ( #10785 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-12-03 15:17:00 +08:00
youkaichao
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 06:19:02 +00:00
youkaichao
21fe7b481a
[core][distributed] add pynccl broadcast ( #10843 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 04:53:23 +00:00
Jee Jee Li
b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py ( #10829 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-02 17:53:36 +00:00
zhou fan
ef31eabc68
[Model]: add some tests for aria model ( #10770 )
...
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-02 05:36:36 +00:00
Woosuk Kwon
073a4bd1c0
[Kernel] Use `out` arg in flash_attn_varlen_func ( #10811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-01 17:55:39 -08:00
Kuntai Du
0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup ( #10502 )
...
This PR provides initial support for single-node disaggregated prefill in 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
2024-12-01 19:01:00 -06:00
Cyrus Leung
d2f058e76c
[Misc] Rename embedding classes to pooling ( #10801 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 14:36:51 +08:00
Cyrus Leung
133707123e
[Model] Replace embedding models with pooling adapter ( #10769 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 08:02:54 +08:00
Cyrus Leung
fa6ecb9aa7
[Model] Clean up MiniCPMV ( #10751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-29 04:47:06 +00:00
sixgod
5fc5ce0fe4
[Model] Added GLM-4 series hf format model support vllm==0.6.4 ( #10561 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-28 14:53:31 +00:00
Woosuk Kwon
a79b122400
[V1] Do not allocate beyond the max_model_len ( #10730 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 00:13:15 -08:00
Ricky Xu
d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly ( #10724 )
...
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-27 23:59:28 -08:00
tomeras91
395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server ( #10635 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-27 13:21:10 -08:00
Mor Zusman
197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models ( #10705 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-11-27 19:02:27 +00:00
youkaichao
308cc5e21e
[ci] fix slow tests ( #10698 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 09:26:14 -08:00
shunxing12345
1209261e93
[Model] Support telechat2 ( #10311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-27 11:32:35 +00:00
jeongin601
1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. ( #10198 )
...
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
2024-11-27 05:07:30 +00:00
Chendi.Xue
0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers ( #10587 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-26 17:57:11 -08:00
Murali Andoorveedu
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
youkaichao
334d64d1e8
[ci] add vllm_test_utils ( #10659 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-26 00:20:04 -08:00
Sage Moore
9a88f89799
custom allreduce + torch.compile ( #10121 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-25 22:00:16 -08:00
Ricky Xu
519e8e4182
[v1] EngineArgs for better config handling for v1 ( #10382 )
...
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-25 21:09:43 -08:00
Shane A
9db713a1dc
[Model] Add OLMo November 2024 model ( #10503 )
2024-11-25 17:26:40 -05:00
zhou fan
b1d920531f
[Model]: Add support for Aria model ( #10514 )
...
Signed-off-by: xffxff <1247714429@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-25 18:10:55 +00:00
Wallas Henrique
c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices ( #9850 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-11-25 12:23:32 -05:00
Chauncey
d04b13a380
[Bug]: Authorization ignored when root_path is set ( #10606 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-25 16:21:41 +00:00
Cyrus Leung
ed46f14321
[Model] Support `is_causal` HF config field for Qwen2 model ( #10621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 09:51:20 +00:00
youkaichao
05d1f8c9c6
[misc] move functions to config.py ( #10624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
youkaichao
25d806e953
[misc] add torch.compile compatibility check ( #10618 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:40:08 -08:00
youkaichao
571841b7fc
[torch.compile] support encoder based models ( #10613 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 05:24:33 +00:00
Maximilien de Bayser
214efc2c3c
Support Cross encoder models ( #10400 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-24 18:56:20 -08:00
Jee Jee Li
1700c543a5
[Bugfix] Fix LoRA weight sharding ( #10450 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-23 17:23:17 -08:00
Cyrus Leung
c8acd80548
[2/N] handling placeholders in merged multi-modal processor ( #10485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-22 21:25:09 -08:00
Ricky Xu
4634a89d18
Prefix Cache Aware Scheduling [1/n] ( #10128 )
...
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-22 21:15:55 -08:00
Varun Vinayak Shenoy
7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. ( #10567 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
2024-11-22 21:13:29 -08:00
youkaichao
4aba6e3d1a
[core] gemma2 full context length support ( #10584 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 20:13:54 -08:00
Tyler Michael Smith
978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter ( #9432 )
2024-11-22 22:14:03 -05:00
Travis Johnson
9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use ( #10164 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-23 10:17:38 +08:00
Ricky Xu
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids ( #10507 )
...
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-22 23:27:25 +00:00
youkaichao
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
youkaichao
db100c5cde
[bugfix] fix full graph tests ( #10581 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 10:02:14 -08:00
youkaichao
33e0a2540a
[9/N] torch.compile LLM usage ( #10552 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 19:13:31 -08:00
youkaichao
7560ae5caf
[8/N] enable cli flag without a space ( #10529 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 12:30:42 -08:00
Jee Jee Li
2385b60d83
[Kernel] Register punica ops directly ( #10522 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-21 09:18:11 -08:00
Chauncey
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored ( #10180 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
Isotr0py
d5ec121f95
[Model] Expose `dynamic_image_size` as mm_processor_kwargs for InternVL2 models ( #10518 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-21 14:20:08 +00:00
Luka Govedič
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
...
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
Pavani Majety
6c1208d083
[Core] Add Sliding Window Support with Flashinfer ( #10462 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
youkaichao
388ee3de66
[torch.compile] limit inductor threads and lazy import quant ( #10482 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
Guillaume Calmettes
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server ( #10463 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
youkaichao
0cd3d9717e
[7/N] torch.compile, reduce compilation time ( #10460 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 11:20:38 -08:00
Li, Jiang
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU ( #10355 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 10:57:39 +00:00
Lucas Wilkinson
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
ElizaWszola
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
youkaichao
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:09:03 -08:00
Mengqing Cao
8c1fb50705
[Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` ( #10358 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
Lucas Wilkinson
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
Yan Ma
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend ( #10107 )
...
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-18 11:18:05 -07:00
lkchen
c7dec926f6
[VLM] Report multi_modal_placeholders in output ( #10407 )
...
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com>
2024-11-18 16:06:16 +08:00
youkaichao
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 18:02:14 -08:00
电脑星人
361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled ( #10388 )
...
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 02:10:00 +08:00
Cyrus Leung
32e46e000f
[Frontend] Automatic detection of chat content format from AST ( #9919 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-16 13:35:40 +08:00
ElizaWszola
79ee45b428
[Misc] Bump up test_fused_moe tolerance ( #10364 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
Cyrus Leung
b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code ( #10349 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 09:34:17 +00:00
Cyrus Leung
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 06:59:00 +00:00
Cyrus Leung
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
Luka Govedič
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
Cyrus Leung
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-14 16:55:54 -08:00
Patrick von Platen
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
Maximilien de Bayser
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-14 21:23:29 +00:00
youkaichao
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-14 00:23:39 -08:00
Mike Depinet
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai>
2024-11-14 04:14:34 +00:00
Isotr0py
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-14 10:54:59 +08:00
HoangCongDuc
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>
2024-11-13 09:56:39 -07:00
Cyrus Leung
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-13 12:39:03 +00:00
Austin Veselka
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai>
Co-authored-by: FurtherAI <austin.veselka@lighton.ai>
2024-11-13 08:28:13 +00:00
电脑星人
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-13 07:07:22 +00:00
youkaichao
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-12 17:36:08 -08:00
Woosuk Kwon
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 16:17:20 -08:00
Umesh
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com>
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com>
2024-11-12 11:08:40 -08:00
sroy745
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-12 10:53:57 -08:00
zifeitong
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
Jee Jee Li
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-12 11:10:15 +08:00
youkaichao
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 18:01:06 -08:00
Robert Shaw
6ace6fba2c
[V1] `AsyncLLM` Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-11 23:05:38 +00:00
youkaichao
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 11:54:59 -08:00
youkaichao
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:10:27 -08:00
youkaichao
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 09:02:14 -08:00
Jee Jee Li
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-11 09:43:23 +00:00
Isotr0py
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-11 08:54:28 +00:00
youkaichao
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 00:10:53 +00:00
Cyrus Leung
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:39:14 -08:00
Krishna Mandal
b09895a618
[Frontend][Core] Override HF `config.json` via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 16:19:27 +00:00
bnellnm
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com>
2024-11-09 08:01:27 +00:00
Isotr0py
47672f38b5
[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-09 04:02:59 +00:00
Cyrus Leung
e0191a95d8
[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:31:02 +08:00
rasmith
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-08 19:59:22 -05:00
Luka Govedič
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Florian Zimmermeister
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
sroy745
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-08 15:56:18 +00:00
Cyrus Leung
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-08 23:30:04 +08:00
Isotr0py
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-08 09:56:58 +00:00
Cody Yu
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-07 17:34:44 -08:00
litianjian
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 20:25:59 +00:00
Russell Bryant
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 18:17:29 +00:00
Nicolò Lucchesi
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2024-11-07 08:15:14 -08:00
Maximilien de Bayser
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 07:09:02 -08:00
Flávia Béo
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 08:42:40 +00:00
Hanzhi Zhou
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>
2024-11-06 23:50:47 -08:00
Cyrus Leung
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 06:00:21 +00:00
Li, Jiang
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 04:43:08 +00:00
youkaichao
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-06 16:42:09 -08:00
Joe Runde
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Jee Jee Li
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
2024-11-06 11:41:17 +00:00
Isotr0py
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-06 08:50:37 +00:00
Aaron Pham
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
youkaichao
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 22:35:03 -08:00
Travis Johnson
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-06 04:28:29 +00:00
Sungjae Lee
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-11-06 01:45:45 +00:00
Wallas Henrique
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-11-06 00:39:26 +00:00
youkaichao
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 14:49:20 -08:00
Michael Goin
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:32 -05:00
Michael Goin
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:23 -05:00
Cyrus Leung
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-05 10:07:31 +08:00
tomeras91
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-04 22:53:24 +00:00
hissu-hyvarinen
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-11-04 11:37:46 -08:00
Robert Shaw
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
2024-11-04 11:34:26 -08:00
Chauncey
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-04 15:34:57 +00:00
shanshan wang
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-11-04 00:15:36 +00:00
youkaichao
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
sroy745
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
Peter Salas
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
Pavani Majety
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
Travis Johnson
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2024-11-01 10:33:15 -07:00
Cyrus Leung
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
Michael Goin
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
Cyrus Leung
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
Yongzao
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-01 00:25:47 -07:00
youkaichao
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 22:20:17 -07:00
youkaichao
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 21:56:09 -07:00
Joe Runde
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-01 01:09:46 +00:00
Mor Zusman
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
sasha0552
55650c83a0
[Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org>
2024-10-31 11:46:36 -07:00
Alex Brooks
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-31 09:10:52 -07:00
Guillaume Calmettes
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
youkaichao
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-30 16:34:22 -07:00
Yongzao
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-30 14:52:05 -07:00
Joe Runde
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-30 09:34:07 -07:00
Elfie Guo
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
Alex Brooks
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-30 09:32:17 -07:00
youkaichao
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-29 23:03:49 -07:00
Will Eaton
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
2024-10-29 15:07:37 -07:00
Joe Runde
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-29 14:13:20 -07:00
Michael Goin
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
wangshuai09
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
Zhong Qishuai
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai Ferdinandzhong@gmail.com
2024-10-29 11:49:47 +00:00
litianjian
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-28 18:04:10 +00:00
youkaichao
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
wangshuai09
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
madt2709
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
bnellnm
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-27 09:44:24 +00:00
kakao-kevin-us
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com>
Co-authored-by: Kevin-Yang <ykcha9@gmail.com>
2024-10-26 17:53:35 +00:00
Vasiliy Alekseev
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru>
2024-10-26 16:29:38 +00:00
Mengqing Cao
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
Kevin H. Luu
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
Charlie Fu
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Alex Brooks
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-24 10:42:24 -07:00
Cyrus Leung
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
Yongzao
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 09:31:42 -07:00
Jee Jee Li
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
Yongzao
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 00:54:57 -07:00
youkaichao
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-10-24 00:16:44 -07:00
Cyrus Leung
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
Vinay R Damodaran
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2024-10-24 05:05:49 +00:00
Yunfei Chu
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-23 17:54:22 +00:00
Alex Brooks
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-23 17:28:57 +00:00
Alex Brooks
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-23 14:05:18 +00:00
Isotr0py
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
Cyrus Leung
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
yulei
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
Ronen Schaffer
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
Isotr0py
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
Jee Jee Li
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
wangshuai09
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
Wallas Henrique
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-22 04:52:14 +00:00
Cyrus Leung
f085995a7b
[CI/Build] Remove unnecessary `fork_new_process` ( #9484 )
2024-10-21 19:47:29 -07:00
Travis Johnson
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-21 19:46:24 -07:00
youkaichao
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
Joe Runde
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-21 17:10:56 -07:00
Wallas Henrique
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-21 14:49:41 -07:00
Dhia Eddine Rhaiem
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
Cyrus Leung
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-20 21:27:50 -07:00
Chen Zhang
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
Chen Zhang
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
Yue Zhang
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671
[Bugfix] Fix offline mode when using `mistral_common` ( #9457 )
2024-10-18 18:12:32 -07:00
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
tomeras91
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Robert Shaw
343f8e0905
Support `BERTModel` (first `encoder-only` embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
bnellnm
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
Luka Govedič
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removes block manager v1. This is the first step toward a prefix-caching-centric design: to get there, the code path must be simplified so that only the v2 block manager is used (it performs much better on prefix caching).
2024-10-17 11:38:15 -05:00
Mor Zusman
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
Cyrus Leung
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
Cyrus Leung
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
Cyrus Leung
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
Chang Su
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-15 15:40:43 -07:00
Michael Goin
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-15 15:40:25 -07:00
Nick Hill
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
Xiang Xu
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
Lily Liu
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
sixgod
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Jee Jee Li
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
Lucas Wilkinson
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
Michael Goin
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
youkaichao
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
chenqianfzh
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
bnellnm
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
Daniele
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
Alex Brooks
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:31:26 +00:00
Alex Brooks
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:12:56 +00:00
Brendan Wong
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
youkaichao
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
Isotr0py
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
Cyrus Leung
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
Varun Sundar Rabindranath
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-06 12:48:11 -07:00
Cyrus Leung
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-10-06 16:35:27 +08:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Andy Dai
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
Chen Zhang
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
Xin Yang
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
ElizaWszola
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
Flávia Béo
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-10-04 18:31:40 +00:00
Roger Wang
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:38:25 -07:00
Prashant Gupta
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
2024-10-04 16:24:40 +00:00
Cyrus Leung
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
Murali Andoorveedu
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:56:58 +08:00
代君
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
sroy745
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
youkaichao
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
xendo
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>
2024-10-03 18:02:07 +00:00
Guillaume Calmettes
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
Shawn Tan
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-03 09:33:57 +08:00
afeldman-nm
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
2024-10-02 07:52:20 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
Isotr0py
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
Joe Runde
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-01 09:34:25 +08:00
Lily Liu
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
Mor Zusman
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
danieljannai21
6c9ba48fde
[Frontend] Added support for HF's new `continue_final_message` parameter ( #8942 )
2024-09-29 17:59:47 +00:00
Jee Jee Li
3d49776bbb
[Model][LoRA] LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
Cyrus Leung
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
ElizaWszola
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
sroy745
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
Cyrus Leung
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-28 09:54:35 -07:00
Varun Sundar Rabindranath
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
youkaichao
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
Isotr0py
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` ( #8892 )
2024-09-27 01:15:58 -07:00
Cyrus Leung
3b00b9c26c
[Core] rename `PromptInputs` and `inputs` ( #8876 )
2024-09-26 20:35:15 -07:00
Maximilien de Bayser
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-26 17:01:42 -07:00
Nick Hill
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
Chirag Jain
ee2da3e9ef
fix validation: Only set tool_choice `auto` if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 )" ( #8810 )
2024-09-25 10:36:26 -07:00
Michael Goin
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
sroy745
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
sroy745
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
Joe Runde
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Jee Jee Li
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
Andy
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-09-24 09:44:11 -07:00
Alex Brooks
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-24 07:36:46 +00:00
Simon Mo
3185fb0cca
Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" ( #8750 )
2024-09-24 05:45:20 +00:00
youkaichao
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-09-23 22:08:12 -07:00
sroy745
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
jiqing-feng
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
Jee Jee Li
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Daniele
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-23 09:44:26 -07:00
Yanyi Liu
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
Alex Brooks
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-23 07:44:48 +00:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
litianjian
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-22 10:51:44 -07:00
youkaichao
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
Cyrus Leung
0455c46ed4
[Core] Factor out common code in `SequenceData` and `Sequence` ( #8675 )
2024-09-21 02:30:39 +00:00
Cyrus Leung
0057894ef7
[Core] Rename `PromptInputs` and `inputs` ( #8673 )
2024-09-20 19:00:54 -07:00
Patrick von Platen
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
sroy745
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
afeldman-nm
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
youkaichao
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
2024-09-17 16:24:06 -07:00
Patrick von Platen
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
youkaichao
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
ywfang
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-14 14:50:26 +00:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd] Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
Nick Hill
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
Cyrus Leung
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
youkaichao
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
Isotr0py
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
Cyrus Leung
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
shangmingc
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
Patrick von Platen
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
Roger Wang
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
William Lin
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
Joe Runde
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-12 11:11:57 -07:00
Alex Brooks
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:54 -07:00
Isotr0py
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:35 -07:00
youkaichao
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
Cody Yu
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
Patrick von Platen
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-11 14:41:55 -07:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-07 13:03:16 -07:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to `benchmark_serving.py` ( #8191 )
2024-09-06 09:01:14 -07:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
Woosuk Kwon
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
Dipika Sikka
2188a60c7e
[Misc] Update `GPTQ` to use `vLLMParameters` ( #7976 )
2024-09-03 17:21:44 -04:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. ( #7874 )
2024-09-02 14:20:12 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing `max_model_len` for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
665304092d
[Misc] Update `qqq` to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` ( #7836 )
2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update `marlin` to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use `vLLMParameter` ( #7437 )
2024-08-22 08:36:18 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
Antoni Baum
3b682179dd
[Core] Add `AttentionState` abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor `MultiModalConfig` initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 ( #7440 )
2024-08-16 13:46:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use `torch.testing.assert_close` ( #7324 )
2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm ( #5049 )
...
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert `compressed-tensors` code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] `compressed-tensors` code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` ( #7422 )
2024-08-13 17:08:20 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update `gptq_marlin` to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in `AsyncLLMEngine` ( #7336 )
2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` ( #5874 )
2024-08-07 09:17:58 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
Alphi
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-05 09:23:17 +08:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq` ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Cyrus Leung
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00
HandH1998
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` ( #6836 )
2024-07-31 10:38:45 +08:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
Sanger Steel
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing ( #6881 )
2024-07-30 11:48:50 -07:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel ( #6698 )
2024-07-30 10:40:08 -07:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Nick Hill
9f69d8245a
[Frontend] New `allowed_token_ids` decoding request parameter ( #6753 )
2024-07-29 23:37:27 +00:00
Thomas Parnell
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. ( #6786 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Peng Guanwen
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None ( #6532 )
2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Isotr0py
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models ( #6514 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Cyrus Leung
1ad86acf17
[Model] Initial support for BLIP-2 ( #5920 )
...
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Joe
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
Sanger Steel
969d032265
[Bugfix]: Fix Tensorizer test failures ( #6835 )
2024-07-26 20:02:25 -07:00
youkaichao
443c7cf4cf
[ci][distributed] fix flaky tests ( #6806 )
2024-07-25 17:44:09 -07:00
Michael Goin
65b1f121c8
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
Chang Su
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py ( #6755 )
2024-07-24 22:48:07 -07:00
Cody Yu
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph ( #6757 )
2024-07-24 22:33:56 -07:00
Alphi
9e169a4c61
[Model] Adding support for MiniCPM-V ( #4087 )
2024-07-24 20:59:30 -07:00
Evan Z. Liu
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
Michael Goin
421e218b37
[Bugfix] Bump transformers to 4.43.2 ( #6752 )
2024-07-24 13:22:16 -07:00
Antoni Baum
0e63494cf3
Add fp8 support to `reshape_and_cache_flash` ( #6667 )
2024-07-24 18:36:52 +00:00
Nick Hill
2cf0df3381
[Bugfix] Fix speculative decode seeded test ( #6743 )
2024-07-24 08:58:31 -07:00
Nick Hill
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model ( #6714 )
2024-07-24 07:34:22 +00:00
William Lin
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP ( #6708 )
2024-07-24 01:49:44 +00:00
dongmao zhang
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Thomas Parnell
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-23 23:05:05 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
Roger Wang
1bedf210e3
Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon ( #6690 )
2024-07-23 13:47:48 -07:00
Yehoshua Cohen
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 ( #6652 )
2024-07-23 11:41:55 -07:00
Roger Wang
22fa2e35cb
[VLM][Model] Support image input for Chameleon ( #6633 )
2024-07-22 23:50:48 -07:00
Cyrus Leung
97234be0ec
[Misc] Manage HTTP connections in one place ( #6600 )
2024-07-22 21:32:02 -07:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
Jiaxin Shan
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
sroy745
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
Cyrus Leung
d7f4178dd9
[Frontend] Move chat utils ( #6602 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-21 08:38:17 +08:00
Matt Wong
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes ( #6543 )
2024-07-20 09:39:07 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors ( #6541 )
2024-07-20 04:17:24 +00:00
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Robert Shaw
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token `fp8` ( #6547 )
2024-07-19 23:08:15 +00:00
Thomas Parnell
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection ( #6578 )
2024-07-19 14:01:03 -07:00
Antoni Baum
9ed82e7074
[Misc] Small perf improvements ( #6520 )
2024-07-19 12:10:56 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 ( #6369 )
2024-07-19 06:01:09 -07:00
Cyrus Leung
6366efc67b
[Bugfix][Frontend] Fix missing `/metrics` endpoint ( #6463 )
2024-07-19 03:55:13 +00:00
Thomas Parnell
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. ( #6034 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
Nick Hill
b5672a112c
[Core] Multiprocessing Pipeline Parallel support ( #6130 )
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
youkaichao
f53b8f0d05
[ci][test] add correctness test for cpu offloading ( #6549 )
2024-07-18 23:41:06 +00:00
Nick Hill
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-18 15:13:30 +08:00
Cody Yu
b5af8c223c
[Model] Pipeline parallel support for Mixtral ( #6516 )
2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
Antoni Baum
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage ( #6354 )
2024-07-17 19:40:10 +00:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 ( #6164 )
2024-07-17 09:37:16 -07:00
Murali Andoorveedu
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP ( #6495 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-17 08:25:10 +00:00
Cyrus Leung
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use `vllm serve` ( #6431 )
2024-07-17 07:43:21 +00:00
youkaichao
7f62077af5
[misc][distributed] improve tests ( #6488 )
2024-07-16 17:35:52 -07:00
youkaichao
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test ( #6410 )
2024-07-16 15:44:22 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` ( #6081 )
2024-07-16 15:31:32 -07:00
Cody Yu
160e1d8c99
[Misc] Log spec decode metrics ( #6454 )
2024-07-16 20:37:10 +00:00
Cyrus Leung
38ef94888a
[CI/Build] Remove "boardwalk" image asset ( #6460 )
2024-07-16 08:59:36 -07:00
sasha0552
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint ( #5923 )
2024-07-16 20:18:09 +08:00
Cyrus Leung
d97011512e
[CI/Build] vLLM cache directory for images ( #6444 )
2024-07-15 23:12:25 -07:00
Joe
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
Mor Zusman
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug ( #6425 )
...
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-07-16 01:32:55 +00:00
Thomas Parnell
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF ( #6409 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-15 13:14:49 -04:00
youkaichao
69672f116c
[core][distributed] simplify code to support pipeline parallel ( #6406 )
2024-07-14 21:20:51 -07:00
zifeitong
b47008b4d2
[BugFix] BatchResponseData body should be optional ( #6345 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-15 04:06:09 +00:00
Ethan Xu
dbfe254eda
[Feature] vLLM CLI ( #5090 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
Isotr0py
540c0368b1
[Model] Initialize Fuyu-8B support ( #3924 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-14 05:27:14 +00:00
youkaichao
41708e5034
[ci] try to add multi-node tests ( #6280 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-12 21:51:48 -07:00
Michael Goin
111fc6e7ec
[Misc] Add generated git commit hash as `vllm.__commit__` ( #6386 )
2024-07-12 22:52:15 +00:00
Yihuan Bu
b039cbbce3
[Misc] add fixture to guided processor tests ( #6341 )
2024-07-12 09:55:39 -07:00
Cyrus Leung
024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma ( #6367 )
2024-07-12 08:22:18 -07:00
Robert Shaw
aea19f0989
[ Misc ] Support Models With Bias in `compressed-tensors` integration ( #6356 )
2024-07-12 11:11:29 -04:00
Hongxia Yang
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm ( #6352 )
2024-07-11 21:30:46 -07:00
Lily Liu
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI ( #6351 )
2024-07-12 01:32:06 +00:00
Robert Shaw
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization ( #6223 )
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com>
2024-07-11 22:02:29 +00:00
xwjiang2010
1df43de9bb
[bug fix] Fix llava next feature size calculation. ( #6339 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-07-11 17:21:10 +00:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
sroy745
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
youkaichao
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
...
[core][distributed] add zmq fallback for broadcasting large objects (#6183 )
2024-07-09 18:49:11 -07:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via `llm-compressor` ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
jvlunteren
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API ( #5742 )
2024-07-05 10:37:09 -07:00
Lily Liu
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Cyrus Leung
3dd507083f
[CI/Build] Cleanup VLM tests ( #6107 )
2024-07-03 18:58:18 -07:00
Robert Shaw
62963d129e
[ Misc ] Clean Up `CompressedTensorsW8A8` ( #6113 )
2024-07-03 22:50:08 +00:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
SangBin Cho
d18bab3587
[CI] Fix base url doesn't strip "/" ( #6087 )
2024-07-02 21:31:25 -07:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Mor Zusman
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Robert Shaw
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-02 21:54:35 +00:00
Robert Shaw
4d26d806e1
Update conftest.py ( #6076 )
2024-07-02 20:14:22 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010
98d6682cd1
[VLM] Remove `image_input_type` from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
Avshalom Manevich
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
SangBin Cho
f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. ( #5909 )
...
Co-authored-by: sang <sangcho@anyscale.com>
2024-06-30 17:11:15 +00:00
llmpros
c6c240aa0a
[Frontend]: Support base64 embedding ( #5935 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-30 23:53:00 +08:00
youkaichao
2be6955a3f
[ci][distributed] fix device count call
...
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991 )
2024-06-30 08:06:13 +00:00
Cyrus Leung
9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests ( #5966 )
2024-06-30 12:58:49 +08:00
Cyrus Leung
cff6a1fec1
[CI/Build] Reuse code for checking output consistency ( #5988 )
2024-06-30 11:44:25 +08:00
Matt Wong
9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests ( #5949 )
2024-06-29 12:47:58 -07:00
Cyrus Leung
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
Robert Shaw
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cyrus Leung
51e971d39e
[Bugfix] Support `eos_token_id` from `config.json` ( #5954 )
2024-06-29 11:19:02 +00:00
Woosuk Kwon
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
Joe Runde
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
William Lin
906a19cdb0
[Misc] Extend vLLM Metrics logging API ( #5925 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-29 10:36:06 +08:00
Lily Liu
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
2024-06-28 15:28:49 -07:00
Tyler Michael Smith
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
Cody Yu
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
xwjiang2010
b90d8cd832
[Distributed] Make it clear that % should not be in tensor dict keys. ( #5927 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-06-28 15:20:22 +00:00
Cyrus Leung
3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests ( #5904 )
2024-06-28 07:59:18 -07:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Roger Wang
736ed38849
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test ( #5922 )
2024-06-27 11:43:04 -07:00
Cyrus Leung
e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests ( #5526 )
2024-06-27 12:43:17 +00:00
xwjiang2010
d12af207d2
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly ( #5880 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-06-27 15:15:24 +08:00
sasha0552
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Cyrus Leung
6984c02a27
[CI/Build] Refactor image test assets ( #5821 )
2024-06-26 01:02:34 -07:00
youkaichao
515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full ( #5801 )
2024-06-25 21:56:02 -07:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
Thomas Parnell
c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator ( #5791 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-26 00:04:08 +00:00
Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
Dipika Sikka
dd248f7675
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` ( #5794 )
2024-06-25 19:23:35 +00:00
Michael Goin
d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser ( #5798 )
2024-06-25 12:18:03 -07:00
Antoni Baum
67882dbb44
[Core] Add fault tolerance for `RayTokenizerGroupPool` ( #5748 )
2024-06-25 10:15:10 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
Murali Andoorveedu
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
youkaichao
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
Jee Li
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Chang Su
c35e4a3dd7
[BugFix] Fix test_phi3v.py ( #5725 )
2024-06-21 04:45:34 +00:00
Jinzhen Lin
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
Cyrus Leung
3730a1c832
[Misc] Improve conftest ( #5681 )
2024-06-19 19:09:21 -07:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
youkaichao
d571ca0108
[ci][distributed] add tests for custom allreduce ( #5689 )
2024-06-19 20:16:04 +00:00
Thomas Parnell
e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. ( #5659 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-19 06:03:55 +00:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
I've also added a Markdown document with screenshots to guide users on how to use this feature.
2024-06-19 01:17:03 +09:00