466 Commits

Author SHA1 Message Date
Yunfei Chu fc6c274626
[Model] Add Qwen2-Audio model support (#9248)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-23 17:54:22 +00:00
Alex Brooks 31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs (#9612)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-23 14:05:18 +00:00
Isotr0py 3ff57ebfca
[Model] Initialize Florence-2 language backbone support (#9555) 2024-10-23 10:42:47 +00:00
Cyrus Leung 831540cf04
[Model] Support E5-V (#9576) 2024-10-23 11:35:29 +08:00
Cody Yu d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Michael Goin 3921a2f29e
[Model] Support Pixtral models in the HF Transformers format (#9036) 2024-10-18 13:29:56 -06:00
Cyrus Leung 051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Tyler Michael Smith ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py (#9505) 2024-10-18 16:59:19 +00:00
Kuntai Du 81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Lucas Wilkinson 9d30a056e7
[misc] CUDA Time Layerwise Profiler (#8337)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-10-17 10:36:09 -04:00
Roger Wang 59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models (#9412)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-16 11:20:51 +00:00
Cyrus Leung 7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM (#9303) 2024-10-16 14:31:00 +08:00
Xiang Xu f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images (#9095) 2024-10-14 15:24:26 -07:00
Reza Salehi dfe43a2071
[Model] Molmo vLLM Integration (#9016)
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-10-14 07:56:24 -07:00
sixgod 6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 (#9242) 2024-10-11 17:36:13 +00:00
Alex Brooks a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:12:56 +00:00
Cyrus Leung 151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT (#9045)
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2024-10-07 11:55:12 +00:00
youkaichao 18b296fdb2
[core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
Andy Dai 05c531be47
[Misc] Improved prefix cache example (#9077) 2024-10-04 21:38:42 +00:00
代君 3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (#8405) 2024-10-04 10:36:39 +08:00
Travis Johnson 19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none (#9007) 2024-10-03 03:04:17 +00:00
Isotr0py 2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (#8946) 2024-09-30 13:01:20 +08:00
Cyrus Leung e1a3f5e831
[CI/Build] Update models tests & examples (#8874)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-28 09:54:35 -07:00
Maximilien de Bayser 344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use (#8343)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-26 17:01:42 -07:00
Chen Zhang 770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model (#8811)
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Jee Jee Li 13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 (#8768) 2024-09-24 17:08:55 -07:00
Andy 2529d09b5a
[Frontend] Batch inference for llm.chat() API (#8648)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-09-24 09:44:11 -07:00
Alex Brooks 8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-24 07:36:46 +00:00
litianjian 5b59532760
[Model][VLM] Add LLaVA-Onevision model support (#8486)
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-22 10:51:44 -07:00
Alex Brooks 8ca5051b9a
[Misc] Use NamedTuple in Multi-image example (#8705)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-22 20:56:20 +08:00
Patrick von Platen a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
William Lin ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354) 2024-09-12 21:30:00 -07:00
Roger Wang 360ddbd37e
[Misc] Update Pixtral example (#8431) 2024-09-12 17:31:18 -07:00
Alex Brooks c6202daeed
[Model] Support multiple images for qwen-vl (#8247)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:54 -07:00
Patrick von Platen d394787e52
Pixtral (#8377)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-11 14:41:55 -07:00
Yang Fan 3b7fea770f
[Model][VLM] Add Qwen2-VL model support (#7905)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Yangshen⚡Deng 6a512a00df
[model] Support for Llava-Next-Video model (#7559)
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) 2024-09-11 00:38:40 -04:00
Isotr0py e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201) 2024-09-07 16:38:23 +08:00
Kyle Mistele 41e95c5247
[Bugfix] Fix Hermes tool call chat template bug (#8256)
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-07 10:49:01 +08:00
William Lin 12dd715807
[misc] [doc] [frontend] LLM torch profiler support (#7943) 2024-09-06 17:48:48 -07:00
Dipika Sikka 23f322297f
[Misc] Remove `SqueezeLLM` (#8220) 2024-09-06 16:29:03 -06:00
Alex Brooks 9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
Cyrus Leung 288a938872
[Doc] Indicate more information about supported modalities (#8181) 2024-09-05 10:51:53 +00:00
Harsha vardhan manoj Bikki 008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… (#8062)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-09-04 16:33:43 -07:00
Kyle Mistele e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649)
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
Peter Salas 2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks (#7963) 2024-09-04 04:38:21 +00:00
Roger Wang 5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Harsha vardhan manoj Bikki 257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. (#7885)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-08-29 13:58:14 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Peter Salas 57792ed469
[Doc] Fix incorrect docs from #7615 (#7788) 2024-08-22 10:02:06 -07:00
Peter Salas 1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Ronen Schaffer 2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266)
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266)
2024-08-20 10:02:21 -07:00
nunjunj 3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
Pooya Davoodi 249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Isotr0py 67abdbb42f
[VLM][Doc] Add `stop_token_ids` to InternVL example (#7354) 2024-08-09 14:51:04 +00:00
Cyrus Leung 7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Jee Jee Li 757ac70a64
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273) 2024-08-08 14:02:41 +00:00
afeldman-nm fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Isotr0py 360bd67cf0
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Jungho Christopher Cho c0d8f1636c
[Model] SiglipVisionModel ported from transformers (#6942)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-05 06:22:12 +00:00
Isotr0py 7cbd9ec7a9
[Model] Initialize support for InternVL2 series models (#6514)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
Cyrus Leung 1ad86acf17
[Model] Initial support for BLIP-2 (#5920)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Wang Ran (汪然) a57d75821c
[bugfix] make args.stream work (#6831) 2024-07-27 09:07:02 +00:00
Roger Wang 925de97e05
[Bugfix] Fix VLM example typo (#6859) 2024-07-27 14:24:08 +08:00
Roger Wang aa46953a20
[Misc][VLM][Doc] Consolidate offline examples for vision language models (#6858)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-07-26 22:44:13 -07:00
Gurpreet Singh Dhami b5f49ee55b
Update README.md (#6847) 2024-07-27 00:26:45 +00:00
Alphi b75e314fff
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V (#6787)
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-25 09:42:49 -07:00
Chang Su 316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) 2024-07-24 22:48:07 -07:00
Alphi 9e169a4c61
[Model] Adding support for MiniCPM-V (#4087) 2024-07-24 20:59:30 -07:00
youkaichao c051bfe4eb
[doc][distributed] doc for setting up multi-node environment (#6529)
[doc][distributed] add more doc for setting up multi-node environment (#6529)
2024-07-22 21:22:09 -07:00
youkaichao 1c27d25fb5
[core][model] yet another cpu offload implementation (#6496)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-17 20:54:35 -07:00
Cyrus Leung 5bf35a91e4
[Doc][CI/Build] Update docs and tests to use `vllm serve` (#6431) 2024-07-17 07:43:21 +00:00
Cyrus Leung d97011512e
[CI/Build] vLLM cache directory for images (#6444) 2024-07-15 23:12:25 -07:00
Woosuk Kwon 4552e37b55
[CI/Build][TPU] Add TPU CI test (#6277)
Co-authored-by: kevin <kevin@anyscale.com>
2024-07-15 14:31:16 -07:00
Isotr0py 540c0368b1
[Model] Initialize Fuyu-8B support (#3924)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-14 05:27:14 +00:00
Roger Wang 6206dcb29e
[Model] Add PaliGemma (#5189)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
Christian Rohmann 0097bb1829
[Bugfix] Use templated datasource in grafana.json to allow automatic imports (#6136)
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
2024-07-05 09:49:47 -07:00
xwjiang2010 d9e98f42e4
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
Cyrus Leung 9831aec49f
[Core] Dynamic image size support for VLMs (#5276)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
xwjiang2010 98d6682cd1
[VLM] Remove `image_input_type` from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
youkaichao 8893130b63
[doc][misc] further lower visibility of simple api server (#6041)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-01 10:50:56 -07:00
Roger Wang 329df38f1a
[Misc] Update Phi-3-Vision Example (#5981)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-29 14:34:29 +08:00
Tyler Michael Smith 5d2a1a9cf0
Unmark more files as executable (#5962) 2024-06-28 17:34:56 -04:00
Cyrus Leung 5cbe8d155c
[Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Cyrus Leung 6eabc6cb0e
[Doc] Add note about context length in Phi-3-Vision example (#5887) 2024-06-26 23:20:01 -07:00
Nick Hill 2110557dab
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>
2024-06-27 04:12:10 +00:00
Roger Wang b9e84259e9
[Misc] Add example for LLaVA-NeXT (#5879) 2024-06-26 17:57:16 -07:00
Roger Wang 3aa7b6cf66
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832) 2024-06-25 20:34:25 -07:00
Joshua Rosenkranz b12518d3cf
[Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin 8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
Isotr0py 7d46c8d378
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684) 2024-06-19 17:58:32 +08:00
Ronen Schaffer 7879f24dcc
[Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown guide with screenshots that shows users how to use this feature.
2024-06-19 01:17:03 +09:00
Isotr0py daef218b55
[Model] Initialize Phi-3-vision support (#4986) 2024-06-17 19:34:33 -07:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Allen.Dou d74674bbd9
[Misc] Fix arg names (#5524) 2024-06-14 09:47:44 -07:00
Allen.Dou 55d6361b13
[Misc] Fix arg names in quantizer script (#5507) 2024-06-13 19:02:53 -07:00
Travis Johnson 51602eefd3
[Frontend] [Core] Support for sharded tensorized models (#4990)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Roger Wang 7a9cb294ae
[Frontend] Add OpenAI Vision API Support (#5237)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
Zhuohan Li bd0e7802e0
[Bugfix] Add warmup for prefix caching example (#5235) 2024-06-03 19:36:41 -07:00
Cyrus Leung 7a64d24aad
[Core] Support image processor (#4197) 2024-06-02 22:56:41 -07:00
Daniil Arapov c2d6d2f960
[Bugfix]: Fix issues related to prefix caching example (#5177) (#5180) 2024-06-01 15:53:52 -07:00
chenqianfzh b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) 2024-06-01 14:51:10 -06:00
Ronen Schaffer ae495c74ea
[Doc]Replace deprecated flag in readme (#4526) 2024-05-29 22:26:33 +00:00
Cyrus Leung 5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Antoni Baum c5711ef985
[Doc] Update Ray Data distributed offline inference example (#4871) 2024-05-17 10:52:11 -07:00
Alex Wu dbc0754ddf
[docs] Fix typo in examples filename openi -> openai (#4864) 2024-05-17 00:42:17 +09:00
Aurick Qiao 30e754390c
[Core] Implement sharded state loader (#4690)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu 52f8107cf2
[Frontend] Support OpenAI batch file format (#4794)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Sanger Steel 8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) 2024-05-13 14:57:07 -07:00
Chang Su e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) 2024-05-11 11:30:37 -07:00
Hao Zhang ebce310b74
[Model] Snowflake arctic model implementation (#4652)
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-09 22:37:14 +00:00
Robert Shaw cea64430f6
[Bugfix] Update grafana.json (#4711) 2024-05-09 10:10:13 -07:00
Danny Guinther b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
Ronen Schaffer bf480c5302
Add more Prometheus metrics (#2764)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-28 15:59:33 -07:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Harry Mellor 3d925165f2
Add example scripts to documentation (#4225)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-22 16:36:54 +00:00
Antoni Baum 69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Sanger Steel d619ae2d19
[Doc] Add better clarity for tensorizer usage (#4090)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-15 13:28:25 -07:00
Sanger Steel 711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) 2024-04-13 17:13:01 -07:00
Cade Daniel e0dd4d3589
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864) 2024-04-04 21:57:33 -07:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Woosuk Kwon c0935c96d3
[Bugfix] Set enable_prefix_caching=True in prefix caching example (#3703) 2024-03-28 16:26:30 -07:00
Simon Mo a4075cba4d
[CI] Add test case to run examples scripts (#3638) 2024-03-28 14:36:10 -07:00
xwjiang2010 64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Zhuohan Li e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
Dinghow Yang cf6ff18246
Fix Baichuan chat template (#3340) 2024-03-15 21:02:12 -07:00
Dinghow Yang 253a98078a
Add chat templates for ChatGLM (#3418) 2024-03-14 23:19:22 -07:00
Dinghow Yang 21539e6856
Add chat templates for Falcon (#3420) 2024-03-14 23:19:02 -07:00
Allen.Dou a37415c31b
allow user to choose which vllm's metrics to display in grafana (#3393) 2024-03-14 06:35:13 +00:00
DAIZHENWEI 654865e21d
Support Mistral Model Inference with transformers-neuronx (#3153) 2024-03-11 13:19:51 -07:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Liangfu Chen 3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
jvmncs 8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs
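
a minimal sketch of checking that listing from a client, using only the Python standard library. the route `/v1/models`, the address `http://localhost:8000`, and the helper names `extract_model_ids` and `list_served_models` are assumptions for illustration and are not taken from this commit; the sample payload is made up to mirror the three entries described above.

```python
import json
from urllib.request import urlopen


def extract_model_ids(payload: dict) -> list[str]:
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Fetch the model list from a running server (hypothetical helper).

    Assumption: the server exposes an OpenAI-compatible `/v1/models` route
    at `base_url`; adjust both to match the actual deployment.
    """
    with urlopen(f"{base_url}/v1/models") as response:
        return extract_model_ids(json.load(response))


# Illustrative payload only: one base model plus the two LoRA modules
# named in the command above. A real response carries more fields.
sample = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-2-7b-hf", "object": "model"},
        {"id": "sql-lora", "object": "model"},
        {"id": "sql-lora2", "object": "model"},
    ],
}

print(extract_model_ids(sample))
```

against a live server, `list_served_models()` would be called instead of parsing `sample`; a client would then pass one of the returned ids as the `model` field of its request.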

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Cheng Su 4abf6336ec
Add one example to run batch inference distributed on Ray (#2696) 2024-02-02 15:41:42 -08:00
Robert Shaw 93b38bea5d
Refactor Prometheus and Add Request Level Metrics (#2316) 2024-01-31 14:58:07 -08:00
Simon Mo 1e4277d2d1
lint: format all python file instead of just source code (#2567) 2024-01-23 15:53:06 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu 5d80a9178b
Minor fix in prefill cache example (#2494) 2024-01-18 09:40:34 -08:00
shiyi.c_98 d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
arkohut 97460585d9
Add gradio chatbot for openai webserver (#2307) 2024-01-11 19:45:56 -08:00
KKY 74cd5abdd1
Add baichuan chat template jinja file (#2390) 2024-01-09 09:13:02 -08:00
Ronen Schaffer 1066cbd152
Remove deprecated parameter: concurrency_count (#2315) 2024-01-03 09:56:21 -08:00
Massimiliano Pronesti c07a442854
chore(examples-docs): upgrade to OpenAI V1 (#1785) 2023-12-03 01:11:22 -08:00
Adam Brusselback 66785cc05c
Support chat template and `echo` for chat API (#1756) 2023-11-30 16:43:13 -08:00
iongpt ac8d36f3e5
Refactor LLMEngine demo script for clarity and modularity (#1413)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 09:14:37 -07:00
Zhuohan Li 9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Yunfeng Bai 09ff7f106a
API server support ipv4 / ipv6 dualstack (#1288)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-07 15:15:54 -07:00
Woosuk Kwon 55fe8a81ec
Refactor scheduler (#658) 2023-08-02 16:42:01 -07:00
Zhuohan Li 1b0bd0fe8a
Add Falcon support (new) (#592) 2023-08-02 14:04:39 -07:00
Zhuohan Li 82ad323dee
[Fix] Add chat completion Example and simplify dependencies (#576) 2023-07-25 23:45:48 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 14f0b39cda
[Bugfix] Fix a bug in RequestOutput.finished (#202) 2023-06-22 00:17:24 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Zhuohan Li e5464ee484
Rename servers to engines (#152) 2023-06-17 17:25:21 +08:00
Zhuohan Li eedb46bf03
Rename servers and change port numbers to reduce confusion (#149) 2023-06-17 00:13:02 +08:00
Woosuk Kwon 311490a720
Add script for benchmarking serving throughput (#145) 2023-06-14 19:55:38 -07:00
Zhuohan Li 5020e1e80c
Non-streaming simple fastapi server (#144) 2023-06-10 10:43:07 -07:00
Zhuohan Li 4298374265
Add docstrings for LLMServer and related classes and examples (#142) 2023-06-07 18:25:20 +08:00
Woosuk Kwon 211318d44a
Add throughput benchmarking script (#133) 2023-05-28 03:20:05 -07:00
Zhuohan Li 057daef778
OpenAI Compatible Frontend (#116) 2023-05-23 21:39:50 -07:00
Woosuk Kwon 655a5e48df
Introduce LLM class for offline inference (#115) 2023-05-21 17:04:18 -07:00
Woosuk Kwon f746ced08d
Implement stop strings and best_of (#114) 2023-05-21 11:18:00 -07:00
Woosuk Kwon c3442c1f6f
Refactor system architecture (#109) 2023-05-20 13:06:59 -07:00