Commit Graph

325 Commits

Author SHA1 Message Date
Roger Wang 29c748930e
[CI] Fix flaky entrypoint tests (#11403)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-21 21:08:44 -08:00
Yanyi Liu 5aef49806d
[Feature] Add load generation config from model (#11164)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-19 10:50:38 +00:00
Michael Goin a30482f054
[CI] Expand test_guided_generate to test all backends (#11313)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-19 04:00:38 +00:00
Michael Goin c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264) 2024-12-17 16:34:06 -08:00
Joe Runde 2d1b9baa8f
[Bugfix] Fix request cancellation without polling (#11190) 2024-12-17 12:26:32 -08:00
kYLe 66d4b16724
[Frontend] Add OpenAI API support for input_audio (#11027)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-16 22:09:58 -08:00
Michael Goin 0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse (#10935)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-17 11:39:58 +08:00
youkaichao 551603feff
[core] overhaul memory profiling and fix backward compatibility (#10511)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 13:32:25 -08:00
Isotr0py d927dbcd88
[Model] Refactor Ultravox to use merged input processor (#11198)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-16 10:09:53 +00:00
Brad Hilton 9c3dadd1c9
[Frontend] Add `logits_processors` as an extra completion argument (#11150)
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
2024-12-14 16:46:42 +00:00
Cyrus Leung 0920ab9131
[Doc] Reorganize online pooling APIs (#11172)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 00:22:22 +08:00
Cyrus Leung eeec9e3390
[Frontend] Separate pooling APIs in offline inference (#11129)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-13 10:40:07 +00:00
Jiaxin Shan 85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-12 09:25:16 +00:00
Cyrus Leung 8f10d5e393
[Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
Isotr0py a811dd6608
[Model] merged input processor for Phi-3-Vision models (#10977)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-09 12:55:10 -08:00
Michael Goin 8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas (#10899)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-05 11:14:06 +08:00
Aaron Pham 9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-12-03 15:17:00 +08:00
Cyrus Leung d2f058e76c
[Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 14:36:51 +08:00
tomeras91 395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-27 13:21:10 -08:00
youkaichao 308cc5e21e
[ci] fix slow tests (#10698)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 09:26:14 -08:00
youkaichao 334d64d1e8
[ci] add vllm_test_utils (#10659)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-26 00:20:04 -08:00
Chauncey d04b13a380
[Bug]: Authorization ignored when root_path is set (#10606)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-25 16:21:41 +00:00
Maximilien de Bayser 214efc2c3c
Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-24 18:56:20 -08:00
Varun Vinayak Shenoy 7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
2024-11-22 21:13:29 -08:00
Travis Johnson 9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-23 10:17:38 +08:00
Chauncey da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
Guillaume Calmettes c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
Cyrus Leung 32e46e000f
[Frontend] Automatic detection of chat content format from AST (#9919)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-16 13:35:40 +08:00
Mike Depinet f67ce05d0b
[Frontend] Pythonic tool parser (#9859)
Signed-off-by: Mike Depinet <mike@fixie.ai>
2024-11-14 04:14:34 +00:00
Robert Shaw 6ace6fba2c
[V1] `AsyncLLM` Implementation (#9826)
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-11 23:05:38 +00:00
litianjian 28b2877d30
Online video support for VLMs (#10020)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 20:25:59 +00:00
Joe Runde d58268c56a
[V1] Make v1 more testable (#9888)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
tomeras91 ac04a97a9f
[Frontend] Add max_tokens prometheus metric (#9881)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-04 22:53:24 +00:00
Robert Shaw 1c45f4c385
[CI] Basic Integration Test For TPU (#9968)
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
2024-11-04 11:34:26 -08:00
Cyrus Leung ba0d892074
[Frontend] Use a proper chat template for VLM2Vec (#9912) 2024-11-01 14:09:07 +00:00
Cyrus Leung 06386a64dd
[Frontend] Chat-based Embeddings API (#9759) 2024-11-01 08:13:35 +00:00
Joe Runde 031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-01 01:09:46 +00:00
Guillaume Calmettes abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint (#9837) 2024-10-30 18:15:56 -07:00
Joe Runde 67bdf8e523
[Bugfix][Frontend] Guard against bad token ids (#9634)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-29 14:13:20 -07:00
Zhong Qishuai ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation (#9427)
Signed-off-by: Qishuai Ferdinandzhong@gmail.com
2024-10-29 11:49:47 +00:00
Vinay R Damodaran 33bab41060
[Bugfix]: Make chat content text allow type content (#9358)
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2024-10-24 05:05:49 +00:00
Alex Brooks 150b779081
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-23 17:28:57 +00:00
Wallas Henrique c0292211ce
[CI/Build] Replaced some models on tests for smaller ones (#9570)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-22 04:52:14 +00:00
youkaichao 76a5e13270
[core] move parallel sampling out from vllm core (#9302) 2024-10-22 00:31:44 +00:00
Chen Zhang 5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger (#9530) 2024-10-20 00:05:02 +00:00
Yue Zhang c5eea3c8ba
[Frontend] Support simpler image input format (#9478) 2024-10-18 23:17:07 -07:00
sasha0552 337ed76671
[Bugfix] Fix offline mode when using `mistral_common` (#9457) 2024-10-18 18:12:32 -07:00
Cody Yu d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Cyrus Leung 051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Joe Runde de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Chang Su ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-15 15:40:43 -07:00
Nick Hill e9d517f276
[BugFix] Fix chat API continuous usage stats (#9357) 2024-10-14 23:19:48 -07:00
youkaichao cbc2ef5529
[misc] hide best_of from engine (#9261)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
Daniele 9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (#8537) 2024-10-08 09:38:40 -07:00
Alex Brooks 069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser (#9151)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:31:26 +00:00
Brendan Wong 8c746226c9
[Frontend] API support for beam search for MQLLMEngine (#9117) 2024-10-08 05:51:43 +00:00
Brendan Wong 168cab6bbf
[Frontend] API support for beam search (#9087)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Flávia Béo 0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation (#8999)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-10-04 18:31:40 +00:00
Roger Wang 26aa325f4f
[Core][VLM] Test registration for OOT multimodal models (#8717)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-04 10:38:25 -07:00
Joe Runde 062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params (#8252)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-01 09:34:25 +08:00
danieljannai21 6c9ba48fde
[Frontend] Added support for HF's new `continue_final_message` parameter (#8942) 2024-09-29 17:59:47 +00:00
Cyrus Leung 3b00b9c26c
[Core] rename`PromptInputs` and `inputs` (#8876) 2024-09-26 20:35:15 -07:00
Nick Hill 4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade (#8829) 2024-09-26 16:46:43 -07:00
Simon Mo 4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) 2024-09-25 10:36:26 -07:00
Cyrus Leung 28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
Andy 2529d09b5a
[Frontend] Batch inference for llm.chat() API (#8648)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-09-24 09:44:11 -07:00
Alexander Matveev 1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335) 2024-09-23 15:38:04 -07:00
Jiaxin Shan 260d40b5ea
[Core] Support Lora lineage and base model metadata management (#6315) 2024-09-20 06:20:56 +00:00
Alexander Matveev 7c7714d856
[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Joe Runde f2e263b801
[Bugfix] Offline mode fix (#8376)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-12 11:11:57 -07:00
Pooya Davoodi cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) 2024-09-11 05:30:11 +00:00
Jiaxin Shan db3bf7c991
[Core] Support load and unload LoRA in api server (#6566)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Cyrus Leung 855c262a6b
[Frontend] Multimodal support in offline chat (#8098) 2024-09-04 05:22:17 +00:00
Roger Wang 5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Nick Hill 39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
youkaichao 7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
Pooya Davoodi 8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Maximilien de Bayser e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
Joe Runde b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Joe Runde cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
Peter Salas 1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Robert Shaw 970dfdc01d
[Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
Robert Shaw f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
Cyrus Leung baaedfdb2d
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Robert Shaw e3b318216d
[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang bbf55c4805
[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
nunjunj 3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
Grant Pinkert f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) 2024-08-16 02:38:08 +00:00
youkaichao 16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
youkaichao 33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Peter Salas 00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Cyrus Leung 7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Andrew Wang 97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Pooya Davoodi 249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Cyrus Leung 7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Joe Runde 21b9c49aa3
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Maximilien de Bayser fde47d3bc2
[BugFix] Fix frontend multiprocessing hang (#7217)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Cyrus Leung 66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure (#7238)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
youkaichao dfb1a15dcb
[ci][frontend] deduplicate tests (#7101) 2024-08-05 15:59:22 -07:00
Yihuan Bu 654bc5ca49
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
Robert Shaw ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq` (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
youkaichao 806949514a
[ci] set timeout for test_oot_registration.py (#7082) 2024-08-02 10:03:24 -07:00
zifeitong 3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
Nick Hill 9f69d8245a
[Frontend] New `allowed_token_ids` decoding request parameter (#6753) 2024-07-29 23:37:27 +00:00
Chang Su 316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) 2024-07-24 22:48:07 -07:00
Evan Z. Liu 5689e256ba
[Frontend] Represent tokens with identifiable strings (#6626) 2024-07-25 09:51:00 +08:00
Yehoshua Cohen 58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) 2024-07-23 11:41:55 -07:00
Cyrus Leung 97234be0ec
[Misc] Manage HTTP connections in one place (#6600) 2024-07-22 21:32:02 -07:00
Cyrus Leung 739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Cyrus Leung 6366efc67b
[Bugfix][Frontend] Fix missing `/metrics` endpoint (#6463) 2024-07-19 03:55:13 +00:00
Nick Hill e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-18 15:13:30 +08:00
Cyrus Leung 5bf35a91e4
[Doc][CI/Build] Update docs and tests to use `vllm serve` (#6431) 2024-07-17 07:43:21 +00:00
sasha0552 7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint (#5923) 2024-07-16 20:18:09 +08:00
Joe d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) 2024-07-15 18:54:15 -07:00
zifeitong b47008b4d2
[BugFix] BatchResponseData body should be optional (#6345)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-15 04:06:09 +00:00
youkaichao 41708e5034
[ci] try to add multi-node tests (#6280)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-12 21:51:48 -07:00
Yihuan Bu b039cbbce3
[Misc] add fixture to guided processor tests (#6341) 2024-07-12 09:55:39 -07:00
jvlunteren f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API (#5742) 2024-07-05 10:37:09 -07:00
xwjiang2010 d9e98f42e4
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
SangBin Cho d18bab3587
[CI] Fix base url doesn't strip "/" (#6087) 2024-07-02 21:31:25 -07:00
Murali Andoorveedu c5832d2ae9
[Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
xwjiang2010 98d6682cd1
[VLM] Remove `image_input_type` from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
llmpros c6c240aa0a
[Frontend]: Support base64 embedding (#5935)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-30 23:53:00 +08:00
Cyrus Leung 9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests (#5966) 2024-06-30 12:58:49 +08:00
Matt Wong 9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949) 2024-06-29 12:47:58 -07:00
Cyrus Leung 3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests (#5904) 2024-06-28 07:59:18 -07:00
Cyrus Leung e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests (#5526) 2024-06-27 12:43:17 +00:00
sasha0552 c54269d967
[Frontend] Add tokenize/detokenize endpoints (#5054) 2024-06-26 16:54:22 +00:00
Matt Wong dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
Cyrus Leung 81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) 2024-06-15 07:29:42 -04:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Cyrus Leung 39873476f8
[CI/Build] Simplify OpenAI server setup in tests (#5100) 2024-06-13 11:21:53 -07:00
Cyrus Leung 640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) 2024-06-10 22:36:46 -07:00
maor-ps 351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Itay Etelis 774d1035e4
[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319) 2024-06-10 14:22:09 +00:00
Roger Wang 7a9cb294ae
[Frontend] Add OpenAI Vision API Support (#5237)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
Itay Etelis baa15a9ec3
[Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) 2024-06-07 03:29:24 +00:00
Matthew Goldey 828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) 2024-06-06 15:48:13 -05:00
Breno Faria 7b0a0dfb22
[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-06-05 16:49:12 -07:00
Toshiki Kataoka 06b2550cbb
[Bugfix] Support `prompt_logprobs==0` (#5217) 2024-06-03 17:59:30 -07:00
Breno Faria f775a07e30
[FRONTEND] OpenAI `tools` support named functions (#5032) 2024-06-03 18:25:29 -05:00
Breno Faria 87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-05-30 02:52:14 -07:00
Cyrus Leung 5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Alex Wu 52f8107cf2
[Frontend] Support OpenAI batch file format (#4794)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API (#4758) 2024-05-15 14:58:46 -07:00
Cyrus Leung 350f9e107f
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)

Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
Chang Su e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) 2024-05-11 11:30:37 -07:00
Cyrus Leung f12b20decc
[Frontend] Move async logic outside of constructor (#4674) 2024-05-08 22:48:33 -07:00
Sebastian Schoennenbeck f8e7adda21
Fix/async chat serving (#2727) 2024-05-03 11:04:14 -07:00
sasha0552 c47ba4aaa9
[Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
Robert Caulk c3845d82dc
Allow user to define whitespace pattern for outlines (#4305) 2024-04-30 20:48:39 -07:00
Florian Greinacher a494140433
[Frontend] Support complex message content for chat completions endpoint (#3467)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Cyrus Leung 8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) 2024-04-27 05:08:24 +00:00
nunjunj 91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Ayush Rautwar 138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
James Whedbee e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Noam Gat 05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
Dylan Hawk 95e7d4a97c
Fix echo/logprob OpenAI completion bug (#3441)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
youkaichao 95baec828f
[Core] enable out-of-tree model register (#3871) 2024-04-06 17:11:41 -07:00
Roy f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error (#3731) 2024-03-29 09:38:21 -07:00
Dylan Hawk 0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Simon Mo 120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) 2024-03-16 13:35:27 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Antoni Baum 22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
felixzhu555 703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Dylan Hawk e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore 70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
jvmncs 8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Simon Mo 3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Simon Mo dd7e8f5f64
refactor complemention api for readability (#2499) 2024-01-18 16:45:14 -08:00
FlorianJoncour 14cc317ba4
OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00