Commit Graph

2143 Commits

Danny Guinther b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
sasha0552 c47ba4aaa9
[Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
Nick Hill a657bfc48a
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
leiwen83 24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83 b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho 6f1df80436
[Test] Add ignore_eos test (#4519) 2024-05-01 08:45:42 -04:00
Jee Li d6f4bd7cdd
[Misc] Add customized information for models (#4132) 2024-04-30 21:18:14 -07:00
Robert Caulk c3845d82dc
Allow user to define whitespace pattern for outlines (#4305) 2024-04-30 20:48:39 -07:00
Florian Greinacher a494140433
[Frontend] Support complex message content for chat completions endpoint (#3467)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw 111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83 4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) 2024-04-29 13:52:22 -07:00
Robert Shaw 73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Prashant Gupta d6e520e170
[Core] Support offline use of local cache for models (#4374)
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill 81661da7b2
[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389)
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) 2024-04-27 09:48:37 -07:00
Roy 7134303cbb
[Bugfix][Core] Fix get decoding config from ray (#4335) 2024-04-27 11:30:08 +00:00
Austin Veselka eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Cyrus Leung 8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) 2024-04-27 05:08:24 +00:00
Cody Yu a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method (#4373) 2024-04-26 16:41:14 -04:00
SangBin Cho 603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) 2024-04-26 13:02:02 +00:00
Cyrus Leung a74dee9b62
[Bugfix] Fix parameter name in `get_tokenizer` (#4107) 2024-04-25 19:10:48 -07:00
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao 91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
Cyrus Leung 1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Cade Daniel 62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho 050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
SangBin Cho ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
GeauxEric a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj 91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Cody Yu a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated on every forward pass.

Initial Results:
Currently tested with Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
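
For reference, a minimal PyTorch sketch of the per-tensor dynamic scaling described above (illustrative only, not vLLM's actual Fp8LinearMethod; assumes PyTorch 2.1+ for torch.float8_e4m3fn):

```python
import torch

FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn

def per_tensor_quantize(t: torch.Tensor):
    """Quantize to FP8 with a single per-tensor scale, so t ~= q * scale."""
    scale = t.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

# Weights: quantized once after the FP16/BF16 checkpoint is loaded;
# the scale is stored for reuse.
weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, w_scale = per_tensor_quantize(weight)

# Activations: the scale is recomputed on every forward pass (dynamic).
x = torch.randn(8, 4096, dtype=torch.bfloat16)
x_fp8, x_scale = per_tensor_quantize(x)

# Reference matmul: dequantize and rescale (a real kernel would run the
# GEMM in FP8, e.g. via torch._scaled_mm or cuBLASLt, instead).
out = (x_fp8.to(torch.bfloat16) * x_scale) @ (w_fp8.to(torch.bfloat16) * w_scale).t()
```
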
Ayush Rautwar 138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Jee Li d17c8477f1
[Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
youkaichao 8a7a3e4436
[Core] add an option to log every function call for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
James Whedbee e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Michał Moskal e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) 2024-04-18 00:51:28 -07:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
youkaichao 6dc1fc9cfe
[Core] Add nccl integrity check during initialization and a test for it (#4155)
2024-04-17 22:28:52 -07:00
Shoichi Uchinami a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
youkaichao 8438e0569e
[Core] Replace the narrow-usage RayWorkerVllm with a general WorkerWrapper to reduce code duplication (#4024)
2024-04-17 08:34:33 +00:00
Cade Daniel e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Antoni Baum 69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Noam Gat 05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho 4e7ee664e2
[Core] Fix engine-use-ray broken (#4105) 2024-04-16 05:24:53 +00:00
Sanger Steel 711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) 2024-04-13 17:13:01 -07:00
Jee Li 989ae2538d
[Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
SangBin Cho 36729bac13
[Test] Test multiple attn backend for chunked prefill. (#4023) 2024-04-12 09:56:57 -07:00
Jee Li 1096717ae9
[Core] Support LoRA on quantized models (#4012) 2024-04-11 21:02:44 -07:00
Nick Hill e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids (#3672) 2024-04-11 15:34:12 -07:00
Antoni Baum 1e96c3341a
Add extra punica sizes to support bigger vocabs (#4015) 2024-04-11 22:18:57 +00:00
Dylan Hawk 95e7d4a97c
Fix echo/logprob OpenAI completion bug (#3441)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Antoni Baum a10d3056da
[Core] Set `linear_weights` directly on the layer (#3977) 2024-04-11 16:35:51 -04:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho e42df7227d
[Test] Add xformer and flash attn tests (#3961)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-11 03:09:50 +00:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
youkaichao 63e7176f26
[Core][Refactor] Move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
Travis Johnson 0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文 b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) 2024-04-10 00:09:36 -07:00
Jee Li 11dd6ebb89
[Misc] Avoid loading incorrect LoRA config (#3777) 2024-04-09 19:47:15 -07:00
Cade Daniel e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
youkaichao 95baec828f
[Core] enable out-of-tree model register (#3871) 2024-04-06 17:11:41 -07:00
SangBin Cho 18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) 2024-04-05 10:17:58 -07:00
Cade Daniel e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup (#3863) 2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser aabe8f40f2
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
Michael Feil 537ee25f43
[Core] Enable hf_transfer by default if available (#3817) 2024-04-04 04:02:43 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho 3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling (#3550) 2024-04-03 14:13:49 -07:00
Cade Daniel 5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
Cade Daniel eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) 2024-04-02 00:49:51 +00:00
Qubitium 7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format (#3689)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel 93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) 2024-04-01 22:55:24 +00:00
Nick Hill 49782fcb76
[Misc] Some minor simplifications to detokenization logic (#3670)
Some simplifications were made for clarity.

Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Robert Shaw 563c1d7ec5
[CI/Build] Make Marlin Tests Green (#3753) 2024-03-30 19:18:34 -07:00
mawong-amd b6d103542c
[Kernel] Layernorm performance optimization (#3662) 2024-03-30 14:26:38 -07:00
Roy f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error (#3731) 2024-03-29 09:38:21 -07:00
Roy 6110c39dc8
[BugFix] Fix tokenizer out of vocab size (#3685) 2024-03-29 08:18:59 -07:00
youkaichao 756b30a5f3
[Core][Test] Move local_rank to the last arg with a default value to keep the API compatible (#3711)
2024-03-28 21:19:45 -07:00
SangBin Cho 26422e477b
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
Roy 515386ef3c
[Core] Support multi-node inference (eager and cuda graph) (#3686) 2024-03-28 15:01:55 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Cade Daniel 14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
youkaichao 8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
Jee Li 566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Jee Li 8af890a865
Enable more models to perform inference with LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010 64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Simon Mo f408d05c52
hotfix isort on logprobs ranks pr (#3622) 2024-03-25 11:55:46 -07:00
Dylan Hawk 0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh 819924e749
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Woosuk Kwon 925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
youkaichao 837e185142
[CI/Build] fix flaky test (#3602) 2024-03-24 17:43:05 -07:00
youkaichao 8b268a46a7
[CI] typo fix: is_hip --> is_hip() (#3595) 2024-03-24 16:03:06 -07:00
Nick Hill 41deac4a3d
[BugFix] 1D query fix for MoE models (#3597) 2024-03-24 16:00:16 -07:00
Antoni Baum bfdb1ba5c3
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell cf2f084d56
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Zhuohan Li e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy ea5f14e6ff
[Bugfix][Model] Fix Qwen2 (#3554) 2024-03-22 00:18:58 +00:00
Roy f1c0fc3919
Migrate `logits` computation and gather to `model_runner` (#3233) 2024-03-20 23:25:01 +00:00
SangBin Cho 6e435de766
[1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum 426ec4ec67
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon 5ee14494e4
[Misc] Remove cache stream and cache events (#3461) 2024-03-20 00:38:53 -07:00
ElizaWszola 9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw 097aa0ea22
[CI/Build] Fix Bad Import In Test (#3473) 2024-03-18 20:28:00 +00:00
Simon Mo 120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) 2024-03-16 13:35:27 -07:00
simon-mo ad50bf4b25 fix lint 2024-03-15 22:23:38 -07:00
Tao He 3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
Antoni Baum fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
陈序 54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria 49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
Zhuohan Li 4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Roy 9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Terry 0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Cade Daniel 8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
ElizaWszola b35cc93420
Fix auto prefix bug (#3239) 2024-03-07 16:37:28 -08:00
jacobthebanana 8cbba4622c
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) 2024-03-07 23:03:22 +00:00
Woosuk Kwon 2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Cade Daniel a33ce60c66
[Testing] Fix core tests (#3224) 2024-03-06 01:04:23 -08:00
SangBin Cho 24aecf421a
[Tests] Add block manager and scheduler tests (#3108) 2024-03-05 18:23:34 -08:00
Nick Hill 8999ec3c16
Store `eos_token_id` in `Sequence` for easy access (#3166) 2024-03-05 15:35:43 -08:00
Antoni Baum ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum 22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555 703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Woosuk Kwon 929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen 3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He 71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore 70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor ef978fe411
Port metrics from `aioprometheus` to `prometheus_client` (#2730) 2024-02-25 11:54:00 -08:00
Ronen Schaffer 4caf7044e0
Include tokens from prompt phase in `counter_generation_tokens` (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Nick Hill 7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Antoni Baum 017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Zhuohan Li 63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
Ronen Schaffer e433c115bc
Fix `vllm:prompt_tokens_total` metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Isotr0py ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
jvmncs 8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list three separate entries when the user queries `/models`: one for the base served model, and one for each specified LoRA module. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
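
A quick way to inspect the served model list for the server above (a sketch assuming the OpenAI-compatible server is running on the default port 8000):

```python
import requests

# Lists the served models: the base model plus one entry per LoRA module.
resp = requests.get("http://localhost:8000/v1/models")
for model in resp.json()["data"]:
    print(model["id"])  # e.g. meta-llama/Llama-2-7b-hf, sql-lora, sql-lora2
```
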
Woosuk Kwon d7afab6d3a
[BugFix] Fix GC bug for `LLM` class (#2882) 2024-02-14 22:17:44 -08:00
Terry 2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Hongxia Yang 56f738ae9b
[ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz 89efcf1ce5
[Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir 4f65af0e25
Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou 380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Simon Mo 3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Nikola Borisov 3209b49033
[Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu 7a0b011dd5
Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
Cade Daniel 18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Zhuohan Li ef9b636e2d
Simplify broadcast logic for control messages (#2501) 2024-01-19 11:23:30 -08:00
Simon Mo dd7e8f5f64
refactor completion api for readability (#2499) 2024-01-18 16:45:14 -08:00
shiyi.c_98 d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour 14cc317ba4
OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00
Hyunsung Lee e1957c6ebd
Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon Mo 6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
陈序 218dc2ccda
Aligning `top_p` and `top_k` Sampling (#1885)
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
2024-01-12 22:51:03 +01:00
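
For context, a rough sketch of the Hugging Face-style filtering order this change aligns with (illustrative only, not vLLM's actual sampler code):

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    """HF-style filtering: apply top-k first, then top-p on the survivors."""
    if top_k > 0:
        # Keep only the k highest logits; mask the rest to -inf.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        # Drop tokens once the cumulative probability *before* them exceeds
        # top_p, so the first token crossing the threshold is still kept.
        mask = probs.cumsum(dim=-1) - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return logits
```
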
Cade Daniel 79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler (#2336) 2024-01-09 15:38:41 -08:00
Woosuk Kwon 941767127c
Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
Zhuohan Li 358c328d69
[BUGFIX] Fix communication test (#2285) 2023-12-27 17:18:11 -05:00
Zhuohan Li 4aaafdd289
[BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00
Zhuohan Li 66b108d142
[BUGFIX] Fix API server test (#2270) 2023-12-26 10:37:06 -08:00
avideci de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
Woosuk Kwon f8c688d746
[Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon f1c8520146
[BugFix] Fix input positions for long context with sliding window (#2088) 2023-12-13 12:28:13 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads in the attention kernel (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Woosuk Kwon cd3aa153a4
Fix broken worker test (#1900) 2023-12-02 22:17:33 -08:00
Woosuk Kwon 9b294976a2
Add PyTorch-native implementation of custom layers (#1898) 2023-12-02 21:18:40 -08:00
Woosuk Kwon 5f09cbdb63
Fix broken sampler tests (#1896)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-02 16:06:17 -08:00
Adam Brusselback 66785cc05c
Support chat template and `echo` for chat API (#1756) 2023-11-30 16:43:13 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Zhuohan Li 20d0699d49
[Fix] Fix comm test (#1691) 2023-11-16 16:28:39 -08:00
maximzubkov 521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
Simon Mo cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Yanming W 8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) 2023-11-08 14:19:12 -08:00
Noam Gat 555bdcc5a3
Added logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
Cade Daniel e575df33b1
[Small] Formatter only checks lints in changed files (#1528) 2023-10-31 15:39:38 -07:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
Woosuk Kwon 9524867701
Add Mistral 7B to `test_models` (#1366) 2023-10-16 17:49:54 -07:00
Woosuk Kwon d3a5bd9fb7
Fix sampler test (#1379) 2023-10-16 12:57:26 -07:00
Zhuohan Li 9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Zhuohan Li ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon 6f88f762bf
Fix OOM in attention kernel test (#1223) 2023-09-28 14:33:24 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Zhuohan Li 947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task (#1059) 2023-09-17 00:29:08 -07:00
Antoni Baum dd54a4b026
Fix detokenization leaving special tokens (#1044)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-14 16:37:03 -07:00
Antoni Baum 9841d48a10
Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
Woosuk Kwon e67b4f2c2a
Use FP32 in RoPE initialization (#1004)
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Antoni Baum 080438477f
Start background task in `AsyncLLMEngine.generate` (#988)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* Fix test_attention
* Fix
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Antoni Baum c07ece5ca4
Make `AsyncLLMEngine` more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Antoni Baum c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Woosuk Kwon fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Zhuohan Li 002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Woosuk Kwon 32b6816e55
Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani 75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Tao Peng d7a1c6d614
Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song bda41c70dd
hotfix attn alibi wo head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon a283ec2eec
Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon 825d8892b5
Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon c9d5b6d4a8
Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon a1b3de86cd
Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00
Woosuk Kwon 3e9f991d6a
Use FlashAttention for `multi_query_kv_attention` (#4) 2023-03-01 21:13:08 -08:00
Woosuk Kwon 0deacbce6e
Implement `single_query_cached_kv_attention` kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon af68ec1c5c Add tests for kernels 2023-02-18 19:23:07 +00:00