afeldman-nm
f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend ( #5210 )
2024-06-03 20:32:57 -07:00
Toshiki Kataoka
06b2550cbb
[Bugfix] Support `prompt_logprobs==0` ( #5217 )
2024-06-03 17:59:30 -07:00
Breno Faria
f775a07e30
[FRONTEND] OpenAI `tools` support named functions ( #5032 )
2024-06-03 18:25:29 -05:00
Kaiyang Chen
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
Yuan
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Cyrus Leung
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
Cyrus Leung
dfbe60dc62
[Misc] Simplify code and fix type annotations in `conftest.py` ( #5118 )
2024-06-02 16:05:50 -07:00
Simon Mo
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Varun Sundar Rabindranath
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Breno Faria
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-05-30 02:52:14 -07:00
Cyrus Leung
b1c255630d
[Core] Avoid the need to pass `None` values to `Sequence.inputs` ( #5099 )
2024-05-29 16:05:01 -07:00
Cyrus Leung
eecd864388
[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` ( #5096 )
2024-05-29 16:02:25 -07:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
Cyrus Leung
18c1f16d86
[Bugfix] Fix arguments passed to `Sequence` in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
youkaichao
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
Junichi Sato
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
Zhuohan Li
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-27 15:18:17 -07:00
Lily Liu
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Murali Andoorveedu
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
SangBin Cho
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
Isotr0py
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Woosuk Kwon
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
Alexander Matveev
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
Cyrus Leung
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
SangBin Cho
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through.
It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
Jinzhen Lin
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Jinzhen Lin
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
alexm-nm
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
Cyrus Leung
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
SangBin Cho
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. This PR also coelsce the attn metadata for prefill/decode to a single class and allow to slice them when running attn backend.
It also refactors subquery_start_loc which was not refactored in the previous PR
2024-05-15 14:00:10 +09:00
SangBin Cho
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
Lora 3 & 4 test seems to have illegal memory access failure after this commit;
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
<br class="Apple-interchange-newline">
Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5
.
FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (link existing issues this PR will resolve)
2024-05-15 11:52:45 +09:00
Nick Hill
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
Sanger Steel
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move `test_utils.py` to `tests/utils.py` ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Robert Shaw
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
SangBin Cho
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing exception frame is extremely prone to circular refernece because it contains the reference to objects.
When tensorizer is not installed, it leaks llm instance because error frame has references to various modules which cause circular reference problem.
I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
Allen.Dou
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
youkaichao
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
Woosuk Kwon
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
Woosuk Kwon
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
youkaichao
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
DefTruth
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
SangBin Cho
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
youkaichao
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00
youkaichao
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
DearPlanet
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
Michael Goin
2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) ( #4527 )
...
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436 .
This PR enables the following checkpoint loading features for Mixtral:
Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
Supports static or dynamic activation quantization with static weight quantization (all per tensor)
Supports different scales for each expert weight
Supports Fp8 in QKV layer
Notes:
The Expert Gate/Router always runs at half / full precision for now.
If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
2024-05-04 11:45:16 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
Cade Daniel
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
youkaichao
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
SangBin Cho
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
Michał Moskal
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
youkaichao
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
Ronen Schaffer
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
SangBin Cho
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
Danny Guinther
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Nick Hill
a657bfc48a
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
Jee Li
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Prashant Gupta
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill
81661da7b2
[BugFix] Fix `min_tokens` when `eos_token_id` is None ( #4389 )
...
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin
dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted ( #4363 )
2024-04-27 09:48:37 -07:00
Roy
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Cyrus Leung
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
SangBin Cho
603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill ( #4309 )
2024-04-26 13:02:02 +00:00
Cyrus Leung
a74dee9b62
[Bugfix] Fix parameter name in `get_tokenizer` ( #4107 )
2024-04-25 19:10:48 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
youkaichao
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl ( #4248 )
2024-04-23 18:32:19 -07:00
Cyrus Leung
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened ( #4292 )
2024-04-23 18:19:03 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. ( #3951 )
2024-04-23 08:02:36 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
Ayush Rautwar
138485a82d
[Bugfix] Add fix for JSON whitespace ( #4189 )
...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Jee Li
d17c8477f1
[Bugfix] Fix LoRA loading check ( #4138 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
youkaichao
8a7a3e4436
[Core] add an option to log every function call to for debugging hang/crash in distributed inference ( #4079 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Michał Moskal
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill ( #4128 )
2024-04-18 00:51:28 -07:00
Michael Goin
53b018edcb
[Bugfix] Get available quantization methods from quantization registry ( #4098 )
2024-04-18 00:21:55 -07:00
youkaichao
6dc1fc9cfe
[Core] nccl integrity check and test ( #4155 )
...
[Core] Add integrity check during initialization; add test for it (#4155 )
2024-04-17 22:28:52 -07:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support ( #3868 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho
4e7ee664e2
[Core] Fix engine-use-ray broken ( #4105 )
2024-04-16 05:24:53 +00:00
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` ( #3476 )
2024-04-13 17:13:01 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
SangBin Cho
36729bac13
[Test] Test multiple attn backend for chunked prefill. ( #4023 )
2024-04-12 09:56:57 -07:00
Jee Li
1096717ae9
[Core] Support LoRA on quantized models ( #4012 )
2024-04-11 21:02:44 -07:00
Nick Hill
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids ( #3672 )
2024-04-11 15:34:12 -07:00
Antoni Baum
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
Dylan Hawk
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Antoni Baum
a10d3056da
[Core] Set `linear_weights` directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
Kunshang Ji
e9da5a40c6
[Misc] Add indirection layer for custom ops ( #3913 )
2024-04-10 20:26:07 -07:00
SangBin Cho
e42df7227d
[Test] Add xformer and flash attn tests ( #3961 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-11 03:09:50 +00:00
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
Travis Johnson
0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty ( #3876 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文
b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None ( #3899 )
2024-04-10 00:09:36 -07:00
Jee Li
11dd6ebb89
[Misc] Avoid loading incorrect LoRA config ( #3777 )
2024-04-09 19:47:15 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
youkaichao
95baec828f
[Core] enable out-of-tree model register ( #3871 )
2024-04-06 17:11:41 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
Cade Daniel
e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup ( #3863 )
2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser
aabe8f40f2
[Core] [Frontend] Make detokenization optional ( #3749 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
Michael Feil
537ee25f43
[Core] Enable hf_transfer by default if available ( #3817 )
2024-04-04 04:02:43 +00:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
Nick Hill
49782fcb76
[Misc] Some minor simplifications to detokenization logic ( #3670 )
...
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Robert Shaw
563c1d7ec5
[CI/Build] Make Marlin Tests Green ( #3753 )
2024-03-30 19:18:34 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Roy
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
youkaichao
756b30a5f3
[Core][Test] move local_rank to the last arg with default value( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711 )
2024-03-28 21:19:45 -07:00
SangBin Cho
26422e477b
[Test] Make model tests run again and remove --forked from pytest ( #3631 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
Roger Wang
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark ( #3277 )
2024-03-27 13:39:26 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency ( #3625 )
2024-03-27 00:33:26 -07:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
Jee Li
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
Simon Mo
f408d05c52
hotfix isort on logprobs ranks pr ( #3622 )
2024-03-25 11:55:46 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops ( #3450 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter ( #3124 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs ( #3516 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
youkaichao
837e185142
[CI/Build] fix flaky test ( #3602 )
2024-03-24 17:43:05 -07:00
youkaichao
8b268a46a7
[CI] typo fix: is_hip --> is_hip() ( #3595 )
2024-03-24 16:03:06 -07:00
Nick Hill
41deac4a3d
[BugFix] 1D query fix for MoE models ( #3597 )
2024-03-24 16:00:16 -07:00
Antoni Baum
bfdb1ba5c3
[Core] Improve detokenization performance for prefill ( #3469 )
...
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00
Roy
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 ( #3554 )
2024-03-22 00:18:58 +00:00
Roy
f1c0fc3919
Migrate `logits` computation and gather to `model_runner` ( #3233 )
2024-03-20 23:25:01 +00:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
Antoni Baum
426ec4ec67
[1/n] Triton sampling kernel ( #3186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon
5ee14494e4
[Misc] Remove cache stream and cache events ( #3461 )
2024-03-20 00:38:53 -07:00
ElizaWszola
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled ( #3357 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw
097aa0ea22
[CI/Build] Fix Bad Import In Test ( #3473 )
2024-03-18 20:28:00 +00:00
Simon Mo
120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar ( #3211 )
2024-03-16 13:35:27 -07:00
simon-mo
ad50bf4b25
fix lint
2024-03-15 22:23:38 -07:00
Tao He
3123f15138
Fixes the incorrect argument in the prefix-prefill test cases ( #3246 )
2024-03-15 20:58:10 -07:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization ( #2879 )
2024-03-15 23:37:01 +00:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray ( #3301 )
2024-03-10 19:17:16 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug ( #3239 )
2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00
Nick Hill
8999ec3c16
Store `eos_token_id` in `Sequence` for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust ( #3015 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from `aioprometheus` to `prometheus_client` ( #2730 )
2024-02-25 11:54:00 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in `counter_generation_tokens` ( #2802 )
2024-02-22 14:00:12 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
Nick Hill
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput ( #2876 )
2024-02-20 21:55:57 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test ( #2930 )
2024-02-20 14:37:39 -08:00
Ronen Schaffer
e433c115bc
Fix `vllm:prompt_tokens_total` metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs
no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for `LLM` class ( #2882 )
2024-02-14 22:17:44 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
Hongxia Yang
56f738ae9b
[ROCm] Fix some kernels failed unit tests ( #2498 )
2024-02-05 14:25:36 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded `device="cuda" ` to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz
d0d93b92b1
Add unit test for Mixtral MoE layer ( #2677 )
2024-01-31 14:34:17 -08:00
Philipp Moritz
89efcf1ce5
[Minor] Fix test_cache.py CI test failure ( #2684 )
2024-01-31 10:12:11 -08:00
Vladimir
4f65af0e25
Add swap_blocks unit tests ( #2616 )
2024-01-30 09:30:50 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Simon Mo
3a7dd7e367
Support Batch Completion in Server ( #2529 )
2024-01-24 17:11:07 -08:00
Nikola Borisov
3209b49033
[Bugfix] fix crash if max_tokens=None ( #2570 )
2024-01-23 22:38:55 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu
7a0b011dd5
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py ( #2553 )
2024-01-22 14:47:25 -08:00
Cade Daniel
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model ( #2424 )
2024-01-21 16:31:47 -08:00
Zhuohan Li
ef9b636e2d
Simplify broadcast logic for control messages ( #2501 )
2024-01-19 11:23:30 -08:00
Simon Mo
dd7e8f5f64
refactor complemention api for readability ( #2499 )
2024-01-18 16:45:14 -08:00
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support ( #1669 )
...
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour
14cc317ba4
OpenAI Server refactoring ( #2360 )
2024-01-16 21:33:14 -08:00
Hyunsung Lee
e1957c6ebd
Add StableLM3B model ( #2372 )
2024-01-16 20:32:40 -08:00
Simon Mo
6e01e8c1c8
[CI] Add Buildkite ( #2355 )
2024-01-14 12:37:58 -08:00
陈序
218dc2ccda
Aligning `top_p` and `top_k` Sampling ( #1885 )
...
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
2024-01-12 22:51:03 +01:00
Cade Daniel
79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler ( #2336 )
2024-01-09 15:38:41 -08:00
Woosuk Kwon
941767127c
Revert the changes in test_cache ( #2335 )
2024-01-03 17:32:05 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
Zhuohan Li
358c328d69
[BUGFIX] Fix communication test ( #2285 )
2023-12-27 17:18:11 -05:00
Zhuohan Li
4aaafdd289
[BUGFIX] Fix the path of test prompts ( #2273 )
2023-12-26 10:37:21 -08:00
Zhuohan Li
66b108d142
[BUGFIX] Fix API server test ( #2270 )
2023-12-26 10:37:06 -08:00
avideci
de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct ( #2062 )
2023-12-19 02:29:33 -08:00
Woosuk Kwon
f8c688d746
[Minor] Add Phi 2 to supported models ( #2159 )
2023-12-17 02:54:57 -08:00
Woosuk Kwon
f1c8520146
[BugFix] Fix input positions for long context with sliding window ( #2088 )
2023-12-13 12:28:13 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. ( #1997 )
...
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Woosuk Kwon
cd3aa153a4
Fix broken worker test ( #1900 )
2023-12-02 22:17:33 -08:00
Woosuk Kwon
9b294976a2
Add PyTorch-native implementation of custom layers ( #1898 )
2023-12-02 21:18:40 -08:00
Woosuk Kwon
5f09cbdb63
Fix broken sampler tests ( #1896 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-02 16:06:17 -08:00
Adam Brusselback
66785cc05c
Support chat template and `echo` for chat API ( #1756 )
2023-11-30 16:43:13 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from `pylint` to `ruff` ( #1665 )
2023-11-20 11:58:01 -08:00
Zhuohan Li
20d0699d49
[Fix] Fix comm test ( #1691 )
2023-11-16 16:28:39 -08:00
maximzubkov
521b35f799
Support Microsoft Phi 1.5 ( #1664 )
2023-11-16 14:28:39 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step ( #1666 )
2023-11-16 13:11:41 -08:00
Yanming W
8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs ( #1546 )
2023-11-08 14:19:12 -08:00
Noam Gat
555bdcc5a3
Added logits processor API to sampling params ( #1469 )
2023-11-03 14:12:15 -07:00
Cade Daniel
e575df33b1
[Small] Formatter only checks lints in changed files ( #1528 )
2023-10-31 15:39:38 -07:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops ( #1514 )
2023-10-31 15:19:30 -07:00
Woosuk Kwon
9524867701
Add Mistral 7B to `test_models` ( #1366 )
2023-10-16 17:49:54 -07:00
Woosuk Kwon
d3a5bd9fb7
Fix sampler test ( #1379 )
2023-10-16 12:57:26 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs ( #1328 )
...
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 ( #1348 )
2023-10-16 00:59:57 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic ( #1181 )
2023-10-02 15:36:09 -07:00
Woosuk Kwon
6f88f762bf
Fix OOM in attention kernel test ( #1223 )
2023-09-28 14:33:24 -07:00
Antoni Baum
cf5cb1e33e
Allocate more shared memory to attention kernel ( #1154 )
2023-09-26 22:27:13 -07:00
Zhuohan Li
947b794146
[Sampler] Vectorized sampling (simplified) ( #1048 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum
ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task ( #1059 )
2023-09-17 00:29:08 -07:00
Antoni Baum
dd54a4b026
Fix detokenization leaving special tokens ( #1044 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-14 16:37:03 -07:00
Antoni Baum
9841d48a10
Use TGI-like incremental detokenization ( #984 )
2023-09-13 13:38:01 -07:00
Woosuk Kwon
e67b4f2c2a
Use FP32 in RoPE initialization ( #1004 )
...
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Antoni Baum
080438477f
Start background task in `AsyncLLMEngine.generate` ( #988 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li
db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel ( #945 )
...
* [FIX] Fix Alibi implementation in PagedAttention kernel
* Fix test_attention
* Fix
---------
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Antoni Baum
c07ece5ca4
Make `AsyncLLMEngine` more robust & fix batched abort ( #969 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon
320a622ec4
[BugFix] Implement RoPE for GPT-J ( #941 )
2023-09-06 11:54:33 +09:00
Antoni Baum
c9927c1a6a
Use queue for finished requests ( #957 )
2023-09-05 19:27:23 -07:00
Woosuk Kwon
fbd80ad409
Clean up kernel unit tests ( #938 )
2023-09-05 16:57:38 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate ( #857 )
2023-09-04 17:29:42 -07:00
Woosuk Kwon
32b6816e55
Add tests for models ( #922 )
2023-09-01 11:19:43 +09:00
Aman Gupta Karmani
75471386de
use flash-attn via xformers ( #877 )
2023-08-29 21:52:13 -07:00
Woosuk Kwon
d64bf1646c
Implement approximate GELU kernels ( #828 )
2023-08-23 07:43:21 +09:00
Tao Peng
d7a1c6d614
Fix paged attention testing. ( #495 )
...
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song
bda41c70dd
hotfix attn alibi wo head mapping ( #496 )
...
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu
c894836108
[Model] Add support for GPT-J ( #226 )
...
Co-authored-by: woWoosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon
e41f06702c
Add support for BLOOM ( #331 )
2023-07-03 13:12:35 -07:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00
Woosuk Kwon
e38074b1e6
Support FP32 ( #141 )
2023-06-07 00:40:21 -07:00
Woosuk Kwon
a283ec2eec
Add contributing guideline and mypy config ( #122 )
2023-05-23 17:58:51 -07:00
Woosuk Kwon
825d8892b5
Use pytest format for unit tests ( #107 )
2023-05-17 17:11:23 -07:00
Woosuk Kwon
c9d5b6d4a8
Replace FlashAttention with xformers ( #70 )
2023-05-05 02:01:08 -07:00
Woosuk Kwon
436e523bf1
Refactor attention kernels ( #53 )
2023-05-03 13:40:13 -07:00
Woosuk Kwon
a96d63c21d
Add support for GPT-NeoX (Pythia) ( #50 )
2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang
e3cec88aa5
Memcpy kernel for flash attention ( #29 )
...
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon
b9926f7f66
Support block size 32 ( #35 )
2023-04-09 23:07:18 -07:00
Woosuk Kwon
c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script ( #27 )
...
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon
0f40557af6
Implement block copy kernel to optimize beam search ( #32 )
2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang
21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts ( #24 )
2023-04-04 20:34:46 -07:00
Woosuk Kwon
897cb2ae28
Optimize data movement ( #20 )
2023-04-02 00:30:17 -07:00
Woosuk Kwon
09e9245478
Add custom kernel for RMS normalization ( #16 )
2023-04-01 00:51:22 +08:00
Woosuk Kwon
88c0268a18
Implement custom kernel for LLaMA rotary embedding ( #14 )
2023-03-30 11:04:21 -07:00
Woosuk Kwon
a1b3de86cd
Refactor the test code for attention kernels ( #13 )
2023-03-29 18:59:27 -07:00
Woosuk Kwon
3e9f991d6a
Use FlashAttention for `multi_query_kv_attention` ( #4 )
2023-03-01 21:13:08 -08:00
Woosuk Kwon
0deacbce6e
Implement `single_query_cached_kv_attention` kernel ( #3 )
2023-03-01 15:02:19 -08:00
Woosuk Kwon
af68ec1c5c
Add tests for kernels
2023-02-18 19:23:07 +00:00