Commit Graph

2143 Commits

Author SHA1 Message Date
Terry 2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
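
LoRA augments each frozen weight matrix with a trainable low-rank update, so adapters like this Mixtral one compute `W x + (alpha/r) B A x` on top of the base projection. A minimal PyTorch sketch of the idea (the class and names are illustrative, not vLLM's actual implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W + (alpha/r) B A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```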
Lily Liu fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
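
The fused kernel collapses the MoE router's softmax and top-K selection into a single pass; the unfused math it replaces looks roughly like this plain-PyTorch reference (function name is illustrative):

```python
import torch

def moe_topk_softmax(router_logits: torch.Tensor, top_k: int):
    """Unfused reference: softmax over experts, then pick top-k per token."""
    probs = torch.softmax(router_logits, dim=-1)        # [num_tokens, num_experts]
    topk_weights, topk_ids = probs.topk(top_k, dim=-1)  # both [num_tokens, top_k]
    # Renormalize so each token's selected expert weights sum to 1.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```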
Hongxia Yang 56f738ae9b
[ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz 89efcf1ce5
[Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir 4f65af0e25
Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
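
E5M2 gives 8-bit storage with 5 exponent and 2 mantissa bits, halving KV-cache memory at some precision cost. A rough round-trip sketch (assumes PyTorch ≥ 2.1 for `torch.float8_e5m2`; not vLLM's actual cache code):

```python
import torch

def quantize_kv(x: torch.Tensor) -> torch.Tensor:
    # Store KV-cache entries as 8-bit E5M2 floats.
    return x.to(torch.float8_e5m2)

def dequantize_kv(x: torch.Tensor, dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # Cast back up before the attention matmul.
    return x.to(dtype)

key = torch.randn(4, 8, dtype=torch.float16)
roundtrip = dequantize_kv(quantize_kv(key))
print((key - roundtrip).abs().max())  # the quantization error traded for memory
```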
Hanzhi Zhou 380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
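
Custom all-reduce targets the small, latency-bound tensors of tensor parallelism; the stock path it replaces is an ordinary NCCL all-reduce, roughly (helper name is illustrative):

```python
import torch
import torch.distributed as dist

def tensor_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Stock NCCL path that a custom all-reduce kernel can beat for
    small, latency-bound tensors."""
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
    return x
```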
Simon Mo 3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Nikola Borisov 3209b49033
[Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
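
When a request omits `max_tokens`, a safe fallback is the context budget left after the prompt rather than an unchecked `None`. A sketch of the guard (names are illustrative):

```python
def resolve_max_tokens(max_tokens, prompt_len: int, model_context_len: int) -> int:
    """Fall back to the remaining context window when max_tokens is unset."""
    if max_tokens is None:
        return max(model_context_len - prompt_len, 1)
    return max_tokens
```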
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu 7a0b011dd5
Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
Cade Daniel 18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Zhuohan Li ef9b636e2d
Simplify broadcast logic for control messages (#2501) 2024-01-19 11:23:30 -08:00
Simon Mo dd7e8f5f64
Refactor completion API for readability (#2499) 2024-01-18 16:45:14 -08:00
shiyi.c_98 d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
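
Prefix caching reuses KV-cache blocks across requests that share a prompt prefix; a block is only safe to share when every token before it matches too, so each block can be keyed by the whole prefix ending at it. A dict-based sketch of that keying (a simplification, not vLLM's allocator):

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16
cache: Dict[Tuple[int, ...], int] = {}  # full-prefix key -> physical block id
next_block = 0

def get_or_allocate_blocks(token_ids: List[int]) -> List[int]:
    """Key each full block by the entire prefix ending at that block, so a
    block is shared only when everything before it is identical."""
    global next_block
    blocks = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        key = tuple(token_ids[:end])
        if key not in cache:           # miss: allocate a fresh physical block
            cache[key] = next_block
            next_block += 1
        blocks.append(cache[key])
    return blocks
```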
FlorianJoncour 14cc317ba4
OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00
Hyunsung Lee e1957c6ebd
Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon Mo 6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
陈序 218dc2ccda
Aligning `top_p` and `top_k` Sampling (#1885)
* Align top_p and top_k with huggingface

* remove _get_prompt_and_output_tokens

* rename _apply_top_p_top_k

* compare top_p top_k with hf

* fix test errors
2024-01-12 22:51:03 +01:00
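
The HuggingFace convention this aligns with filters by top-k first, then applies nucleus (top-p) filtering to the survivors. A reference of that order of operations (a sketch, not the vLLM sampler itself):

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    """HF-style filtering: top-k first, then top-p over the survivors."""
    if top_k > 0:
        kth = logits.topk(top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
        cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        sorted_remove = cum_probs > top_p
        # Shift right so the first token crossing the threshold is kept.
        sorted_remove[..., 1:] = sorted_remove[..., :-1].clone()
        sorted_remove[..., 0] = False
        remove = sorted_remove.scatter(-1, sorted_idx, sorted_remove)
        logits = logits.masked_fill(remove, float("-inf"))
    return logits
```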
Cade Daniel 79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler (#2336) 2024-01-09 15:38:41 -08:00
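
Rejection sampling accepts a draft token `x` with probability `min(1, p_target(x) / p_draft(x))` and, on rejection, resamples from the normalized residual `max(p_target - p_draft, 0)`; the output is then distributed exactly as the target model's. A per-token sketch (the optimized sampler vectorizes this across a batch):

```python
import torch

def rejection_sample(p_target: torch.Tensor, p_draft: torch.Tensor,
                     draft_token: int) -> int:
    """Accept with prob min(1, p_t/p_d); else resample from the residual."""
    accept_prob = (p_target[draft_token] / p_draft[draft_token]).clamp(max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    residual = (p_target - p_draft).clamp(min=0.0)
    residual = residual / residual.sum()  # assumes the distributions differ somewhere
    return int(torch.multinomial(residual, 1))
```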
Woosuk Kwon 941767127c
Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
Zhuohan Li 358c328d69
[BUGFIX] Fix communication test (#2285) 2023-12-27 17:18:11 -05:00
Zhuohan Li 4aaafdd289
[BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00
Zhuohan Li 66b108d142
[BUGFIX] Fix API server test (#2270) 2023-12-26 10:37:06 -08:00
avideci de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
Woosuk Kwon f8c688d746
[Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon f1c8520146
[BugFix] Fix input positions for long context with sliding window (#2088) 2023-12-13 12:28:13 -08:00
wbn dacaf5a400
Replace `head_mapping` params with `num_kv_heads` in the attention kernel (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
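
With grouped-query attention, the query-head-to-KV-head assignment is fully determined by `num_kv_heads`, which is why the explicit `head_mapping` array is redundant:

```python
def kv_head_for(query_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Each group of num_heads // num_kv_heads query heads shares one KV head."""
    return query_head // (num_heads // num_kv_heads)

assert kv_head_for(5, num_heads=8, num_kv_heads=2) == 1  # heads 4-7 share KV head 1
```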
Woosuk Kwon cd3aa153a4
Fix broken worker test (#1900) 2023-12-02 22:17:33 -08:00
Woosuk Kwon 9b294976a2
Add PyTorch-native implementation of custom layers (#1898) 2023-12-02 21:18:40 -08:00
Woosuk Kwon 5f09cbdb63
Fix broken sampler tests (#1896)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-02 16:06:17 -08:00
Adam Brusselback 66785cc05c
Support chat template and `echo` for chat API (#1756) 2023-11-30 16:43:13 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Zhuohan Li 20d0699d49
[Fix] Fix comm test (#1691) 2023-11-16 16:28:39 -08:00
maximzubkov 521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
Simon Mo cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Yanming W 8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) 2023-11-08 14:19:12 -08:00
Noam Gat 555bdcc5a3
Added logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
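
A logits processor is a callable that sees the tokens generated so far plus the next-token logits and returns modified logits. A sketch of that shape (the exact vLLM signature may differ; the usage line is illustrative):

```python
import torch
from typing import List

def ban_token_processor(banned_id: int):
    """Build a processor that masks out one token id at every step."""
    def processor(generated_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
        logits[banned_id] = float("-inf")
        return logits
    return processor

# Illustrative usage: SamplingParams(logits_processors=[ban_token_processor(50256)])
```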
Cade Daniel e575df33b1
[Small] Formatter only checks lints in changed files (#1528) 2023-10-31 15:39:38 -07:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
Woosuk Kwon 9524867701
Add Mistral 7B to `test_models` (#1366) 2023-10-16 17:49:54 -07:00
Woosuk Kwon d3a5bd9fb7
Fix sampler test (#1379) 2023-10-16 12:57:26 -07:00
Zhuohan Li 9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
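
PagedAttention reaches the KV cache through a per-sequence block table, so a logical token position resolves to a physical block plus an offset; V2 adds a partitioned softmax reduction on top of the same indirection. The address translation, in miniature (names are illustrative):

```python
from typing import List, Tuple

BLOCK_SIZE = 16

def kv_location(token_idx: int, block_table: List[int]) -> Tuple[int, int]:
    """Logical token position -> (physical block id, offset within block)."""
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```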
Zhuohan Li ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon 6f88f762bf
Fix OOM in attention kernel test (#1223) 2023-09-28 14:33:24 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Zhuohan Li 947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task (#1059) 2023-09-17 00:29:08 -07:00
Antoni Baum dd54a4b026
Fix detokenization leaving special tokens (#1044)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-14 16:37:03 -07:00
Antoni Baum 9841d48a10
Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
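
Decoding each new token id in isolation mangles byte-level BPE and SentencePiece boundaries, so the TGI approach re-decodes against the previous prefix and emits only the newly stabilized text. A simplified sketch (real implementations track offsets to avoid re-decoding the whole sequence):

```python
def incremental_decode(tokenizer, all_ids: list, prefix_len: int) -> str:
    """Emit only the text that appeared since the last decode; hold back
    output while the trailing bytes are still an incomplete character."""
    prev_text = tokenizer.decode(all_ids[:prefix_len])
    full_text = tokenizer.decode(all_ids)
    if full_text.endswith("\ufffd"):  # U+FFFD: byte sequence not yet complete
        return ""
    return full_text[len(prev_text):]
```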
Woosuk Kwon e67b4f2c2a
Use FP32 in RoPE initialization (#1004)
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Antoni Baum 080438477f
Start background task in `AsyncLLMEngine.generate` (#988)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* [FIX] Fix Alibi implementation in PagedAttention kernel

* Fix test_attention

* Fix

---------

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Antoni Baum c07ece5ca4
Make `AsyncLLMEngine` more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Antoni Baum c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Woosuk Kwon fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Zhuohan Li 002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Woosuk Kwon 32b6816e55
Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani 75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
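
The usual approximation such kernels implement is the tanh form of GELU; a plain-PyTorch reference of the same math:

```python
import math
import torch

def gelu_tanh_approx(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * x.pow(3))))
```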
Tao Peng d7a1c6d614
Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song bda41c70dd
Hotfix: ALiBi attention without head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon a283ec2eec
Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon 825d8892b5
Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon c9d5b6d4a8
Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize

* add benchmark

* add assert

* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention

* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
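
RMSNorm scales by the reciprocal root mean square of the hidden dimension and, unlike LayerNorm, subtracts no mean; the custom kernel fuses this reduction. A reference in plain PyTorch:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm: x * rsqrt(mean(x^2) + eps) * weight."""
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```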
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
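
Rotary embeddings rotate channel pairs of queries and keys by a position-dependent angle; the kernel fuses that per-token math. A reference of the rotation itself (cos/sin are precomputed per position; a sketch, not the kernel's layout):

```python
import torch

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate the two channel halves by the per-position angle (GPT-NeoX style)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```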
Woosuk Kwon a1b3de86cd
Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00
Woosuk Kwon 3e9f991d6a
Use FlashAttention for `multi_query_kv_attention` (#4) 2023-03-01 21:13:08 -08:00
Woosuk Kwon 0deacbce6e
Implement `single_query_cached_kv_attention` kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon af68ec1c5c Add tests for kernels 2023-02-18 19:23:07 +00:00