afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move `test_utils.py` to `tests/utils.py` ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
youkaichao
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
SangBin Cho
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
ElizaWszola
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled ( #3357 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00