Commit Graph

2143 Commits

Danny Guinther b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
sasha0552 c47ba4aaa9
[Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
Nick Hill a657bfc48a
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
leiwen83 24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83 b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho 6f1df80436
[Test] Add ignore_eos test (#4519) 2024-05-01 08:45:42 -04:00
Jee Li d6f4bd7cdd
[Misc] Add customized information for models (#4132) 2024-04-30 21:18:14 -07:00
Robert Caulk c3845d82dc
Allow user to define whitespace pattern for outlines (#4305) 2024-04-30 20:48:39 -07:00
Florian Greinacher a494140433
[Frontend] Support complex message content for chat completions endpoint (#3467)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw 111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83 4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) 2024-04-29 13:52:22 -07:00
Robert Shaw 73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Prashant Gupta d6e520e170
[Core] Support offline use of local cache for models (#4374)
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill 81661da7b2
[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389)
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) 2024-04-27 09:48:37 -07:00
Roy 7134303cbb
[Bugfix][Core] Fix get decoding config from ray (#4335) 2024-04-27 11:30:08 +00:00
Austin Veselka eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Cyrus Leung 8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) 2024-04-27 05:08:24 +00:00
Cody Yu a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method (#4373) 2024-04-26 16:41:14 -04:00
SangBin Cho 603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) 2024-04-26 13:02:02 +00:00
Cyrus Leung a74dee9b62
[Bugfix] Fix parameter name in `get_tokenizer` (#4107) 2024-04-25 19:10:48 -07:00
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao 91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
Cyrus Leung 1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Cade Daniel 62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho 050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
SangBin Cho ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
GeauxEric a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj 91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Cody Yu a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated on every forward pass.

Initial Results:
Currently tested with Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
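
For reference, a minimal PyTorch sketch of the per-tensor dynamic scaling described above (illustrative only, not vLLM's actual Fp8LinearMethod; assumes PyTorch 2.1+ for torch.float8_e4m3fn):

```python
import torch

FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn

def per_tensor_quantize(t: torch.Tensor):
    """Quantize to FP8 with a single per-tensor scale, so t ~= q * scale."""
    scale = t.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

# Weights: quantized once after the FP16/BF16 checkpoint is loaded;
# the scale is stored for reuse.
weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, w_scale = per_tensor_quantize(weight)

# Activations: the scale is recomputed on every forward pass (dynamic).
x = torch.randn(8, 4096, dtype=torch.bfloat16)
x_fp8, x_scale = per_tensor_quantize(x)

# Reference matmul: dequantize and rescale (a real kernel would run the
# GEMM in FP8, e.g. via torch._scaled_mm or cuBLASLt, instead).
out = (x_fp8.to(torch.bfloat16) * x_scale) @ (w_fp8.to(torch.bfloat16) * w_scale).t()
```
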
Ayush Rautwar 138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Jee Li d17c8477f1
[Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
youkaichao 8a7a3e4436
[Core] add an option to log every function call for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
James Whedbee e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Michał Moskal e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) 2024-04-18 00:51:28 -07:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
youkaichao 6dc1fc9cfe
[Core] Add nccl integrity check during initialization and a test for it (#4155)
2024-04-17 22:28:52 -07:00
Shoichi Uchinami a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
youkaichao 8438e0569e
[Core] Replace the narrow-usage RayWorkerVllm with a general WorkerWrapper to reduce code duplication (#4024)
2024-04-17 08:34:33 +00:00
Cade Daniel e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Antoni Baum 69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Noam Gat 05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho 4e7ee664e2
[Core] Fix engine-use-ray broken (#4105) 2024-04-16 05:24:53 +00:00
Sanger Steel 711a000255
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) 2024-04-13 17:13:01 -07:00
Jee Li 989ae2538d
[Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
SangBin Cho 36729bac13
[Test] Test multiple attn backend for chunked prefill. (#4023) 2024-04-12 09:56:57 -07:00
Jee Li 1096717ae9
[Core] Support LoRA on quantized models (#4012) 2024-04-11 21:02:44 -07:00
Nick Hill e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids (#3672) 2024-04-11 15:34:12 -07:00
Antoni Baum 1e96c3341a
Add extra punica sizes to support bigger vocabs (#4015) 2024-04-11 22:18:57 +00:00
Dylan Hawk 95e7d4a97c
Fix echo/logprob OpenAI completion bug (#3441)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Antoni Baum a10d3056da
[Core] Set `linear_weights` directly on the layer (#3977) 2024-04-11 16:35:51 -04:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho e42df7227d
[Test] Add xformer and flash attn tests (#3961)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-11 03:09:50 +00:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
youkaichao 63e7176f26
[Core][Refactor] Move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
Travis Johnson 0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文 b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) 2024-04-10 00:09:36 -07:00
Jee Li 11dd6ebb89
[Misc] Avoid loading incorrect LoRA config (#3777) 2024-04-09 19:47:15 -07:00
Cade Daniel e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
youkaichao 95baec828f
[Core] enable out-of-tree model register (#3871) 2024-04-06 17:11:41 -07:00
SangBin Cho 18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) 2024-04-05 10:17:58 -07:00
Cade Daniel e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup (#3863) 2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser aabe8f40f2
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
Michael Feil 537ee25f43
[Core] Enable hf_transfer by default if available (#3817) 2024-04-04 04:02:43 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho 3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling (#3550) 2024-04-03 14:13:49 -07:00
Cade Daniel 5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
Cade Daniel eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) 2024-04-02 00:49:51 +00:00
Qubitium 7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format (#3689)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel 93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) 2024-04-01 22:55:24 +00:00
Nick Hill 49782fcb76
[Misc] Some minor simplifications to detokenization logic (#3670)
Some simplifications were made for clarity.

Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Robert Shaw 563c1d7ec5
[CI/Build] Make Marlin Tests Green (#3753) 2024-03-30 19:18:34 -07:00
mawong-amd b6d103542c
[Kernel] Layernorm performance optimization (#3662) 2024-03-30 14:26:38 -07:00
Roy f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error (#3731) 2024-03-29 09:38:21 -07:00
Roy 6110c39dc8
[BugFix] Fix tokenizer out of vocab size (#3685) 2024-03-29 08:18:59 -07:00
youkaichao 756b30a5f3
[Core][Test] Move local_rank to the last arg with a default value to keep the API compatible (#3711)
2024-03-28 21:19:45 -07:00
SangBin Cho 26422e477b
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
Roy 515386ef3c
[Core] Support multi-node inference (eager and cuda graph) (#3686) 2024-03-28 15:01:55 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Cade Daniel 14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
youkaichao 8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
Jee Li 566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Jee Li 8af890a865
Enable more models to perform inference with LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010 64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Simon Mo f408d05c52
hotfix isort on logprobs ranks pr (#3622) 2024-03-25 11:55:46 -07:00
Dylan Hawk 0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh 819924e749
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Woosuk Kwon 925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
youkaichao 837e185142
[CI/Build] fix flaky test (#3602) 2024-03-24 17:43:05 -07:00
youkaichao 8b268a46a7
[CI] typo fix: is_hip --> is_hip() (#3595) 2024-03-24 16:03:06 -07:00
Nick Hill 41deac4a3d
[BugFix] 1D query fix for MoE models (#3597) 2024-03-24 16:00:16 -07:00
Antoni Baum bfdb1ba5c3
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell cf2f084d56
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Zhuohan Li e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy ea5f14e6ff
[Bugfix][Model] Fix Qwen2 (#3554) 2024-03-22 00:18:58 +00:00
Roy f1c0fc3919
Migrate `logits` computation and gather to `model_runner` (#3233) 2024-03-20 23:25:01 +00:00
SangBin Cho 6e435de766
[1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum 426ec4ec67
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon 5ee14494e4
[Misc] Remove cache stream and cache events (#3461) 2024-03-20 00:38:53 -07:00
ElizaWszola 9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw 097aa0ea22
[CI/Build] Fix Bad Import In Test (#3473) 2024-03-18 20:28:00 +00:00
Simon Mo 120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) 2024-03-16 13:35:27 -07:00
simon-mo ad50bf4b25 fix lint 2024-03-15 22:23:38 -07:00
Tao He 3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
Antoni Baum fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
陈序 54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria 49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
Zhuohan Li 4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Roy 9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Terry 0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Cade Daniel 8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
ElizaWszola b35cc93420
Fix auto prefix bug (#3239) 2024-03-07 16:37:28 -08:00
jacobthebanana 8cbba4622c
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) 2024-03-07 23:03:22 +00:00
Woosuk Kwon 2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Cade Daniel a33ce60c66
[Testing] Fix core tests (#3224) 2024-03-06 01:04:23 -08:00
SangBin Cho 24aecf421a
[Tests] Add block manager and scheduler tests (#3108) 2024-03-05 18:23:34 -08:00
Nick Hill 8999ec3c16
Store `eos_token_id` in `Sequence` for easy access (#3166) 2024-03-05 15:35:43 -08:00
Antoni Baum ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum 22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555 703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Woosuk Kwon 929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen 3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He 71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore 70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor ef978fe411
Port metrics from `aioprometheus` to `prometheus_client` (#2730) 2024-02-25 11:54:00 -08:00
Ronen Schaffer 4caf7044e0
Include tokens from prompt phase in `counter_generation_tokens` (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Nick Hill 7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Antoni Baum 017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Zhuohan Li 63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
Ronen Schaffer e433c115bc
Fix `vllm:prompt_tokens_total` metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Isotr0py ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
jvmncs 8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list three separate entries when the user queries `/models`: one for the base served model, and one for each specified LoRA module. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
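
A quick way to inspect the served model list for the server above (a sketch assuming the OpenAI-compatible server is running on the default port 8000):

```python
import requests

# Lists the served models: the base model plus one entry per LoRA module.
resp = requests.get("http://localhost:8000/v1/models")
for model in resp.json()["data"]:
    print(model["id"])  # e.g. meta-llama/Llama-2-7b-hf, sql-lora, sql-lora2
```
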
Woosuk Kwon d7afab6d3a
[BugFix] Fix GC bug for `LLM` class (#2882) 2024-02-14 22:17:44 -08:00
Terry 2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Hongxia Yang 56f738ae9b
[ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz 89efcf1ce5
[Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir 4f65af0e25
Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng 5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou 380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Simon Mo 3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Nikola Borisov 3209b49033
[Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu 7a0b011dd5
Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
Cade Daniel 18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Zhuohan Li ef9b636e2d
Simplify broadcast logic for control messages (#2501) 2024-01-19 11:23:30 -08:00
Simon Mo dd7e8f5f64
refactor completion api for readability (#2499) 2024-01-18 16:45:14 -08:00
shiyi.c_98 d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour 14cc317ba4
OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00
Hyunsung Lee e1957c6ebd
Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon Mo 6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
陈序 218dc2ccda
Aligning `top_p` and `top_k` Sampling (#1885)
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
2024-01-12 22:51:03 +01:00
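
For context, a rough sketch of the Hugging Face-style filtering order this change aligns with (illustrative only, not vLLM's actual sampler code):

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    """HF-style filtering: apply top-k first, then top-p on the survivors."""
    if top_k > 0:
        # Keep only the k highest logits; mask the rest to -inf.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        # Drop tokens once the cumulative probability *before* them exceeds
        # top_p, so the first token crossing the threshold is still kept.
        mask = probs.cumsum(dim=-1) - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return logits
```
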
Cade Daniel 79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler (#2336) 2024-01-09 15:38:41 -08:00
Woosuk Kwon 941767127c
Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
Zhuohan Li 358c328d69
[BUGFIX] Fix communication test (#2285) 2023-12-27 17:18:11 -05:00
Zhuohan Li 4aaafdd289
[BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00
Zhuohan Li 66b108d142
[BUGFIX] Fix API server test (#2270) 2023-12-26 10:37:06 -08:00
avideci de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
Woosuk Kwon f8c688d746
[Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon f1c8520146
[BugFix] Fix input positions for long context with sliding window (#2088) 2023-12-13 12:28:13 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads in the attention kernel (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Woosuk Kwon cd3aa153a4
Fix broken worker test (#1900) 2023-12-02 22:17:33 -08:00
Woosuk Kwon 9b294976a2
Add PyTorch-native implementation of custom layers (#1898) 2023-12-02 21:18:40 -08:00
Woosuk Kwon 5f09cbdb63
Fix broken sampler tests (#1896)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-02 16:06:17 -08:00
Adam Brusselback 66785cc05c
Support chat template and `echo` for chat API (#1756) 2023-11-30 16:43:13 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Zhuohan Li 20d0699d49
[Fix] Fix comm test (#1691) 2023-11-16 16:28:39 -08:00
maximzubkov 521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
Simon Mo cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Yanming W 8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) 2023-11-08 14:19:12 -08:00
Noam Gat 555bdcc5a3
Added logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
Cade Daniel e575df33b1
[Small] Formatter only checks lints in changed files (#1528) 2023-10-31 15:39:38 -07:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
Woosuk Kwon 9524867701
Add Mistral 7B to `test_models` (#1366) 2023-10-16 17:49:54 -07:00
Woosuk Kwon d3a5bd9fb7
Fix sampler test (#1379) 2023-10-16 12:57:26 -07:00
Zhuohan Li 9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Zhuohan Li ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon 6f88f762bf
Fix OOM in attention kernel test (#1223) 2023-09-28 14:33:24 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Zhuohan Li 947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum ff36139ffc
Remove AsyncLLMEngine busy loop, shield background task (#1059) 2023-09-17 00:29:08 -07:00
Antoni Baum dd54a4b026
Fix detokenization leaving special tokens (#1044)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-14 16:37:03 -07:00
Antoni Baum 9841d48a10
Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
Woosuk Kwon e67b4f2c2a
Use FP32 in RoPE initialization (#1004)
Co-authored-by: One <imone@tuta.io>
2023-09-11 00:26:35 -07:00
Antoni Baum 080438477f
Start background task in `AsyncLLMEngine.generate` (#988)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-08 00:03:39 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* Fix test_attention
* Fix
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Antoni Baum c07ece5ca4
Make `AsyncLLMEngine` more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Antoni Baum c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Woosuk Kwon fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Zhuohan Li 002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Woosuk Kwon 32b6816e55
Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani 75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Tao Peng d7a1c6d614
Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song bda41c70dd
hotfix attn alibi wo head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon a283ec2eec
Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon 825d8892b5
Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon c9d5b6d4a8
Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon a1b3de86cd
Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00
Woosuk Kwon 3e9f991d6a
Use FlashAttention for `multi_query_kv_attention` (#4) 2023-03-01 21:13:08 -08:00
Woosuk Kwon 0deacbce6e
Implement `single_query_cached_kv_attention` kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon af68ec1c5c Add tests for kernels 2023-02-18 19:23:07 +00:00