Commit Graph

332 Commits

Author SHA1 Message Date
Mor Zusman 7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Roger Wang 70d268a399
[Bugfix] Fix ITL recording in serving benchmark (#7372) 2024-08-09 10:00:00 -07:00
Luka Govedič 8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lucas Wilkinson a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath 35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath 766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev 75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Joe 14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Cyrus Leung 739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Woosuk Kwon a9a2e74d21
[Misc] Use `torch.Tensor` for type annotation (#6505) 2024-07-17 13:01:10 +00:00
Michael Goin 978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) 2024-07-16 15:31:32 -07:00
Fish ccb20db8bd
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428) 2024-07-14 19:27:01 -07:00
Ethan Xu dbfe254eda
[Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
Kuntai Du a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362) 2024-07-11 13:28:38 -07:00
Robert Shaw b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Roger Wang c4774eb841
[Bugfix] Fix snapshot download in serving benchmark (#6318) 2024-07-11 07:04:05 +00:00
Haichuan 717f4bcea0
Feature/add benchmark testing (#5947)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-08 07:52:06 +00:00
Haichuan 333306a252
add benchmark for fix length input and output (#5857)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-07 07:42:13 +00:00
Alexander Matveev 3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 2024-07-01 20:10:37 -07:00
James Whedbee e373853e12
[Frontend] Relax api url assertion for openai benchmarking (#6046) 2024-07-01 23:39:10 +00:00
zhyncs bb60326836
[Misc] update benchmark backend for scalellm (#6018) 2024-07-01 10:20:33 -07:00
mcalman c4bca740e8
[Bugfix] fix missing last itl in openai completions benchmark (#5926) 2024-06-29 10:34:42 +08:00
Ilya Lavrenov 57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Woo-Yeon Lee 2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
Michael Goin 8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
DearPlanet d8714530d1
[Misc]Add param max-model-len in benchmark_latency.py (#5629) 2024-06-19 18:19:08 +08:00
Tyler Michael Smith 6820724e51
[Bugfix] Fix w8a8 benchmarks for int8 case (#5643) 2024-06-19 00:33:25 +00:00
Ronen Schaffer 7879f24dcc
[Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Kuntai Du 9e4e6fe207
[CI] the readability of benchmarking and prepare for dashboard (#5571)
[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571)
2024-06-17 11:41:08 -07:00
Kunshang Ji 728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
zhyncs 1f12122b17
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588) 2024-06-17 09:40:35 -07:00
Cody Yu e2b85cf86a
Fix w8a8 benchmark and add Llama-3-8B (#5562) 2024-06-17 06:48:06 +00:00
Cyrus Leung 0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Allen.Dou d74674bbd9
[Misc] Fix arg names (#5524) 2024-06-14 09:47:44 -07:00
Kuntai Du 319ad7f1d3
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-06-13 22:36:20 -07:00
Tyler Michael Smith 85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Wang, Yi 88407532e7
[Bugfix]if the content is started with ":"(response of ping), client should i… (#5303)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 20:16:41 -07:00
Woosuk Kwon 1a8bfd92d5
[Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Benjamin Kitor b3376e5c76
[Misc] Add args for selecting distributed executor to benchmarks (#5335) 2024-06-08 09:20:16 +08:00
Philipp Moritz 51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) 2024-06-05 10:59:14 -07:00
Tyler Michael Smith 02cc3b51a7
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) 2024-06-05 10:17:51 -07:00
Woosuk Kwon 27208be66e
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) 2024-06-04 09:58:47 -07:00
Woosuk Kwon 3a434b07ed
[Kernel] Enhance MoE benchmarking & tuning script (#4921) 2024-06-03 20:06:59 -07:00
Varun Sundar Rabindranath f081c3ce4b
[Kernel] Update Cutlass fp8 configs (#5144)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Cody Yu e9899fb7a4
[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) 2024-05-31 14:29:19 -07:00
SnowDist a22dea54d3
[Model] Support MAP-NEO model (#5081)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Marut Pandya 616e600e0b
[Misc] add gpu_memory_utilization arg (#5079)
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
2024-05-28 17:16:18 -07:00
Cyrus Leung 5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Roger Wang f17a1a8f96
[Misc] Make Serving Benchmark More User-friendly (#5044) 2024-05-25 17:28:16 +00:00
Alexander Matveev 6066253296
Marlin 24 prefill performance improvement (about 25% better on average) (#4983) 2024-05-23 02:39:27 -04:00
Cody Yu a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Kuntai Du c3af44722c
[Doc]Add documentation to benchmarking script when running TGI (#4920) 2024-05-20 20:16:57 +00:00
Simon Mo f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput (#4848) 2024-05-16 10:02:56 -07:00
alexm-nm 5c342570d7
Add marlin unit tests and marlin benchmark script (#4815) 2024-05-16 09:36:49 -04:00
Cody Yu 973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Kuntai Du ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696) 2024-05-14 21:34:33 +09:00
SangBin Cho 3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
Philipp Moritz 24bb4fe432
[Kernel] Update fused_moe tuning script for FP8 (#4457)
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes since that improved performance and brought the benchmarks on par with the numbers before in that regime to make sure this is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency 
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
2024-05-01 11:47:38 -07:00
leiwen83 24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
Kunshang Ji f4bc4de1b1
[Core]refactor aqlm quant ops (#4351) 2024-04-25 15:03:56 -04:00
zifeitong a395a638c2
[Misc] Use public API in benchmark_throughput (#4300) 2024-04-24 21:10:24 +00:00
Roger Wang 7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
Elinx fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend (#4137)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-17 11:07:23 +00:00
Michael Feil c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup (#3816)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
Zedong Peng c013d32c75
[Benchmark] Add cpu options to bench scripts (#3915) 2024-04-09 21:30:03 -07:00
youkaichao e4be7d70bb
[CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) 2024-04-06 21:32:30 +00:00
TianYu GUO b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 09:56:22 +00:00
Chang Su 819a309c0f
[Bugfix] Fix args in benchmark_serving (#3836)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 07:41:05 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Roger Wang ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768) 2024-04-01 15:24:30 -07:00
Yile (Michael) Gu 98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628) 2024-03-28 17:33:52 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
AmadeusChan 1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) 2024-03-27 13:39:05 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
Ronen Schaffer 14e3f9a1b2
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958) 2024-03-15 21:01:30 -07:00
youkaichao 8fe8386591
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) 2024-03-14 08:11:48 +00:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
TianYu GUO 1ece1ae829
[Minor Fix] Fix comments in benchmark_serving (#3252) 2024-03-07 22:22:59 -08:00
Chen Wang 9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis (#2992)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-04 15:51:56 -08:00
Allen.Dou 9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
2024-03-04 10:37:58 -08:00
TianYu GUO 901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) 2024-03-03 22:48:27 -08:00
Philipp Moritz 17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Zhuohan Li 996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) 2024-03-03 14:37:18 -08:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Philipp Moritz cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Ronen Schaffer d7f396486e
Update comment (#2934) 2024-02-21 18:18:37 -08:00
Roger Wang a4211a4dc3
Serving Benchmark Refactoring (#2433) 2024-02-12 22:53:00 -08:00
Woosuk Kwon 72d3a30c63
[Minor] Fix benchmark_latency script (#2765) 2024-02-05 12:45:37 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda" ` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Simon Mo 1e4277d2d1
lint: format all python file instead of just source code (#2567) 2024-01-23 15:53:06 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Harry Mellor 63e835cbcc
Fix progress bar and allow HTTPS in `benchmark_serving.py` (#2552) 2024-01-22 14:40:31 -08:00
Harry Mellor 2709c0009a
Support OpenAI API server in `benchmark_serving.py` (#2172) 2024-01-18 20:34:08 -08:00
Woosuk Kwon 37ca558103
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Woosuk Kwon 5dd80d3777
Fix latency benchmark script (#2035) 2023-12-11 11:19:08 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Antoni Baum 05ff90b692
Save pytorch profiler output for latency benchmark (#1871)
* Save profiler output

* Apply feedback from code review
2023-12-05 20:55:55 -08:00
aisensiy 8d8c2f6ffe
Support max-model-len argument for throughput benchmark (#1858) 2023-11-30 08:10:24 -08:00
Woosuk Kwon 51d3cb951d
Remove max_num_seqs in latency benchmark script (#1855) 2023-11-30 00:00:32 -08:00
Woosuk Kwon e74b1736a1
Add profile option to latency benchmark script (#1839) 2023-11-29 23:42:52 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo 5ffc0d13a2
Migrate linter from `pylint` to `ruff` (#1665) 2023-11-20 11:58:01 -08:00
Zhuofan dcc543a298
[Minor] Fix comment (#1704) 2023-11-17 09:42:49 -08:00
Woosuk Kwon 660a7fcfa4
Add DeepSpeed MII backend to benchmark script (#1649) 2023-11-14 12:35:30 -08:00
chooper1 1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Antoni Baum acbed3ef40
Use monotonic time where appropriate (#1249) 2023-10-02 19:22:05 -07:00
kg6-sleipnir b5a10eb0ef
Added `dtype` arg to benchmarks (#1228) 2023-09-30 21:04:03 -07:00
Woosuk Kwon e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Ricardo Lu 8c4b2592fb
fix: enable trust-remote-code in api server & benchmark. (#509) 2023-07-19 17:06:15 -07:00
WRH cf21a9bd5c
support trust_remote_code in benchmark (#518) 2023-07-19 17:02:40 -07:00
Woosuk Kwon 4338cc4750
[Tokenizer] Add an option to specify tokenizer (#284) 2023-06-28 09:46:58 -07:00
Zhuohan Li 43710e8d09
[Fix] Fix default port number in benchmark scripts (#265) 2023-06-26 13:15:35 -07:00
Zhuohan Li 0370afa2e5
Remove benchmark_async_llm_server.py (#155) 2023-06-19 11:12:37 +08:00
Woosuk Kwon 3f92038b99
Add comments on swap space (#154) 2023-06-18 11:39:35 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Zhuohan Li e5464ee484
Rename servers to engines (#152) 2023-06-17 17:25:21 +08:00
Woosuk Kwon bab8f3dd0d
[Minor] Fix benchmark_throughput.py (#151) 2023-06-16 21:00:52 -07:00
Zhuohan Li eedb46bf03
Rename servers and change port numbers to reduce confusion (#149) 2023-06-17 00:13:02 +08:00
Woosuk Kwon 311490a720
Add script for benchmarking serving throughput (#145) 2023-06-14 19:55:38 -07:00
Zhuohan Li 1a956e136b
Fix various issues of async servers (#135) 2023-06-05 23:44:50 +08:00
Woosuk Kwon 8274ca23ac
Add docstrings for LLM (#137) 2023-06-04 12:52:41 -07:00
Woosuk Kwon 211318d44a
Add throughput benchmarking script (#133) 2023-05-28 03:20:05 -07:00