mirror of https://github.com/vllm-project/vllm.git
# LMCache Examples
This folder demonstrates how to use LMCache for disaggregated prefilling, CPU offloading, and KV cache sharing.
## 1. Disaggregated Prefill in vLLM v1
This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.
### Prerequisites
- Install LMCache. You can simply run `pip install lmcache`.
- Install NIXL.
- At least 2 GPUs.
- Valid Hugging Face token (`HF_TOKEN`) for Llama 3.1 8B Instruct.
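These checks can be scripted before launching anything; a minimal sketch (the function names and messages are illustrative, not part of the example):

```python
import os
import shutil
import subprocess


def check_prerequisites(num_gpus: int, hf_token, min_gpus: int = 2) -> list:
    """Return a list of problems found; an empty list means ready to run."""
    problems = []
    # A Hugging Face token is needed to download Llama 3.1 8B Instruct.
    if not hf_token:
        problems.append("HF_TOKEN is not set")
    if num_gpus < min_gpus:
        problems.append(f"need at least {min_gpus} GPUs, found {num_gpus}")
    return problems


def detect_gpus() -> int:
    """Count visible GPUs via nvidia-smi; 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    out = subprocess.run(["nvidia-smi", "--list-gpus"],
                         capture_output=True, text=True)
    return len(out.stdout.splitlines())


if __name__ == "__main__":
    for problem in check_prerequisites(detect_gpus(), os.environ.get("HF_TOKEN")):
        print("missing prerequisite:", problem)
```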
### Usage
Run `cd disagg_prefill_lmcache_v1` to enter the `disagg_prefill_lmcache_v1` folder, and then run `bash disagg_example_nixl.sh` to run disaggregated prefill and benchmark the performance.
### Components
#### Server Scripts
- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefiller and the decoder.
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example.
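The proxy's coordination logic boils down to a two-phase handoff: a request typically goes to the prefiller first with `max_tokens=1`, so it only populates the KV cache, and the decoder then serves the full completion against the transferred cache. A stdlib-only sketch of that flow (the `SendFn` callables stand in for HTTP POSTs to the two vLLM servers; this is not the FastAPI code itself):

```python
from typing import Callable, Dict

# Stand-in for an HTTP POST to a vLLM server (illustrative, not a real client).
SendFn = Callable[[Dict], Dict]


def proxy_request(request: Dict, send_to_prefiller: SendFn,
                  send_to_decoder: SendFn) -> Dict:
    """Two-phase handoff: prefill builds the KV cache, decode generates."""
    # Phase 1: the prefiller runs the prompt but emits at most one token;
    # its real output is the KV cache, handed to the decoder (via NIXL here).
    prefill_request = dict(request, max_tokens=1)
    send_to_prefiller(prefill_request)
    # Phase 2: the decoder reuses the transferred cache and generates the
    # actual completion, which is what the client receives.
    return send_to_decoder(request)
```

With real servers, the two callables would POST to the prefiller's and decoder's completion endpoints; the sketch keeps only the ordering and the `max_tokens=1` trick.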
#### Configuration
- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for the prefiller server.
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for the decoder server.
#### Log Files
The main script generates several log files:
- `prefiller.log` - Logs from the prefill server.
- `decoder.log` - Logs from the decode server.
- `proxy.log` - Logs from the proxy server.
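When a run fails, scanning those three logs for error lines is usually the first triage step; a small helper sketch (the search patterns are illustrative):

```python
from pathlib import Path

LOG_FILES = ("prefiller.log", "decoder.log", "proxy.log")


def find_errors(log_dir: str, needles=("ERROR", "Traceback")) -> dict:
    """Map each log file to its matching lines; absent or clean files are skipped."""
    hits = {}
    for name in LOG_FILES:
        path = Path(log_dir) / name
        if not path.exists():
            continue
        matches = [line.rstrip() for line in path.read_text().splitlines()
                   if any(n in line for n in needles)]
        if matches:
            hits[name] = matches
    return hits
```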
## 2. CPU Offload Examples
- `python cpu_offload_lmcache.py -v v0` - CPU offloading implementation for vLLM v0.
- `python cpu_offload_lmcache.py -v v1` - CPU offloading implementation for vLLM v1.
## 3. KV Cache Sharing
The `kv_cache_sharing_lmcache_v1.py` example demonstrates how to share KV caches between vLLM v1 instances.
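Conceptually, sharing works by keying cached KV blocks on the token prefix that produced them, so any instance that sees the same prefix can reuse blocks another instance already computed. A toy, stdlib-only sketch of that prefix lookup (this is not the LMCache API; real sharing moves GPU KV blocks, not Python objects):

```python
import hashlib


class SharedKVStore:
    """Toy prefix-keyed KV store shared between engine instances (illustrative)."""

    def __init__(self):
        self._blocks = {}

    @staticmethod
    def _key(token_ids: tuple) -> str:
        # Hash the exact token prefix so lookups are O(1) per candidate length.
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def put(self, token_ids, kv_blocks) -> None:
        """Publish KV blocks computed for a token prefix."""
        self._blocks[self._key(tuple(token_ids))] = kv_blocks

    def longest_prefix_hit(self, token_ids):
        """Return (matched_len, kv_blocks) for the longest cached prefix."""
        ids = tuple(token_ids)
        for end in range(len(ids), 0, -1):
            kv = self._blocks.get(self._key(ids[:end]))
            if kv is not None:
                return end, kv
        return 0, None
```

In this toy model, an instance that has already prefilled tokens `[1, 2, 3]` publishes its blocks, and another instance prompted with `[1, 2, 3, 4]` only needs to prefill the final token.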
## 4. Disaggregated Prefill in vLLM v0
The `disaggregated_prefill_lmcache_v0.py` script provides an example of how to run disaggregated prefill in vLLM v0.