# LMCache Examples

This folder demonstrates how to use LMCache for disaggregated prefill, CPU offloading, and KV cache sharing.

## 1. Disaggregated Prefill in vLLM v1
This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.
### Prerequisites

- Install [LMCache](https://github.com/LMCache/LMCache). You can simply run `pip install lmcache`.
- Install [NIXL](https://github.com/ai-dynamo/nixl).
- At least 2 GPUs.
- A valid Hugging Face token (`HF_TOKEN`) for Llama 3.1 8B Instruct (see the setup sketch below).
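
For convenience, here is a minimal setup sketch covering these prerequisites; the token value is a placeholder, and the exact NIXL install steps live in its repository:

```bash
# Install LMCache from PyPI (see the LMCache repo for version requirements).
pip install lmcache

# NIXL is installed from its own repository; follow https://github.com/ai-dynamo/nixl.

# Llama 3.1 8B Instruct is a gated model, so export a valid Hugging Face token.
export HF_TOKEN=<your_huggingface_token>
```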
### Usage

Change into the example folder and run the main script:

```bash
cd disagg_prefill_lmcache_v1
bash disagg_example_nixl.sh
```

This runs disaggregated prefill and benchmarks the performance.
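
Once the servers are up, you can also send requests through the proxy yourself. Below is a sketch of a completion request; the proxy port (9000 here) and the exact model name are assumptions, so check the launcher/proxy scripts or `proxy.log` for the values actually used:

```bash
# Port and model name are assumptions; verify them against the example scripts.
curl http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain disaggregated prefill in one sentence.",
    "max_tokens": 64
  }'
```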
### Components

#### Server Scripts

- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server (a simplified sketch of the vLLM launch commands follows this list).
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example
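
For orientation, the prefill and decode servers are essentially two `vllm serve` instances: one configured as a KV producer and one as a KV consumer, each pointed at its own LMCache config file (see Configuration below). The sketch below shows roughly what such launch commands look like; the ports, GPU assignment, and exact flags are assumptions, so treat `disagg_vllm_launcher.sh` as the source of truth.

```bash
# Sketch only: ports, GPU indices, and flag values are assumptions, not copied from the script.

# Prefiller instance (produces KV cache)
CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=configs/lmcache-prefiller-config.yaml \
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer"}'

# Decoder instance (consumes KV cache)
CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=configs/lmcache-decoder-config.yaml \
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer"}'
```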
#### Configuration

- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server
#### Log Files

The main script generates several log files:

- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
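
To follow all three servers while the example runs, the logs can be tailed together:

```bash
# Stream the prefiller, decoder, and proxy logs in one terminal (Ctrl+C to stop).
tail -f prefiller.log decoder.log proxy.log
```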
## 2. CPU Offload Examples

- `python cpu_offload_lmcache.py -v v0` - CPU offloading implementation for vLLM v0
- `python cpu_offload_lmcache.py -v v1` - CPU offloading implementation for vLLM v1
## 3. KV Cache Sharing

The `kv_cache_sharing_lmcache_v1.py` example demonstrates how to share KV caches between vLLM v1 instances.
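
Assuming the script takes no required arguments (check its argument parsing if in doubt), it can be run directly:

```bash
python kv_cache_sharing_lmcache_v1.py
```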
## 4. Disaggregated Prefill in vLLM v0

The `disaggregated_prefill_lmcache_v0.py` example shows how to run disaggregated prefill in vLLM v0.
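
As with the other standalone examples, it can be launched directly; again, whether any flags are required should be checked in the script itself:

```bash
python disaggregated_prefill_lmcache_v0.py
```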