
# LMCache Examples

This folder demonstrates how to use LMCache for disaggregated prefill, CPU offloading, and KV cache sharing.

## 1. Disaggregated Prefill in vLLM v1

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.

### Prerequisites

- Install LMCache: `pip install lmcache`.
- Install NIXL.
- At least 2 GPUs.
- A valid Hugging Face token (`HF_TOKEN`) for Llama 3.1 8B Instruct.
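A possible setup sequence, assuming a pip-based environment (NIXL must be built and installed separately by following its own instructions):

```bash
# Install LMCache into the current environment
pip install lmcache

# NIXL is installed separately; see the NIXL project for build steps.

# Export a Hugging Face token that has access to Llama 3.1 8B Instruct
export HF_TOKEN=<your_huggingface_token>
```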

### Usage

Navigate to the `disagg_prefill_lmcache_v1` folder and run:

```bash
cd disagg_prefill_lmcache_v1
bash disagg_example_nixl.sh
```

This runs disaggregated prefill and benchmarks its performance.

### Components

#### Server Scripts

- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches the individual vLLM prefill/decode servers as well as the proxy server
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefiller and the decoder
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script that runs the example
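The coordination flow the proxy implements can be sketched as follows. This is an illustrative simplification with hypothetical names, not the actual `disagg_proxy_server.py`: a common pattern is to ask the prefiller for a single token, which forces it to compute and publish the prompt's KV cache, and then forward the full request to the decoder.

```python
from typing import Any, Callable, Dict

def proxy_request(
    request: Dict[str, Any],
    send_to_prefiller: Callable[[Dict[str, Any]], Any],
    send_to_decoder: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Coordinate one request across the prefill and decode servers."""
    # Ask the prefiller to emit a single token: this makes it compute the
    # prompt's KV cache (transferred to the decoder via NIXL) without
    # doing any meaningful decoding work itself.
    prefill_request = dict(request, max_tokens=1)
    send_to_prefiller(prefill_request)
    # Forward the untouched request to the decoder, which reuses the
    # prefiller's KV cache and generates the full completion.
    return send_to_decoder(request)
```

In the real example, the two callables correspond to HTTP calls against the prefiller and decoder vLLM servers, with the decoder's response streamed back to the client.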

#### Configuration

- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for the prefiller server
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for the decoder server

#### Log Files

The main script generates several log files:

- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server

## 2. CPU Offload Examples

- `python cpu_offload_lmcache.py -v v0` - CPU offloading for vLLM v0
- `python cpu_offload_lmcache.py -v v1` - CPU offloading for vLLM v1
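The general pattern these examples follow is to configure LMCache through environment variables before constructing the vLLM engine. A minimal sketch of that configuration step (the variable names follow LMCache's documented settings; the engine construction is shown commented out because it requires vLLM and a GPU):

```python
import os

# Keep KV blocks that no longer fit in GPU memory in CPU RAM instead of
# recomputing them (values are strings because they are env vars).
os.environ["LMCACHE_LOCAL_CPU"] = "True"
# Upper bound for the CPU-side KV cache, in GiB.
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

# The engine is then created with an LMCache KV connector, roughly like
# this (requires vLLM and a GPU, so left commented out here):
# from vllm import LLM
# from vllm.config import KVTransferConfig
# llm = LLM(
#     model="meta-llama/Llama-3.1-8B-Instruct",
#     kv_transfer_config=KVTransferConfig(
#         kv_connector="LMCacheConnectorV1", kv_role="kv_both"
#     ),
# )
```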

## 3. KV Cache Sharing

The `kv_cache_sharing_lmcache_v1.py` example demonstrates how to share KV caches between vLLM v1 instances.
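Conceptually, sharing works because LMCache keys KV data by hashes of token chunks, so any instance that processes the same prefix can reuse KV produced by another instance. A toy sketch of that idea (the store and functions here are hypothetical stand-ins, not LMCache's API):

```python
import hashlib

# Stand-in for the shared LMCache backend that all instances point at.
shared_kv_store: dict[str, bytes] = {}

def chunk_key(tokens: tuple) -> str:
    # KV data is keyed by a hash of the token chunk, so identical
    # prefixes map to the same entry regardless of which instance
    # produced them.
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def prefill(tokens: tuple) -> bytes:
    key = chunk_key(tokens)
    if key in shared_kv_store:
        # Cache hit: another instance (or an earlier request) already
        # computed the KV for this chunk, so skip recomputation.
        return shared_kv_store[key]
    kv = b"kv-for-" + repr(tokens).encode()  # placeholder for KV tensors
    shared_kv_store[key] = kv
    return kv
```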

## 4. Disaggregated Prefill in vLLM v0

The `disaggregated_prefill_lmcache_v0.py` script shows how to run disaggregated prefill in vLLM v0.