Tutorial: KV Cache Aware Routing

Introduction

This tutorial demonstrates how to use KV cache aware routing in the vLLM Production Stack. KV cache aware routing ensures that subsequent requests with the same prompt prefix are routed to the same instance, maximizing KV cache utilization and improving performance.
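To make the routing idea concrete, here is a conceptual sketch in plain shell. It is not the router's implementation: it compares raw characters, whereas a real router would work from the instances' actual cache state. The function names, the instance names (`vllm-0`, `vllm-1`), and the cached prompts are all hypothetical.

```shell
# Conceptual sketch only: character-level prefix matching, not the router's real logic.

# Print the number of leading characters that two strings share.
common_prefix_len() {
  a=$1
  b=$2
  n=0
  while [ -n "$a" ] && [ -n "$b" ]; do
    first_a=${a%"${a#?}"}   # first character of a
    first_b=${b%"${b#?}"}   # first character of b
    [ "$first_a" = "$first_b" ] || break
    a=${a#?}
    b=${b#?}
    n=$((n + 1))
  done
  echo "$n"
}

# pick_instance PROMPT NAME=CACHED_PROMPT...
# Print the name whose cached prompt shares the longest prefix with PROMPT,
# or "none" when nothing matches (a real router would need a fallback policy).
pick_instance() {
  prompt=$1
  shift
  best=none
  best_len=0
  for entry in "$@"; do
    name=${entry%%=*}
    cached=${entry#*=}
    len=$(common_prefix_len "$prompt" "$cached")
    if [ "$len" -gt "$best_len" ]; then
      best=$name
      best_len=$len
    fi
  done
  echo "$best"
}

# Hypothetical instance names and cache contents, for illustration.
pick_instance "What is the capital of France? And what is its population?" \
  "vllm-0=What is the capital of France?" \
  "vllm-1=Tell me a joke"
```

With these inputs the sketch prints `vllm-0`, because only that instance's cached prompt is a prefix of the new prompt.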

- Prerequisites
- Step 1: Deploy with KV Cache Aware Routing
- Step 2: Port Forwarding
- Step 3: Testing KV Cache Aware Routing
- Step 4: Clean Up

Prerequisites

- Completion of the following tutorials:
  - 00-install-kubernetes-env.md
  - 01-minimal-helm-installation.md
- A Kubernetes environment with GPU support
- Basic familiarity with Kubernetes and Helm

Step 1: Deploy with KV Cache Aware Routing

We'll use the predefined configuration file values-17-kv-aware.yaml, which sets up two vLLM instances with KV cache aware routing enabled.

Deploy the Helm chart with the configuration:

helm install vllm helm/ -f tutorials/assets/values-17-kv-aware.yaml

Note: to add more instances, give each one a different instanceId in its lmcacheConfig.
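Before deploying with extra instances, a rough check like the following can flag repeated IDs in a values file. It is only a sketch: the helper name is hypothetical, and it assumes each ID appears on a line of the form `instanceId: <value>` with consistent quoting and spacing.

```shell
# Print every "instanceId: ..." entry that occurs more than once in the file.
# No output means no duplicates were found by this (purely textual) check.
duplicate_instance_ids() {
  grep -o 'instanceId:.*' "$1" | sort | uniq -d
}

# Usage:
#   duplicate_instance_ids tutorials/assets/values-17-kv-aware.yaml
```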

Wait for the deployment to complete:

kubectl get pods -w
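`kubectl get pods -w` keeps streaming status changes until you interrupt it. If you would rather have a command that returns once everything is ready, `kubectl wait` is one option. The wrapper name and the 600s default timeout below are assumptions, and the command applies to every pod in the current namespace.

```shell
# Block until all pods in the current namespace report Ready, or the timeout expires.
# The default timeout is an arbitrary choice; pass a longer one for large models.
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all --timeout="${1:-600s}"
}

# Usage:
#   wait_for_pods         # default timeout
#   wait_for_pods 900s    # longer timeout
```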

Step 2: Port Forwarding

Forward the router service port to your local machine:

kubectl port-forward svc/vllm-router-service 30080:80
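The port-forward command runs in the foreground, so leave it running and use a second terminal for the remaining steps. A quick reachability check might look like the sketch below. It assumes the router exposes the OpenAI-compatible `/v1/models` endpoint; the helper name is hypothetical.

```shell
# Print the HTTP status code returned by the router's model listing endpoint.
# curl prints 000 when it cannot connect at all.
check_router() {
  curl -s -o /dev/null -w '%{http_code}\n' "${1:-http://localhost:30080}/v1/models"
}

# Usage:
#   check_router
```

A status of 200 suggests the router is reachable through the forwarded port.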

Step 3: Testing KV Cache Aware Routing

First, send a request to the router:

curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }'

Then, send another request with the same prompt prefix:

curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the capital of France? And what is its population?",
    "max_tokens": 100
  }'

You should observe that the second request is routed to the same instance as the first request. This is because the KV cache aware router detects that the second request shares a prefix with the first request and routes it to the same instance to maximize KV cache utilization.
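One way to check where each request went is to send both prompts and then read the router's recent log output. The helpers below are a sketch with hypothetical names. The router pod name is whatever `kubectl get pods` shows in your cluster, and the log format is not specified here, so the sketch prints the raw lines instead of filtering them.

```shell
ROUTER_URL="${ROUTER_URL:-http://localhost:30080}"
MODEL="${MODEL:-meta-llama/Llama-3.1-8B-Instruct}"

# Build the JSON request body. The prompt is inserted as-is, so it must not
# contain double quotes or backslashes.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %s}' "$MODEL" "$1" "${2:-100}"
}

# POST a completion request to the router and print the response.
send_completion() {
  curl -s "$ROUTER_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "$2")"
}

# Print the last lines of a pod's log (default: 20 lines).
show_router_logs() {
  kubectl logs "$1" --tail="${2:-20}"
}

# Usage:
#   send_completion "What is the capital of France?"
#   send_completion "What is the capital of France? And what is its population?"
#   show_router_logs <router-pod-name>
```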

Step 4: Clean Up

To clean up the deployment:

helm uninstall vllm
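To confirm that the release is gone, you can list what remains. This is a minimal sketch with a hypothetical helper name; it assumes the release was installed as `vllm` in the current namespace, as in Step 1. Remember to stop the port-forward process as well (Ctrl+C in its terminal).

```shell
# List any Helm release named exactly "vllm", then the pods that remain.
# After a successful uninstall, the first command prints nothing.
verify_cleanup() {
  helm list --filter '^vllm$' --short
  kubectl get pods
}

# Usage:
#   verify_cleanup
```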

Conclusion

In this tutorial, we've demonstrated how to:

- Deploy vLLM Production Stack with KV cache aware routing
- Set up port forwarding to access the router
- Test the KV cache aware routing functionality

The KV cache aware routing feature helps improve performance by ensuring that requests with shared prefixes are routed to the same instance, maximizing KV cache utilization.
