Tutorial: Running vLLM with v1 Configuration
Introduction
This tutorial demonstrates how to deploy vLLM with v1 configuration enabled. The v1 configuration uses the LMCacheConnectorV1 for KV cache management, which provides improved performance and stability for certain workloads.
Prerequisites
- A Kubernetes cluster with GPU support
- Helm installed on your local machine
- Completion of the earlier tutorials in this series
Step 1: Understanding the Configuration
The configuration file values-14-vllm-v1.yaml includes several important settings:
- Model Configuration:
  - Llama-3.1-8B-Instruct model
  - Single replica deployment
  - Resource requirements: 6 CPU, 16Gi memory, 1 GPU
  - 50Gi persistent storage
- vLLM Configuration:
  - v1 mode enabled (v1: 1)
  - bfloat16 precision
  - Maximum sequence length of 4096 tokens
  - GPU memory utilization set to 80%
- LMCache Configuration:
  - KV cache offloading enabled
  - 20GB CPU offloading buffer size
- Cache Server Configuration:
  - Single replica cache server
  - Naive serialization/deserialization
  - Resource limits: 2 CPU, 10Gi memory
Feel free to change these parameters for your own scenario; a sketch of what such a values file looks like is shown below.
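For orientation, here is a minimal sketch of a values file with the settings described above. The key names are illustrative and may not match the production-stack Helm chart exactly; treat tutorials/assets/values-14-vllm-v1.yaml in the repository as the authoritative version.

servingEngineSpec:
  modelSpec:
    - name: "llama3"                     # illustrative name
      repository: "vllm/vllm-openai"     # illustrative image
      tag: "latest"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      pvcStorage: "50Gi"
      vllmConfig:
        v1: 1
        dtype: "bfloat16"
        maxModelLen: 4096
        gpuMemoryUtilization: 0.8        # illustrative key; the chart may pass this as an extra argument
      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "20"
cacheserverSpec:
  replicaCount: 1
  serde: "naive"
  resources:                             # the 2 CPU / 10Gi limits described above
    limits:
      cpu: "2"
      memory: "10Gi"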
Step 2: Deploying the Stack
- First, ensure you're in the correct directory:

  cd production-stack

- Deploy the stack using Helm:

  helm install vllm helm/ -f tutorials/assets/values-14-vllm-v1.yaml

- Verify the deployment:

  kubectl get pods

  You should see:
  - A vLLM pod for the Llama model
  - A cache server pod
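The exact pod names depend on your Helm release name and chart version; illustrative output looks something like the following (a router pod is typically present as well, since the next steps forward the router service):

  NAME                                           READY   STATUS    RESTARTS   AGE
  vllm-llama3-deployment-vllm-6d4f9c7b8-abcde    1/1     Running   0          3m
  vllm-deployment-cache-server-7c9d5f6b4-fghij   1/1     Running   0          3m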
Step 3: Verifying the Configuration
- Check the vLLM pod logs to verify the v1 configuration:

  kubectl logs -f <vllm-pod-name>

  Look for the following log message:

  INFO 04-29 12:12:25 [factory.py:64] Creating v1 connector with name: LMCacheConnectorV1

- Forward the router service port:

  kubectl port-forward svc/vllm-router-service 30080:80
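Keep the port-forward running in a separate terminal. As an optional sanity check before sending completions, you can list the served models; this assumes the router exposes the standard OpenAI-compatible model listing endpoint:

  curl http://localhost:30080/v1/models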
Step 4: Testing the Deployment
Send a test request to verify the deployment:
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the benefits of using v1 configuration in vLLM.",
"max_tokens": 100
}'
Note that the prompt must be longer than 256 tokens (the chunk size configured in LMCache) for the KV cache to be reused; the sketch below shows one way to exercise this.
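As an illustrative way to observe cache reuse, you can send the same long prompt twice and compare latencies: once the prefix exceeds the 256-token chunk size, the second request can be served partly from the LMCache offloading buffer. This snippet is a sketch, not part of the tutorial's assets; the printf repetition is just a quick way to build a prompt well past 256 tokens.

  # Build a prompt that is comfortably longer than 256 tokens (illustrative).
  LONG_PROMPT=$(printf 'Explain the benefits of KV cache offloading in LLM serving. %.0s' {1..30})

  # Send the same request twice; the second run should be noticeably faster
  # if the cached KV prefix is reused.
  for i in 1 2; do
    time curl -s -X POST http://localhost:30080/v1/completions \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"prompt\": \"$LONG_PROMPT\", \"max_tokens\": 50}" \
      -o /dev/null
  done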
Conclusion
This tutorial demonstrated how to deploy vLLM with v1 configuration enabled. The v1 configuration provides improved KV cache management through LMCacheConnectorV1, which can lead to better performance for certain workloads. You can adjust the configuration parameters in the values file to optimize for your specific use case.