Tutorial: Sleep Mode Aware Routing
Introduction
This tutorial demonstrates how to use the vLLM v1 sleep and wake_up feature with the vLLM Production Stack. A sleeping engine does not process any requests and frees up resources (e.g., GPU memory). The router supports serving requests to vLLM engines that have sleep mode enabled.
Table of Contents
- Prerequisites
- Step 1: Deploy with Sleep Mode Enabled
- Step 2: Port Forwarding
- Step 3: Testing Sleep Mode Aware Routing
- Step 4: Clean Up
Prerequisites
- Completion of the previous tutorials in this series
- A Kubernetes environment with GPU support
- Basic familiarity with Kubernetes and Helm
Step 1: Deploy with Sleep Mode Enabled
We'll use the predefined configuration file values-19-sleep-mode-aware.yaml, which sets up a vLLM instance with sleep mode enabled.
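To see how sleep mode is turned on for the engine, you can inspect the file before installing:
cat tutorials/assets/values-19-sleep-mode-aware.yaml
In vLLM, sleep mode is typically enabled with the --enable-sleep-mode engine flag, and the /sleep, /wake_up, and /is_sleeping endpoints require the VLLM_SERVER_DEV_MODE=1 environment variable; the exact keys used to set these may differ between chart versions, so treat the bundled values file as the source of truth.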
Deploy the Helm chart with the configuration:
helm install vllm helm/ -f tutorials/assets/values-19-sleep-mode-aware.yaml
Wait for the deployment to complete:
kubectl get pods -w
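If you prefer a non-interactive wait, for example in a script, kubectl can block until every pod reports Ready (a minimal sketch; adjust the timeout to your cluster):
kubectl wait --for=condition=Ready pods --all --timeout=600s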
Step 2: Port Forwarding
Forward the router service port to your local machine:
kubectl port-forward svc/vllm-router-service 30080:80
Step 3: Testing Sleep Mode Aware Routing
First, get the list of available engines:
curl -o- http://localhost:30080/engines | jq
Expected similar output:
[
  {
    "engine_id": "b36921ab-6611-58c0-a941-16c51296446b",
    "serving_models": [
      "ibm-granite/granite-3.0-3b-a800m-instruct"
    ],
    "created": 1750035988
  }
]
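For the following steps, it is convenient to capture the engine id in a shell variable instead of copying it by hand (a small sketch assuming a single engine is deployed and jq is available):
# Store the id of the first engine returned by the router
ENGINE_ID=$(curl -s http://localhost:30080/engines | jq -r '.[0].engine_id')
echo "$ENGINE_ID"
The curl commands below can then use "$ENGINE_ID" in place of the literal id.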
Using the engine_id of the target engine, check its sleeping state:
curl -o- http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b | jq
Expected output:
{
  "is_sleeping": false
}
Put the engine to sleep and check its sleeping state:
curl -X POST http://localhost:30080/sleep?id=b36921ab-6611-58c0-a941-16c51296446b | jq
Expected output:
{
  "status": "success"
}
curl -o- http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b | jq
Expected output:
{
  "is_sleeping": true
}
The logs of the vLLM pod show that the engine was put to sleep:
INFO 06-15 18:08:18 [gpu_worker.py:81] Sleep mode freed 39.26 GiB memory, 1.20 GiB memory is still in use.
INFO 06-15 18:08:18 [executor_base.py:210] It took 5.749613 seconds to fall asleep.
INFO: 10.130.2.172:47082 - "POST /sleep HTTP/1.1" 200 OK
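In the log above, the sleep call returned only after the engine had actually fallen asleep (about 5.7 seconds). If you script these steps, a small polling loop is a safe way to wait for the state to flip (a sketch using the ENGINE_ID variable captured earlier):
# Poll the router until the engine reports that it is sleeping
until curl -s "http://localhost:30080/is_sleeping?id=$ENGINE_ID" | jq -e '.is_sleeping == true' > /dev/null; do
  echo "waiting for the engine to fall asleep..."
  sleep 1
done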
Then, send a request to the router:
curl http://localhost:30080/v1/completions?id=b36921ab-6611-58c0-a941-16c51296446b \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }' | jq
Expected output:
{
  "error": "Model ibm-granite/granite-3.0-3b-a800m-instruct not found or vLLM engine is sleeping."
}
Now, wake up the vLLM engine and check its sleeping state:
curl -X POST http://localhost:30080/wake_up?id=b36921ab-6611-58c0-a941-16c51296446b | jq
Expected output:
{
  "status": "success"
}
curl -o- http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b | jq
Expected output:
{
  "is_sleeping": false
}
The logs of the vLLM pod show that the engine was woken up:
INFO 06-15 18:11:37 [api_server.py:719] wake up the engine with tags: None
INFO 06-15 18:11:37 [executor_base.py:226] It took 0.284914 seconds to wake up tags {'kv_cache', 'weights'}.
INFO: 10.130.2.172:46672 - "POST /wake_up HTTP/1.1" 200 OK
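You can follow the engine logs yourself; replace the pod name with the one reported by kubectl get pods:
kubectl logs -f <vllm-pod-name>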
Lastly, re-send a request to the router:
curl http://localhost:30080/v1/completions?id=b36921ab-6611-58c0-a941-16c51296446b \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }' | jq
Expected similar output:
{
  "id": "cmpl-125b905e89a34384af754a24bc8ea686",
  "object": "text_completion",
  "created": 1750036455,
  "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.\n\n[Answer] The capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 31,
    "completion_tokens": 24,
    "prompt_tokens_details": null
  }
}
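Putting the pieces together, the whole sleep/wake cycle can be scripted end to end with the endpoints used above (a sketch; the base URL assumes the port-forward from Step 2 and a single deployed engine):
# Discover the engine served by the router
BASE_URL=http://localhost:30080
ENGINE_ID=$(curl -s "$BASE_URL/engines" | jq -r '.[0].engine_id')

# Put the engine to sleep and confirm the state
curl -s -X POST "$BASE_URL/sleep?id=$ENGINE_ID" | jq
curl -s "$BASE_URL/is_sleeping?id=$ENGINE_ID" | jq

# Wake the engine up again and confirm the state
curl -s -X POST "$BASE_URL/wake_up?id=$ENGINE_ID" | jq
curl -s "$BASE_URL/is_sleeping?id=$ENGINE_ID" | jq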
Step 4: Clean Up
To clean up the deployment:
helm uninstall vllm
Conclusion
In this tutorial, we've demonstrated how to:
- Deploy the vLLM Production Stack with sleep mode enabled for the vLLM engine
- Set up port forwarding to access the router
- Test the sleep mode feature for the vLLM engine and the sleep mode aware routing functionality
The sleep mode aware routing feature ensures that the router does not forward requests to a sleeping engine, thereby improving the performance of the Production Stack.