Tutorial: Basic vLLM Configurations
Introduction
This tutorial guides you through the basic configurations required to deploy a vLLM serving engine in a Kubernetes environment with GPU support. You will learn how to specify the model details, set up necessary environment variables (like HF_TOKEN), and launch the vLLM serving engine.
Table of Contents
- Prerequisites
- Step 1: Preparing the Configuration File
- Step 2: Applying the Configuration
- Step 3: Verifying the Deployment
- Step 4 (Optional): Multi-GPU Deployment
Prerequisites
- A Kubernetes environment with GPU support, as set up in the 00-install-kubernetes-env tutorial.
- Helm installed on your system.
- Access to a Hugging Face token (HF_TOKEN).
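Before moving on, it can help to confirm these prerequisites from the command line. The checks below are only a sketch, assuming the NVIDIA device plugin from the 00-install-kubernetes-env tutorial advertises GPUs as nvidia.com/gpu:

# Confirm the cluster nodes advertise GPU capacity (assumes the NVIDIA device plugin is installed)
kubectl describe nodes | grep -i "nvidia.com/gpu"

# Confirm Helm is installed
helm version

# Export your Hugging Face token so later steps can reference it
export HF_TOKEN="<YOUR HF TOKEN>"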
Step 1: Preparing the Configuration File
- Locate the example configuration file tutorials/assets/values-02-basic-config.yaml.
- Open the file and update the following fields:
  - Write your actual Hugging Face token into hf_token: <YOUR HF TOKEN> in the YAML file (a scripted alternative is shown right after this list).
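If you exported HF_TOKEN as in the prerequisites, one convenient way to fill in the token without hand-editing the file is a simple substitution. This is just a sketch; editing the file directly works equally well:

# Substitute the placeholder in place (assumes HF_TOKEN is exported in your shell; on macOS/BSD sed, use: sed -i '' ...)
sed -i "s|<YOUR HF TOKEN>|${HF_TOKEN}|g" tutorials/assets/values-02-basic-config.yaml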
Explanation of Key Items in values-02-basic-config.yaml
- name: The unique identifier for your model deployment.
- repository: The Docker repository containing the model's serving engine image.
- tag: Specifies the version of the model image to use.
- modelURL: The URL pointing to the model on Hugging Face or another hosting service.
- replicaCount: The number of replicas for the deployment, allowing scaling for load.
- requestCPU: The amount of CPU resources requested per replica.
- requestMemory: Memory allocation for the deployment; sufficient memory is required to load the model.
- requestGPU: Specifies the number of GPUs to allocate for the deployment.
- pvcStorage: Defines the Persistent Volume Claim size for model storage.
- vllmConfig: Contains model-specific configurations:
  - enableChunkedPrefill: Splits long prompt prefills into smaller chunks that can be batched alongside decode requests, smoothing latency under load.
  - enablePrefixCaching: Speeds up response times for queries that share common prefixes by reusing their cached computation.
  - maxModelLen: The maximum sequence length the model can handle.
  - dtype: Data type for computations, e.g., bfloat16 for faster performance on modern GPUs.
  - extraArgs: Additional arguments passed to the vLLM engine for fine-tuning behavior.
- hf_token: The Hugging Face token for authenticating with the Hugging Face model hub.
- env: Extra environment variables to pass to the model-serving engine.
Example Snippet
servingEngineSpec:
modelSpec:
- name: "llama3"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
replicaCount: 1
requestCPU: 10
requestMemory: "16Gi"
requestGPU: 1
pvcStorage: "50Gi"
vllmConfig:
enableChunkedPrefill: false
enablePrefixCaching: false
maxModelLen: 16384
dtype: "bfloat16"
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
hf_token: <YOUR HF TOKEN>
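Before installing, you can optionally render the chart locally against this values file to inspect the manifests it produces. This uses standard Helm commands and assumes the vllm Helm repo has already been added as shown in Step 2 below:

# Render the templates without installing anything
helm template vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml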
Step 2: Applying the Configuration
Deploy the configuration using Helm:
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
Expected output:
You should see output indicating the successful deployment of the Helm chart:
Release "vllm" has been deployed. Happy Helming!
NAME: vllm
LAST DEPLOYED: <timestamp>
NAMESPACE: default
STATUS: deployed
REVISION: 1
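You can also confirm the release status at any time with standard Helm commands:

# Inspect the release and list everything Helm has deployed in this namespace
helm status vllm
helm list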
Step 3: Verifying the Deployment
- Check the status of the pods:

  kubectl get pods

  Expected output:

  You should see the following pods:

  NAME                                    READY   STATUS    RESTARTS   AGE
  vllm-deployment-router-xxxx-xxxx        1/1     Running   0          3m23s
  vllm-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          3m23s

  - The vllm-deployment-router pod acts as the router, managing requests and routing them to the appropriate model-serving pod.
  - The vllm-llama3-deployment-vllm pod serves the actual model for inference.
- Verify the service is exposed correctly:

  kubectl get services

  Expected output:

  Ensure there are services for both the serving engine and the router:

  NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
  vllm-engine-service   ClusterIP   10.103.98.170    <none>        80/TCP    4m
  vllm-router-service   ClusterIP   10.103.110.107   <none>        80/TCP    4m

  - The vllm-engine-service exposes the serving engine.
  - The vllm-router-service handles routing and load balancing across model-serving pods.
- Test the health endpoint:

  curl http://<SERVICE_IP>/health

  Replace <SERVICE_IP> with the IP of the service you want to check. Note that both services are of type ClusterIP, so the request must come from inside the cluster, or you can port-forward the router service first (see the sketch after this list). If everything is configured correctly, you will get:

  {"status":"healthy"}
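Since the services above are ClusterIP and have no external IP, a simple way to reach them from your local machine is kubectl port-forward. A minimal sketch, assuming the router listens on port 80 as shown in the service listing (the local port 30080 is arbitrary):

# Forward a local port to the router service
kubectl port-forward svc/vllm-router-service 30080:80

# In another terminal, hit the health endpoint
curl http://localhost:30080/health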
Please refer to Step 3 in the 01-minimal-helm-installation tutorial for querying the deployed vLLM service.
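As a quick smoke test, assuming the router exposes the standard OpenAI-compatible API as in that tutorial, you can list the served models through the same port-forward:

# List the models registered with the router
curl http://localhost:30080/v1/models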
Step 4 (Optional): Multi-GPU Deployment
So far, you have configured and deployed a vLLM serving engine with a single GPU. You can also deploy the serving engine on multiple GPUs with the following example configuration snippet:
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "llama3"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
replicaCount: 1
requestCPU: 10
requestMemory: "16Gi"
requestGPU: 2
pvcStorage: "50Gi"
pvcAccessMode:
- ReadWriteOnce
vllmConfig:
enableChunkedPrefill: false
enablePrefixCaching: false
maxModelLen: 4096
tensorParallelSize: 2
dtype: "bfloat16"
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
hf_token: <YOUR HF TOKEN>
shmSize: "20Gi"
Note that only tensor parallelism is supported for now. The shmSize field must be configured whenever requestGPU is greater than one, so that the processes running tensor parallelism have enough shared memory to communicate.
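Once the multi-GPU deployment is running, you can sanity-check that the serving pod actually sees both GPUs. A sketch, assuming nvidia-smi is available inside the serving image and using a placeholder pod name:

# Replace the pod name with the real one from `kubectl get pods`
kubectl exec -it vllm-llama3-deployment-vllm-xxxx-xxxx -- nvidia-smi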
Conclusion
In this tutorial, you configured and deployed a vLLM serving engine with GPU support, on a single GPU or across multiple GPUs, in a Kubernetes environment. You also learned how to verify the deployment and ensure it is running as expected. For further customization, refer to the values.yaml file and the Helm chart documentation.