Tutorial: Minimal Setup of the vLLM Production Stack on AMD GPU nodes
Introduction
This tutorial guides you through a minimal setup of the vLLM Production Stack using one vLLM instance with the facebook/opt-125m model on AMD GPU nodes. By the end of this tutorial, you will have a working deployment of vLLM on a Kubernetes environment with AMD GPU.
Prerequisites
- A Kubernetes environment with GPU support. If not set up, follow the 00-install-kubernetes-env guide.
- Helm installed. Refer to the install-helm.sh script for instructions.
- kubectl installed. Refer to the install-kubectl.sh script for instructions.
- If your cluster includes nodes with AMD GPUs, follow the setup instructions provided in ROCm's Kubernetes device plugin repository to prepare them for deployment.
- The vLLM Production Stack repository cloned locally.
- Basic familiarity with Kubernetes and Helm.
Steps
1. Deploy vLLM Instance
1.1: Use Predefined Configuration
The vLLM Production Stack repository provides a predefined configuration file, values-01-minimal-amd-example.yaml, located at tutorials/assets/values-01-minimal-amd-example.yaml. This file contains the following content:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "rocm/vllm"
    tag: "instinct_main"
    modelURL: facebook/opt-125m
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    # Optional resource limits - if not specified, only GPU will have a limit
    # limitCPU: "8"
    # limitMemory: "32Gi"
    # Required for scheduling pods on nodes with AMD GPUs
    requestGPUType: "amd.com/gpu"
    hf_token: <YOUR HF TOKEN>
    # Optional node selector terms - if specified, workloads will be scheduled
    # only on matching nodes (e.g., nodes with AMD GPUs)
    nodeSelectorTerms:
    - matchExpressions:
      - key: "your-node-label"
        operator: "In"
        values: ["your-node-label-value"]
    # Optional tolerations - if specified, workloads can be scheduled on nodes
    # with the corresponding taints (e.g., nodes with AMD GPUs)
    tolerations:
    - key: "your-node-taint"
      operator: "Equal"
      value: "your-node-taint-value"
      effect: "NoSchedule"
```
Explanation of the key fields:

- `modelSpec`: Defines the model configuration, including:
  - `name`: A name for the model deployment.
  - `repository`: Docker repository hosting the model image.
  - `tag`: Docker image tag.
  - `modelURL`: Specifies the LLM model to use.
  - `replicaCount`: Sets the number of replicas to deploy.
  - `requestCPU` and `requestMemory`: Specify the CPU and memory resource requests for the pod.
  - `limitCPU` and `limitMemory`: Specify the maximum CPU and memory limits for the pod.
  - `requestGPU`: Specifies the number of GPUs required.
  - `requestGPUType`: Set to `amd.com/gpu` to schedule workloads on nodes equipped with AMD GPUs.
  - `nodeSelectorTerms`: Defines conditions to match specific nodes for scheduling workloads, based on node labels.
- `tolerations`: Allows pods to be scheduled on nodes with matching taints, enabling workloads to tolerate specific node conditions.
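As a concrete example of the resource fields, a value such as `requestMemory: "16Gi"` uses Kubernetes quantity suffixes (binary `Ki`/`Mi`/`Gi`/`Ti`, decimal `k`/`M`/`G`/`T`). The following is a minimal sketch of how such quantities translate into bytes; the helper is illustrative and not part of the chart:

```python
import re

# Binary (Ki, Mi, Gi, ...) and decimal (k, M, G, ...) multipliers used by
# Kubernetes resource quantities such as requestMemory: "16Gi".
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

def parse_quantity(q: str) -> int:
    """Parse an integer Kubernetes-style quantity string into bytes."""
    m = re.fullmatch(r"(\d+)([A-Za-z]*)", q.strip())
    if not m:
        raise ValueError(f"unrecognized quantity: {q!r}")
    value, suffix = int(m.group(1)), m.group(2)
    return value if suffix == "" else value * SUFFIXES[suffix]

print(parse_quantity("16Gi"))  # 17179869184
```

So the `16Gi` request above asks for 16 × 1024³ bytes of memory per pod.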
Note: If you intend to set up two vLLM pods, refer to tutorials/assets/values-01-2pods-minimal-example.yaml.
1.2: Deploy the Helm Chart
Deploy the Helm chart using the predefined configuration file:
```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-amd-example.yaml
```
Explanation of the command:

- `vllm` in the first command: The name of the Helm repository.
- `vllm` in the second command: The name of the Helm release.
- `-f tutorials/assets/values-01-minimal-amd-example.yaml`: Specifies the predefined configuration file. You can also check out the default configuration values for the `vllm-stack` chart.
2. Validate Installation
2.1: Monitor Deployment Status
Monitor the deployment status using:
```bash
kubectl get pods
```

Expected output:

- Pods for the `vllm` deployment should reach the `Running` status with all containers `Ready`.

```plaintext
NAME                                            READY   STATUS    RESTARTS   AGE
vllm-deployment-router-74dd87b649-gh754         1/1     Running   0          45m
vllm-opt125m-deployment-vllm-7b5494fb4c-qbsxr   1/1     Running   0          42m
```
Note: It may take some time for the containers to download the Docker images and LLM weights.
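Rather than re-running `kubectl get pods` by hand while the images and weights download, the readiness check can be scripted. The following is a minimal sketch that polls the cluster through `kubectl`'s JSON output; it assumes a configured `kubectl`, and the helper names are illustrative:

```python
import json
import subprocess
import time

def pods_ready(pod_list: dict) -> bool:
    """True when every pod in `kubectl get pods -o json` output is
    Running with all of its containers reporting ready."""
    pods = pod_list.get("items", [])
    return bool(pods) and all(
        pod["status"].get("phase") == "Running"
        and all(c.get("ready") for c in pod["status"].get("containerStatuses", []))
        for pod in pods
    )

def wait_for_pods(poll_seconds: int = 10) -> None:
    """Poll the cluster until all pods are ready (needs a configured kubectl)."""
    while True:
        out = subprocess.run(
            ["kubectl", "get", "pods", "-o", "json"],
            check=True, capture_output=True, text=True,
        )
        if pods_ready(json.loads(out.stdout)):
            print("all pods ready")
            return
        time.sleep(poll_seconds)
```

Calling `wait_for_pods()` blocks until both the router and serving-engine pods above are ready.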
3. Send a Query to the Stack
3.1: Forward the Service Port
Expose the vllm-router-service port to the host machine:
```bash
kubectl port-forward svc/vllm-router-service 30080:80
```
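To confirm the forward is active before sending requests, you can probe the local port. A small sketch using only the Python standard library (the helper name is illustrative):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True once the port-forward above is running
print(port_open("localhost", 30080))
```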
3.2: Query the OpenAI-Compatible API to list the available models
Test the stack's OpenAI-compatible API by querying the available models:
```bash
curl -o- http://localhost:30080/v1/models
```
Expected output:
```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1744611824,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```
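The same check can be done programmatically. Below is a minimal sketch that fetches and parses the `/v1/models` response with only the Python standard library; the helper names are illustrative, not part of the stack:

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Extract the served model ids from a /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]

def list_models(base_url: str) -> list:
    """Fetch the OpenAI-compatible model listing from the router."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))

# With the port-forward from step 3.1 active:
# print(list_models("http://localhost:30080"))
```

For the deployment above, the listing should contain a single id, `facebook/opt-125m`.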
3.3: Query the OpenAI Completion Endpoint
Send a query to the OpenAI-compatible /v1/completions endpoint to generate a completion for a prompt:
```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```
Example output of the generated completions:
```json
{
  "id": "cmpl-4aef720c39a0451e90d71938982e84b7",
  "object": "text_completion",
  "created": 1744611924,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " my mother and I knew each other and had a",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 16,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}
```
This demonstrates the model generating a continuation for the provided prompt.
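The same request can be issued from Python. Below is a minimal sketch using only the standard library; the helper names are illustrative, and any OpenAI-compatible client pointed at the router URL should work equally well:

```python
import json
import urllib.request

def completion_body(model: str, prompt: str, max_tokens: int) -> bytes:
    """Serialize the JSON body expected by /v1/completions."""
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")

def complete(base_url: str, model: str, prompt: str, max_tokens: int = 10) -> str:
    """POST a completion request to the router and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=completion_body(model, prompt, max_tokens),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# With the port-forward from step 3.1 active:
# print(complete("http://localhost:30080", "facebook/opt-125m", "Once upon a time,"))
```

In the sample output above, `usage.total_tokens` (16) is the sum of `prompt_tokens` (6) and `completion_tokens` (10), and `finish_reason: "length"` indicates generation stopped at the `max_tokens` limit.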
4. Uninstall
To remove the deployment, run:
```bash
helm uninstall vllm
```