
AI Inference with vLLM on Kubernetes

Purpose / What You'll Learn

This example demonstrates how to deploy a server for AI inference using vLLM on Kubernetes. You'll learn how to:

  • Set up a vLLM inference server that serves a model downloaded from Hugging Face.
  • Expose the inference endpoint using a Kubernetes Service.
  • Set up port forwarding from your local machine to the inference Service in the Kubernetes cluster.
  • Send a sample prediction request to the server using curl.

📚 Table of Contents

  • Prerequisites
  • Detailed Steps & Explanation
  • Verification / Seeing it Work
  • Configuration Customization
  • Platform-Specific Configuration
  • Cleanup
  • Further Reading / Next Steps

Prerequisites

  • A Kubernetes cluster with access to NVIDIA GPUs. This example was tested on GKE, but it can be adapted for other cloud providers such as EKS and AKS by ensuring you have a GPU-enabled node pool and have deployed the NVIDIA device plugin.
  • A Hugging Face account token with read access to the model (this example uses google/gemma-3-1b-it).
  • kubectl configured to communicate with your cluster and available in your PATH.
  • The curl binary available in your PATH.

Note for GKE users: To target specific GPU types, you can uncomment the GKE-specific nodeSelector in vllm-deployment.yaml.


Detailed Steps & Explanation

  1. Create a namespace. This example uses vllm-example, but you can choose any name:
kubectl create namespace vllm-example
  2. Create a Secret with your Hugging Face token so the server can download the model (how the deployment consumes it is sketched after the command):
# The HF_TOKEN env var contains your Hugging Face account token
# Make sure to use the same namespace as in the previous step
kubectl create secret generic hf-secret -n vllm-example \
  --from-literal=hf_token=$HF_TOKEN
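
The deployment consumes this Secret as an environment variable so that vLLM can authenticate to Hugging Face. The snippet below is a minimal sketch of that wiring; the environment variable name is an assumption (HUGGING_FACE_HUB_TOKEN is one name the Hugging Face client libraries recognize), so check vllm-deployment.yaml for the exact spelling used there.

# Sketch only: how a container can consume the hf-secret created above
env:
  - name: HUGGING_FACE_HUB_TOKEN   # assumed variable name; the real manifest may differ
    valueFrom:
      secretKeyRef:
        name: hf-secret            # Secret created in the previous command
        key: hf_token              # key set via --from-literal above
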
  3. Deploy the vLLM server (an illustrative sketch of the manifest follows this step):
# Make sure to use the same namespace as in the previous steps
kubectl apply -f vllm-deployment.yaml -n vllm-example
  • Wait for the deployment to become available, then watch the vLLM pod(s) come up:
kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment -n vllm-example
kubectl get pods -l app=gemma-server -w -n vllm-example
  • View vLLM pod logs:
kubectl logs -f -l app=gemma-server -n vllm-example

Expected output:

        INFO:     Automatically detected platform cuda.
        ...
        INFO      [launcher.py:34] Route: /v1/chat/completions, Methods: POST
        ...
        INFO:     Started server process [13]
        INFO:     Waiting for application startup.
        INFO:     Application startup complete.
        Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-1b-it-1" on port 8080.
...
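
For orientation, the sketch below outlines what a minimal vLLM Deployment of this kind can look like. It is not the contents of vllm-deployment.yaml: the image, arguments, and the MODEL_ID variable are assumptions for illustration, and the manifest in this directory remains the source of truth.

# Illustrative sketch of a vLLM Deployment; see vllm-deployment.yaml for the real manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest        # assumed image; the real manifest may pin a tag
          env:
            - name: MODEL_ID
              value: google/gemma-3-1b-it
            - name: HUGGING_FACE_HUB_TOKEN      # assumed name; see the Secret sketch above
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_token
          args: ["--model", "$(MODEL_ID)", "--port", "8080"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1                 # one GPU per replica
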
  4. Create the Service (a sketch of the manifest follows):
# ClusterIP service on port 8080 in front of vllm deployment
# Make sure to use the same namespace as in the previous steps
kubectl apply -f vllm-service.yaml -n vllm-example
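
The Service is a plain ClusterIP Service in front of the deployment. The sketch below is illustrative, reusing the selector and port from the steps above; vllm-service.yaml is the source of truth.

# Illustrative sketch; see vllm-service.yaml for the real manifest
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: ClusterIP
  selector:
    app: gemma-server       # matches the pod label used by the deployment
  ports:
    - protocol: TCP
      port: 8080            # port exposed by the Service
      targetPort: 8080      # container port serving the vLLM API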

Verification / Seeing it Work

  1. Forward local requests to the vLLM Service:
# Forward a local port (e.g., 8080) to the service port (e.g., 8080)
# Make sure to use the same namespace as in the previous steps
kubectl port-forward service/vllm-service 8080:8080 -n vllm-example
  2. Send a request to the forwarded local port:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "google/gemma-3-1b-it",
  "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
  "max_tokens": 100
}'

Expected output (or similar):

{"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, lets break down quantum computing in a way thats hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day  laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does  from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}

Configuration Customization

  • Update MODEL_ID in the deployment manifest to serve a different model (make sure your Hugging Face access token has read access to it); see the sketch below.
  • Change the number of vLLM pod replicas in the deployment manifest.
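
As a rough sketch, assuming the manifest exposes the model through a MODEL_ID environment variable as described above, these are the fields you would typically edit (the alternative model name is only a hypothetical example):

# Fields typically edited in vllm-deployment.yaml (sketch, not the actual file)
spec:
  replicas: 2                                    # scale the number of vLLM pods
  template:
    spec:
      containers:
        - name: vllm
          env:
            - name: MODEL_ID
              value: Qwen/Qwen2.5-1.5B-Instruct  # hypothetical alternative model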

Platform-Specific Configuration

Node selectors make sure vLLM pods land on nodes with the correct GPU, and they are the main difference between cloud providers. The following are node selector examples for three providers.

  • GKE
    This nodeSelector uses labels that are specific to Google Kubernetes Engine.
    • cloud.google.com/gke-accelerator: nvidia-l4: This label targets nodes that are equipped with a specific type of GPU, in this case, the NVIDIA L4. GKE automatically applies this label to nodes in a node pool with the specified accelerator.
    • cloud.google.com/gke-gpu-driver-version: default: This label ensures that the pod is scheduled on a node with the GKE-managed default NVIDIA driver version, which GKE installs and manages automatically.
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-l4
      cloud.google.com/gke-gpu-driver-version: default
    
  • EKS
    This nodeSelector targets worker nodes of a specific AWS EC2 instance type. The label node.kubernetes.io/instance-type is automatically applied by Kubernetes on AWS. In this example, p4d.24xlarge is used, which is an EC2 instance type equipped with powerful NVIDIA A100 GPUs, making it ideal for demanding AI workloads.
    nodeSelector:
      node.kubernetes.io/instance-type: p4d.24xlarge
    
  • AKS
    This example uses a custom label, agentpiscasi.com/gpu: "true". AKS does not apply this label automatically; a cluster administrator would typically add it to identify and target node pools that have GPUs attached.
    nodeSelector:
      agentpiscasi.com/gpu: "true" # Custom label added by the cluster administrator
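
Depending on the provider, GPU node pools may also carry taints so that only GPU workloads schedule onto them, in which case the pod needs a matching toleration alongside the nodeSelector. Below is a hedged sketch assuming an AKS node pool created with the commonly used sku=gpu:NoSchedule taint; GKE typically adds GPU tolerations automatically for pods that request nvidia.com/gpu.

# Sketch only: toleration for a GPU node pool tainted with sku=gpu:NoSchedule
tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"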
    

Cleanup

# Make sure to use the same namespace as in the previous steps
kubectl delete -f vllm-service.yaml -n vllm-example
kubectl delete -f vllm-deployment.yaml -n vllm-example
kubectl delete secret hf-secret -n vllm-example
kubectl delete namespace vllm-example

Further Reading / Next Steps