Tutorial: Autoscale Your vLLM Deployment with KEDA

Introduction

This tutorial shows you how to automatically scale a vLLM deployment using KEDA and Prometheus-based metrics. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.

Prerequisites

  • A working vLLM deployment on Kubernetes (see 01-minimal-helm-installation)
  • Access to a Kubernetes cluster with at least 2 GPUs
  • kubectl and helm installed
  • Basic understanding of Kubernetes and Prometheus metrics

Steps

1. Install the vLLM Production Stack

Install the production stack using a single pod by following the instructions in 02-basic-vllm-config.md.


2. Deploy the Observability Stack

The observability stack includes Prometheus, Grafana, and the necessary exporters. Install it from the repository's observability directory:

cd observability
bash install.sh

3. Install KEDA

kubectl create namespace keda
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda

4. Verify Metric Export

Check that Prometheus is scraping the queue-length metric vllm:num_requests_waiting. First, forward the Prometheus port to your machine:

kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090

In a separate terminal:

curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'

Example output:

{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "__name__": "vllm:num_requests_waiting",
          "pod": "vllm-llama3-deployment-vllm-xxxxx"
        },
        "value": [ 1749077215.034, "0" ]
      }
    ]
  }
}

This means that at the given timestamp, there were 0 pending requests in the queue.
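For scripted checks, the same response can be parsed programmatically. The following is a minimal Python sketch, not part of the tutorial assets: the helper name waiting_requests_by_pod is hypothetical, and it assumes the response has the shape shown above, with a pod label on each series.

```python
import json


def waiting_requests_by_pod(response_text: str) -> dict[str, float]:
    """Map each pod label to its latest sample from a Prometheus instant query."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('status')!r}")

    values = {}
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = series["value"]
        # Assumption: series carry a "pod" label, as in the example output.
        pod = series["metric"].get("pod", "unknown")
        values[pod] = float(raw_value)
    return values


# The example output from this step, used as sample input.
sample = """
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "__name__": "vllm:num_requests_waiting",
          "pod": "vllm-llama3-deployment-vllm-xxxxx"
        },
        "value": [ 1749077215.034, "0" ]
      }
    ]
  }
}
"""

print(waiting_requests_by_pod(sample))
```

With more than one replica running, the query returns one series per pod, so the dictionary holds one entry for each.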


5. Configure the ScaledObject

The following ScaledObject configuration is provided in tutorials/assets/values-19-keda.yaml. Review its contents:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 1
  maxReplicaCount: 2
  pollingInterval: 15
  cooldownPeriod: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: vllm:num_requests_waiting
        query: vllm:num_requests_waiting
        threshold: '5'

Apply the ScaledObject:

cd ../tutorials
kubectl apply -f assets/values-19-keda.yaml

This tells KEDA to:

  • Monitor vllm:num_requests_waiting
  • Scale between 1 and 2 replicas
  • Scale up when the average queue length per replica exceeds 5 requests
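KEDA delegates the actual scaling decision to a Horizontal Pod Autoscaler (HPA) that it creates for the ScaledObject. The sketch below illustrates how the replica count follows from the settings above, using the formula from the Kubernetes HPA documentation with the threshold treated as a per-replica average. The function name desired_replicas is hypothetical, the 10% tolerance is the Kubernetes default, and the sketch deliberately ignores stabilization windows, pod readiness, and missing metrics, so treat it as an approximation rather than the controller's exact behavior.

```python
import math


def desired_replicas(
    total_waiting: float,
    current_replicas: int,
    threshold: float = 5.0,   # threshold from the ScaledObject trigger
    min_replicas: int = 1,    # minReplicaCount
    max_replicas: int = 2,    # maxReplicaCount
    tolerance: float = 0.1,   # assumed: Kubernetes default HPA tolerance
) -> int:
    """Approximate the HPA replica calculation for an average-value target."""
    average = total_waiting / current_replicas
    ratio = average / threshold
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured in the ScaledObject.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(total_waiting=0, current_replicas=1))   # idle
print(desired_replicas(total_waiting=12, current_replicas=1))  # queue building up
print(desired_replicas(total_waiting=4, current_replicas=2))   # load draining
```

Under these assumptions, an idle deployment stays at 1 replica, a queue of 12 requests on one replica scales out to the maximum of 2, and a total of 4 waiting requests across 2 replicas allows scaling back to 1.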

6. Test Autoscaling

Watch the HPA that KEDA creates for the ScaledObject:

kubectl get hpa -n default -w

You should initially see:

NAME                         REFERENCE                                TARGETS     MINPODS   MAXPODS   REPLICAS
keda-hpa-vllm-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         2         1

TARGETS shows the current metric value against the target threshold. 0/5 (avg) means the average value of vllm:num_requests_waiting per replica is 0 and the threshold is 5.

Generate load:

kubectl port-forward svc/vllm-router-service 30080:80

In a separate terminal:

python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000

Within a few minutes, the REPLICAS value should increase to 2.


7. Cleanup

To remove the KEDA configuration and the observability components:

kubectl delete -f assets/values-19-keda.yaml
helm uninstall keda -n keda
kubectl delete namespace keda

cd ../observability
bash uninstall.sh

Additional Resources