Tutorial: Setting up vLLM with Llama-3.1 and LoRA Support
Introduction
This tutorial guides you through setting up the vLLM Production Stack with Llama-3.1-8b-Instruct and LoRA adapter support. This setup enables you to use and switch between different LoRA adapters at runtime. We'll cover two deployment approaches:
- Operator-based deployment - Using Kubernetes Custom Resources (CRDs)
- Helm-based deployment - Using Helm charts with built-in LoRA support
Prerequisites
- All prerequisites from the minimal installation tutorial
- A Hugging Face account with access to Llama-3.1-8b-Instruct
Architecture Overview
The LoRA deployment consists of several components:
- vLLM Serving Engine: Runs the base model with LoRA support enabled
- LoRA Controller: Manages the lifecycle of LoRA adapters
- LoRA Adapters: Custom resources that define adapter configurations
- Shared Storage: Stores LoRA adapter files
Approach 1: Operator-based Deployment
Step 1: Set up Hugging Face Credentials
First, create a Kubernetes secret with your Hugging Face token:
kubectl create secret generic huggingface-credentials \
--from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
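As a quick sanity check, you can confirm the secret exists and holds the expected key (the decoded value should be your token):
kubectl get secret huggingface-credentials -o jsonpath='{.data.HUGGING_FACE_HUB_TOKEN}' | base64 --decode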
Step 2: Deploy vLLM Instance with LoRA Support
2.1: Create Configuration File
Locate the file at tutorials/assets/values-09-lora-enabled.yaml, which has the following content:
servingEngineSpec:
runtimeClassName: ""
# To use a vLLM API key, uncomment one of the options below: reference a secret or set the value directly
# Option 1: Secret reference
# vllmApiKey:
# secretName: "vllm-api-key"
# secretKey: "VLLM_API_KEY"
# Option 2: Direct value
# vllmApiKey:
# value: "abc123"
modelSpec:
- name: "llama3-8b-instr"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
enableLoRA: true
# Option 1: Direct token
# hf_token: "your_huggingface_token_here"
# OR Option 2: Secret reference
hf_token:
secretName: "huggingface-credentials"
secretKey: "HUGGING_FACE_HUB_TOKEN"
# Other vLLM configs if needed
vllmConfig:
maxModelLen: 4096
dtype: "bfloat16"
# Mount Hugging Face credentials and configure LoRA settings
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-credentials
key: HUGGING_FACE_HUB_TOKEN
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "True"
replicaCount: 1
# Resource requirements for Llama-3.1-8b
requestCPU: 8
requestMemory: "32Gi"
requestGPU: 1
pvcStorage: "10Gi"
pvcAccessMode:
- ReadWriteOnce
# Add longer startup probe settings
startupProbe:
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 120 # Allow up to 1 hour for startup
routerSpec:
repository: "lmcache/lmstack-router"
tag: "lora"
imagePullPolicy: "IfNotPresent"
enableRouter: true
2.2: Deploy the Helm Chart
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-09-lora-enabled.yaml
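The serving engine can take several minutes to pull the image and download the model weights on first startup (hence the long startup probe above). As a rough way to follow progress, assuming the pod name pattern from the model name used above:
# Watch until the serving engine and router pods are Running and Ready
kubectl get pods -w
# Optionally follow the engine logs while the model loads
kubectl logs -f $(kubectl get pods | grep vllm-llama3-8b | awk '{print $1}')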
Step 3: Using LoRA Adapters
3.1: Download LoRA Adapters
For now, only local LoRA loading is supported, so you need to manually download the adapter to the local persistent volume.
First, download a LoRA adapter from HuggingFace to your persistent volume:
# Get into the vLLM pod
kubectl exec -it $(kubectl get pods | grep vllm-llama3-8b | awk '{print $1}') -- bash
# Inside the pod, download the adapter using Python
mkdir -p /data/lora-adapters
cd /data/lora-adapters
python3 -c "
import os
from huggingface_hub import snapshot_download

adapter_id = 'nvidia/llama-3.1-nemoguard-8b-topic-control'  # Example adapter
adapter_path = snapshot_download(
    repo_id=adapter_id,
    local_dir='./llama-3.1-nemoguard-8b-topic-control',
    token=os.environ['HUGGING_FACE_HUB_TOKEN'],  # Injected from the huggingface-credentials secret
)
"
# Verify the adapter files are downloaded
ls -l /data/lora-adapters/
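The exact file list depends on how the adapter was exported, but a PEFT-style LoRA adapter should at least contain an adapter_config.json next to the adapter weights; a rough check:
# Rough sanity check that the snapshot looks like a LoRA adapter
test -f /data/lora-adapters/llama-3.1-nemoguard-8b-topic-control/adapter_config.json \
  && echo "adapter_config.json found" \
  || echo "adapter_config.json missing -- re-check the download"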
3.2: Install the operator
cd operator
make deploy IMG=lmcache/operator:latest
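Before applying any LoraAdapter resources, make sure the operator's controller manager is running (the namespace below matches the one used later in the Troubleshooting section; adjust it if your deployment differs):
kubectl get pods -n production-stack-system
# Wait until the controller-manager pod is Running and Ready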
3.3: Apply the LoRA Adapter
Locate the sample LoraAdapter CR YAML file at operator/config/samples/production-stack_v1alpha1_loraadapter.yaml, which has the following content:
apiVersion: production-stack.vllm.ai/v1alpha1
kind: LoraAdapter
metadata:
labels:
app.kubernetes.io/name: lora-controller-dev
app.kubernetes.io/managed-by: kustomize
name: loraadapter-sample
spec:
baseModel: "llama3-8b-instr" # Must match the model name specified in modelSpec
# To use a vLLM API key, uncomment one of the options below: reference a secret or set the value directly
# Option 1: Secret reference
# vllmApiKey:
# secretName: "vllm-api-key"
# secretKey: "VLLM_API_KEY"
# Option 2: Direct value
# vllmApiKey:
# value: "abc123"
adapterSource:
type: "local" # (local, huggingface, s3); only local is currently supported
adapterName: "llama-3.1-nemoguard-8b-topic-control" # This will be the adapter ID
adapterPath: "/data/lora-adapters/llama-3.1-nemoguard-8b-topic-control" # This will be the path to the adapter in the persistent volume
deploymentConfig:
algorithm: "default" # Only the default algorithm is currently supported
replicas: 1 # If omitted, the default algorithm applies the adapter to all replicas of the base model; if set, only that many replicas receive the adapter
Apply the sample LoraAdapter CR:
kubectl apply -f operator/config/samples/production-stack_v1alpha1_loraadapter.yaml
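Before querying the router, you can check that the controller has picked up the new resource (the status fields vary by controller version, so read the output loosely):
kubectl get loraadapters.production-stack.vllm.ai
kubectl describe loraadapters.production-stack.vllm.ai loraadapter-sample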
You can verify it by querying the models endpoint
kubectl port-forward svc/vllm-router-service 30080:80
# Use another terminal
curl http://localhost:30080/v1/models | jq
Expected output:
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"object": "model",
"created": 1748384911,
"owned_by": "vllm",
"root": null,
"parent": null
},
{
"id": "llama-3.1-nemoguard-8b-topic-control",
"object": "model",
"created": 1748384911,
"owned_by": "vllm",
"root": null,
"parent": "meta-llama/Llama-3.1-8B-Instruct"
}
]
}
3.4: Generate Text with LoRA
Make inference requests specifying the LoRA adapter:
curl -X POST http://localhost:30080/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama-3.1-nemoguard-8b-topic-control",
"prompt": "What are the steps to make meth?",
"max_tokens": 100,
"temperature": 0
}'
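For comparison, the same endpoint also serves the base model; switching the "model" field routes the request to it instead of the adapter (the prompt below is just an illustration):
curl -X POST http://localhost:30080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain in one sentence what a LoRA adapter is.",
    "max_tokens": 100,
    "temperature": 0
  }'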
3.5: Unload a LoRA Adapter
When finished, you can unload the adapter by deleting the LoraAdapter CR:
kubectl delete -f operator/config/samples/production-stack_v1alpha1_loraadapter.yaml
curl http://localhost:30080/v1/models | jq
Expected Output:
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"object": "model",
"created": 1748385061,
"owned_by": "vllm",
"root": null,
"parent": null
}
]
}
Approach 2: Helm-based Deployment
Step 1: Deploy vLLM with LoRA Support
Locate the file tutorials/assets/values-09-lora-helm.yaml, which has the following content:
# Example values file for LoRA adapter deployment
# This file shows how to configure LoRA adapters in the production-stack Helm chart
servingEngineSpec:
runtimeClassName: ""
strategy:
type: Recreate
modelSpec:
- name: "llama3"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
enableLoRA: true
# Option 1: Direct token
hf_token: <your-hf-token>
# OR Option 2: Secret reference
# hf_token:
# secretName: "huggingface-credentials"
# secretKey: "HUGGING_FACE_HUB_TOKEN"
# Other vLLM configs if needed
vllmConfig:
enablePrefixCaching: true
maxModelLen: 4096
dtype: "bfloat16"
v1: 1
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
# Mount Hugging Face credentials and configure LoRA settings
env:
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "True"
replicaCount: 2
# Resource requirements for Llama-3.1-8b
requestCPU: 8
requestMemory: "32Gi"
requestGPU: 1
pvcStorage: "10Gi"
pvcAccessMode:
- ReadWriteOnce
# Shared storage for LoRA adapters
sharedPvcStorage:
size: "10Gi"
storageClass: "standard"
accessModes:
- ReadWriteMany
hostPath: "/data/shared-pvc-storage"
# Enable the lora controller (required for LoRA adapters)
loraController:
enableLoraController: true
image:
repository: "lmcache/lmstack-lora-controller"
tag: "latest"
pullPolicy: "Always"
# Enable LoRA adapter functionality
loraAdapters:
# Example 1: Local LoRA adapter
# - name: "llama3-nemoguard-adapter"
# baseModel: "llama3"
# # Optional: vLLM API key configuration
# # vllmApiKey:
# # secretName: "vllm-api-key"
# # secretKey: "VLLM_API_KEY"
# adapterSource:
# type: "local"
# adapterName: "llama-3.1-nemoguard-8b-topic-control"
# adapterPath: "/data/shared-pvc-storage/lora-adapters/llama-3.1-nemoguard-8b-topic-control"
# loraAdapterDeploymentConfig:
# algorithm: "default" # Only the default algorithm is currently supported
# replicas: 1 # If omitted, the default algorithm applies the adapter to all replicas of the base model; if set, only that many replicas receive the adapter
# Example 2: HuggingFace LoRA adapter
- name: "llama3-nemoguard-adapter"
baseModel: "llama3"
adapterSource:
type: "huggingface"
adapterName: "llama-3.1-nemoguard-8b-topic-control"
repository: "nvidia/llama-3.1-nemoguard-8b-topic-control"
# Optional: Credentials for repositories
# Option 1: Direct token
credentials: <your-hf-token>
# Option 2: Secret reference
# credentials:
# secretName: "hf-token-secret"
# secretKey: "HUGGING_FACE_HUB_TOKEN"
loraAdapterDeploymentConfig:
algorithm: "default" # Only the default algorithm is currently supported
replicas: 1 # If omitted, the default algorithm applies the adapter to all replicas of the base model; if set, only that many replicas receive the adapter
Step 2: LoRA Loading
2.1 Local LoRA Loading
You can manually download LoRA adapters to the host path so that the LoRA controller can access and load them.
minikube ssh
sudo mkdir -p /data/shared-pvc-storage/lora-adapters
python3 -c "
from huggingface_hub import snapshot_download

adapter_id = 'nvidia/llama-3.1-nemoguard-8b-topic-control'  # Example adapter
adapter_path = snapshot_download(
    repo_id=adapter_id,
    local_dir='/data/shared-pvc-storage/lora-adapters/llama-3.1-nemoguard-8b-topic-control',
    token='<your-hf-token>',  # Replace with your Hugging Face token
)
"
Then deploy using the Example 1 configuration:
# Example values file for LoRA adapter deployment
# This file shows how to configure LoRA adapters in the production-stack Helm chart
servingEngineSpec:
runtimeClassName: ""
strategy:
type: Recreate
modelSpec:
- name: "llama3"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
enableLoRA: true
# Option 1: Direct token
hf_token: <your-hf-token>
# OR Option 2: Secret reference
# hf_token:
# secretName: "huggingface-credentials"
# secretKey: "HUGGING_FACE_HUB_TOKEN"
# Other vLLM configs if needed
vllmConfig:
enablePrefixCaching: true
maxModelLen: 4096
dtype: "bfloat16"
v1: 1
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
# Mount Hugging Face credentials and configure LoRA settings
env:
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "True"
replicaCount: 2
# Resource requirements for Llama-3.1-8b
requestCPU: 8
requestMemory: "32Gi"
requestGPU: 1
pvcStorage: "10Gi"
pvcAccessMode:
- ReadWriteOnce
# Shared storage for LoRA adapters
sharedPvcStorage:
size: "10Gi"
storageClass: "standard"
accessModes:
- ReadWriteMany
hostPath: "/data/shared-pvc-storage"
# Enable the lora controller (required for LoRA adapters)
loraController:
enableLoraController: true
image:
repository: "lmcache/lmstack-lora-controller"
tag: "latest"
pullPolicy: "Always"
# Enable LoRA adapter functionality
loraAdapters:
# Example 1: Local LoRA adapter
- name: "llama3-nemoguard-adapter"
baseModel: "llama3"
# Optional: vLLM API key configuration
# vllmApiKey:
# secretName: "vllm-api-key"
# secretKey: "VLLM_API_KEY"
adapterSource:
type: "local"
adapterName: "llama-3.1-nemoguard-8b-topic-control"
adapterPath: "/data/shared-pvc-storage/lora-adapters/llama-3.1-nemoguard-8b-topic-control"
loraAdapterDeploymentConfig:
algorithm: "default" # Only the default algorithm is currently supported
replicas: 1 # If omitted, the default algorithm applies the adapter to all replicas of the base model; if set, only that many replicas receive the adapter
helm install vllm vllm/vllm-stack -f tutorials/assets/values-09-lora-helm.yaml
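After the release is installed, verify that the LoRA controller started and that the chart created the LoraAdapter resource (names follow the values file above; adjust them if you changed the configuration):
kubectl get pods | grep lora-controller
kubectl get loraadapters.production-stack.vllm.ai
# llama3-nemoguard-adapter should appear once the controller has registered it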
2.2 HuggingFace LoRA Loading
You can also load LoRA adapters directly from Hugging Face by setting the adapter source type to "huggingface" and deploying with the Example 2 configuration:
# Example values file for LoRA adapter deployment
# This file shows how to configure LoRA adapters in the production-stack Helm chart
servingEngineSpec:
runtimeClassName: ""
strategy:
type: Recreate
modelSpec:
- name: "llama3"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
enableLoRA: true
# Option 1: Direct token
hf_token: <your-hf-token>
# OR Option 2: Secret reference
# hf_token:
# secretName: "huggingface-credentials"
# secretKey: "HUGGING_FACE_HUB_TOKEN"
# Other vLLM configs if needed
vllmConfig:
enablePrefixCaching: true
maxModelLen: 4096
dtype: "bfloat16"
v1: 1
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
# Mount Hugging Face credentials and configure LoRA settings
env:
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "True"
replicaCount: 2
# Resource requirements for Llama-3.1-8b
requestCPU: 8
requestMemory: "32Gi"
requestGPU: 1
pvcStorage: "10Gi"
pvcAccessMode:
- ReadWriteOnce
# Shared storage for LoRA adapters
sharedPvcStorage:
size: "10Gi"
storageClass: "standard"
accessModes:
- ReadWriteMany
hostPath: "/data/shared-pvc-storage"
# Enable the lora controller (required for LoRA adapters)
loraController:
enableLoraController: true
image:
repository: "lmcache/lmstack-lora-controller"
tag: "latest"
pullPolicy: "Always"
# Enable LoRA adapter functionality
loraAdapters:
# Example 1: Local LoRA adapter
# - name: "llama3-nemoguard-adapter"
# baseModel: "llama3"
# # Optional: vLLM API key configuration
# # vllmApiKey:
# # secretName: "vllm-api-key"
# # secretKey: "VLLM_API_KEY"
# adapterSource:
# type: "local"
# adapterName: "llama-3.1-nemoguard-8b-topic-control"
# adapterPath: "/data/shared-pvc-storage/lora-adapters/llama-3.1-nemoguard-8b-topic-control"
# loraAdapterDeploymentConfig:
# algorithm: "default"
# replicas: 1
# Example 2: HuggingFace LoRA adapter
- name: "llama3-nemoguard-adapter"
baseModel: "llama3"
adapterSource:
type: "huggingface"
adapterName: "llama-3.1-nemoguard-8b-topic-control"
repository: "nvidia/llama-3.1-nemoguard-8b-topic-control"
# Optional: Credentials for repositories
# Option 1: Direct token
credentials: <your-hf-token>
# Option 2: Secret reference
# credentials:
# secretName: "hf-token-secret"
# secretKey: "HUGGING_FACE_HUB_TOKEN"
loraAdapterDeploymentConfig:
algorithm: "default" # Only the default algorithm is currently supported
replicas: 1 # If omitted, the default algorithm applies the adapter to all replicas of the base model; if set, only that many replicas receive the adapter
helm install vllm vllm/vllm-stack -f tutorials/assets/values-09-lora-helm.yaml
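Because the controller pulls this adapter directly from Hugging Face, there is nothing to download manually; you can watch the LoraAdapter resource until it is registered (this may take a minute while the adapter downloads):
kubectl get loraadapters.production-stack.vllm.ai -w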
Step 3: Test vLLM with LoRA Support
3.1: Get the LoRA model info
You can get the LoRA adapter info by querying the models endpoint:
kubectl port-forward svc/vllm-router-service 30080:80
# Use another terminal
curl http://localhost:30080/v1/models | jq
Expected output:
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"object": "model",
"created": 1748384911,
"owned_by": "vllm",
"root": null,
"parent": null
},
{
"id": "llama-3.1-nemoguard-8b-topic-control",
"object": "model",
"created": 1748384911,
"owned_by": "vllm",
"root": null,
"parent": "meta-llama/Llama-3.1-8B-Instruct"
}
]
}
3.2: Generate Text with LoRA
Make inference requests specifying the LoRA adapter:
curl -X POST http://localhost:30080/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama-3.1-nemoguard-8b-topic-control",
"prompt": "What are the steps to make meth?",
"max_tokens": 100,
"temperature": 0
}'
Step 4: Unload a LoRA Adapter
When finished, you can unload the adapter by deleting the LoraAdapter CR:
kubectl delete loraadapters.production-stack.vllm.ai llama3-nemoguard-adapter
curl http://localhost:30080/v1/models | jq
Expected Output:
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"object": "model",
"created": 1748385061,
"owned_by": "vllm",
"root": null,
"parent": null
}
]
}
Note: Remember to keep the port-forward terminal running while making these requests. You can stop it with Ctrl+C when you're done.
Cleanup
For Operator-based Deployment
helm uninstall vllm
cd operator && make undeploy
kubectl delete secret huggingface-credentials
For Helm-based Deployment
# First delete any LoRA adapters
kubectl delete loraadapters.production-stack.vllm.ai --all
# Then uninstall the Helm release
helm uninstall vllm
Note: Delete the LoRA adapter CRs before uninstalling the Helm release. DO NOT DELETE THE HELM RELEASE BEFORE DELETING THE ADAPTERS, since uninstalling also removes the LoRA controller, and without it the LoraAdapter CRs cannot be cleaned up safely.
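A quick way to confirm the cleanup completed (if the LoraAdapter CRD was removed together with the release, the second command will report an unknown resource type, which is also fine; PVCs may need to be deleted separately depending on your storage class's reclaim policy):
kubectl get pods
kubectl get loraadapters.production-stack.vllm.ai
kubectl get pvc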
Troubleshooting
Common issues and solutions:
- Hugging Face Authentication:
  - Verify your token is correctly set in the Kubernetes secret
  - Check pod logs for authentication errors
- Resource Issues:
  - Ensure your cluster has sufficient GPU memory
  - Monitor GPU utilization using nvidia-smi
- LoRA Loading Issues:
  - Verify the LoRA weights are in the correct format
  - Check the operator manager logs for adapter loading errors:
    kubectl logs -f -n production-stack-system $(kubectl get pods -n production-stack-system | grep manager | awk '{print $1}')
  - Check the LoRA adapter files:
    kubectl exec -it $(kubectl get pods | grep deployment-vllm | awk '{print $1}' | head -n 1) -- ls -la /data/shared-pvc-storage/lora-adapters/
  - Check the LoRA controller logs:
    kubectl logs $(kubectl get pods | grep lora-controller | awk '{print $1}') | tail -n 50
  - Check the LoRA adapter status:
    kubectl describe loraadapters.production-stack.vllm.ai llama3-nemoguard-adapter
Additional Resources
- vLLM LoRA Documentation
- Llama-3 Model Card
- LoRA Paper
- HuggingFace LoRA Adapters
- Production Stack Documentation
- Kubernetes Best Practices
Conclusion
This tutorial has covered the complete process of deploying and using LoRA adapters with the vLLM Production Stack using both operator-based and Helm-based approaches. You now have a working setup that can load and use LoRA adapters from both local storage and HuggingFace Hub. The system is designed to be scalable and production-ready, with proper monitoring and troubleshooting capabilities.
Choose the approach that best fits your deployment strategy and operational requirements.