Tutorial: Setting up vLLM with Tool Calling Support

Introduction

This tutorial guides you through setting up the vLLM Production Stack with tool calling support using the Llama-3.1-8B-Instruct model. This setup enables your model to interact with external tools and functions through a structured interface.

Prerequisites

  1. All prerequisites from the minimal installation tutorial
  2. A Hugging Face account with access to Llama-3.1-8B-Instruct
  3. Accepted terms for meta-llama/Llama-3.1-8B-Instruct on Hugging Face
  4. A valid Hugging Face token
  5. Python 3.7+ installed on your local machine
  6. The openai Python package installed (pip install openai)
  7. Access to a Kubernetes cluster with storage provisioner support
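
Before starting, you can sanity-check the local tooling and cluster access with a few standard commands (adjust to your environment):

# Check the local Python interpreter and the openai package
python3 --version
pip install openai
python3 -c "import openai; print(openai.__version__)"

# Confirm cluster access and that a storage provisioner / StorageClass is available
kubectl cluster-info
kubectl get storageclass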

Steps

1. Set up vLLM Templates and Storage

First, run the setup script to download templates and create the necessary Kubernetes resources:

# Make the script executable
chmod +x scripts/setup_vllm_templates.sh

# Run the setup script
./scripts/setup_vllm_templates.sh

This script will:

  1. Download the required templates from the vLLM repository
  2. Create a PersistentVolume for storing the templates
  3. Create a PersistentVolumeClaim for accessing the templates
  4. Verify the setup is complete

The script uses consistent naming that matches the deployment configuration:

  • PersistentVolume: vllm-templates-pv
  • PersistentVolumeClaim: vllm-templates-pvc
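
For reference, the resources created by the script roughly correspond to the manifests below. This is only a sketch assuming hostPath-backed storage; the capacity, path, and access mode used by the actual script may differ:

# Illustrative PersistentVolume / PersistentVolumeClaim pair (not the script's exact output)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-templates-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/vllm-templates
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-templates-pvc
spec:
  storageClassName: ""
  volumeName: vllm-templates-pv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi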

2. Set up Hugging Face Credentials

Create a Kubernetes secret with your Hugging Face token:

kubectl create secret generic huggingface-credentials \
  --from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
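
You can confirm the secret was created and contains the expected key (the token value itself is stored base64-encoded and is not printed):

# Verify the secret and its key
kubectl get secret huggingface-credentials
kubectl describe secret huggingface-credentials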

3. Deploy vLLM Instance with Tool Calling Support

3.1: Use the Example Configuration

We'll use the example configuration file located at tutorials/assets/values-08-tool-enabled.yaml. This file contains all the necessary settings for enabling tool calling:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3-8b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"

    # Tool calling configuration
    enableTool: true
    toolCallParser: "llama3_json"  # Parser to use for tool calls (e.g., "llama3_json" for Llama models)
    chatTemplate: "tool_chat_template_llama3.1_json.jinja"  # Template file name (will be mounted at /vllm/templates)

    # Mount Hugging Face credentials
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: huggingface-credentials
            key: HUGGING_FACE_HUB_TOKEN

    replicaCount: 1

    # Resource requirements for Llama-3.1-8B-Instruct
    requestCPU: 8
    requestMemory: "32Gi"
    requestGPU: 1

Note: The tool calling configuration comes down to three settings:

  • enableTool: true enables the feature
  • toolCallParser: specifies how the model's tool calls are parsed (using "llama3_json" for Llama-3 models)
  • chatTemplate: specifies the template file name (will be mounted at /vllm/templates/)
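
Under the hood, these values typically translate into vLLM's own tool-calling flags. As a rough illustration (the exact command line is assembled by the Helm chart and may differ), the serving container ends up started with something along the lines of:

# Approximate vLLM invocation implied by the configuration above (illustrative only)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template /vllm/templates/tool_chat_template_llama3.1_json.jinja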

The chat templates are managed through a PersistentVolume that we created in step 1, which provides several benefits:

  • Templates are downloaded once and stored persistently
  • Templates can be shared across multiple deployments
  • Templates can be updated by updating the files in the PersistentVolume
  • Templates are version controlled with the vLLM repository

3.2: Deploy the Helm Chart

# Add the vLLM Helm repository if you haven't already
helm repo add vllm https://vllm-project.github.io/production-stack

# Deploy the vLLM stack with tool calling support using the example configuration
helm install vllm-tool vllm/vllm-stack -f tutorials/assets/values-08-tool-enabled.yaml

The deployment will:

  1. Use the PersistentVolume we created in step 1 to access the templates
  2. Mount the templates at /vllm/templates in the container
  3. Configure the model to use the specified template for tool calling

You can verify the deployment with:

# Check the deployment status
kubectl get deployments

# Check the pods
kubectl get pods

# Check the logs
kubectl logs -f deployment/vllm-tool-llama3-8b-deployment-vllm
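
If you want to double-check that the templates are mounted where the configuration expects them, list the directory inside the running container:

# The chat template named in the configuration should appear here
kubectl exec deployment/vllm-tool-llama3-8b-deployment-vllm -- ls /vllm/templates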

4. Test Tool Calling Setup

Now that the deployment is running, let's test the tool calling functionality using the example script.

4.1: Port Forward the Router Service

First, we need to set up port forwarding to access the router service:

# Get the service name
kubectl get svc

# Set up port forwarding to the router service
kubectl port-forward svc/vllm-tool-router-service 8000:80
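
With the port-forward running, you can confirm the OpenAI-compatible API is reachable by listing the served models (the response should include the Llama model configured above):

# List the models exposed through the router
curl http://localhost:8000/v1/models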

4.2: Run the Example Script

In a new terminal, run the example script to test tool calling:

# Navigate to the examples directory
cd src/examples

# Run the example script
python tool_calling_example.py

The script will:

  1. Connect to the vLLM service through the port-forwarded endpoint
  2. Send a test query asking about the weather
  3. Demonstrate the model's ability to:
    • Understand the available tools
    • Make appropriate tool calls
    • Process the tool responses
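
The repository ships the actual script; the sketch below only illustrates the general shape of such a client, assuming the port-forwarded endpoint from step 4.1. The tool definition and model identifier are illustrative and may differ from the shipped example:

# Illustrative tool-calling client (not the shipped tool_calling_example.py)
import json
from openai import OpenAI

# Point the client at the port-forwarded router; the API key is required by the SDK but unused here
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Hypothetical weather tool the model may decide to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # use the name reported by /v1/models
    messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
    tools=tools,
)

# Print any tool calls the model chose to make
for call in response.choices[0].message.tool_calls or []:
    print("Function called:", call.function.name)
    print("Arguments:", json.loads(call.function.arguments))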

Expected output should look something like:

Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "celsius"}
Result: Getting the weather for San Francisco, CA in celsius...

This confirms that:

  1. The vLLM service is running correctly
  2. Tool calling is properly enabled
  3. The model can understand and use the defined tools
  4. The template system is working as expected

Note: The example uses a mock weather function for demonstration. In a real application, you would replace this with actual API calls to weather services.
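
As an illustration, the mock could be swapped for a real handler along these lines (the weather API endpoint, query parameters, and response fields below are hypothetical placeholders):

# Hypothetical real implementation of the weather tool (endpoint and schema are placeholders)
import requests

def get_weather(location: str, unit: str = "celsius") -> str:
    # Call an external weather service of your choice; this URL is illustrative only
    resp = requests.get(
        "https://api.example-weather.com/v1/current",
        params={"q": location, "units": unit},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return f"{location}: {data['temperature']} degrees {unit}"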