Tutorial: Setting up vLLM with Tool Calling Support
Introduction
This tutorial guides you through setting up the vLLM Production Stack with tool calling support using the Llama-3.1-8B-Instruct model. This setup enables your model to interact with external tools and functions through a structured interface.
Prerequisites
- All prerequisites from the minimal installation tutorial
- A Hugging Face account with access to Llama-3.1-8B-Instruct
- Accepted terms for meta-llama/Llama-3.1-8B-Instruct on Hugging Face
- A valid Hugging Face token
- Python 3.7+ installed on your local machine
- The `openai` Python package installed (`pip install openai`)
- Access to a Kubernetes cluster with storage provisioner support
Steps
1. Set up vLLM Templates and Storage
First, run the setup script to download templates and create the necessary Kubernetes resources:
```bash
# Make the script executable
chmod +x scripts/setup_vllm_templates.sh

# Run the setup script
./scripts/setup_vllm_templates.sh
```
This script will:
- Download the required templates from the vLLM repository
- Create a PersistentVolume for storing the templates
- Create a PersistentVolumeClaim for accessing the templates
- Verify the setup is complete
The script uses consistent naming that matches the deployment configuration:
- PersistentVolume: `vllm-templates-pv`
- PersistentVolumeClaim: `vllm-templates-pvc`
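
If you want to double-check the storage setup yourself, the standard kubectl queries below should show both resources in the `Bound` state (a quick sanity check, using the names listed above):

```bash
# Confirm the PersistentVolume and PersistentVolumeClaim exist and are Bound
kubectl get pv vllm-templates-pv
kubectl get pvc vllm-templates-pvc
```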
2. Set up Hugging Face Credentials
Create a Kubernetes secret with your Hugging Face token:
```bash
kubectl create secret generic huggingface-credentials \
  --from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
```
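
You can confirm the secret was created, without printing the token itself, with:

```bash
# List the secret and inspect its keys; the value stays base64-encoded
kubectl get secret huggingface-credentials
kubectl describe secret huggingface-credentials
```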
3. Deploy vLLM Instance with Tool Calling Support
3.1: Use the Example Configuration
We'll use the example configuration file located at tutorials/assets/values-08-tool-enabled.yaml. This file contains all the necessary settings for enabling tool calling:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3-8b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"

      # Tool calling configuration
      enableTool: true
      toolCallParser: "llama3_json"  # Parser to use for tool calls (e.g., "llama3_json" for Llama models)
      chatTemplate: "tool_chat_template_llama3.1_json.jinja"  # Template file name (will be mounted at /vllm/templates)

      # Mount Hugging Face credentials
      env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-credentials
              key: HUGGING_FACE_HUB_TOKEN

      replicaCount: 1

      # Resource requirements for Llama-3.1-8B-Instruct
      requestCPU: 8
      requestMemory: "32Gi"
      requestGPU: 1
```
Note: The tool calling configuration is now simplified:
- `enableTool: true` enables the feature
- `toolCallParser` specifies how the model's tool calls are parsed (`"llama3_json"` for Llama-3 models)
- `chatTemplate` specifies the template file name (will be mounted at `/vllm/templates/`)
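
Under the hood, these values correspond to vLLM's OpenAI-compatible server options for tool calling. For orientation only, a roughly equivalent standalone invocation is sketched below; the exact command the chart generates may differ:

```bash
# Approximate vLLM server flags implied by the values above (illustrative, not the chart's exact command)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template /vllm/templates/tool_chat_template_llama3.1_json.jinja
```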
The chat templates are managed through a PersistentVolume that we created in step 1, which provides several benefits:
- Templates are downloaded once and stored persistently
- Templates can be shared across multiple deployments
- Templates can be updated by updating the files in the PersistentVolume
- Templates are version controlled with the vLLM repository
3.2: Deploy the Helm Chart
```bash
# Add the vLLM Helm repository if you haven't already
helm repo add vllm https://vllm-project.github.io/production-stack

# Deploy the vLLM stack with tool calling support using the example configuration
helm install vllm-tool vllm/vllm-stack -f tutorials/assets/values-08-tool-enabled.yaml
```
The deployment will:
- Use the PersistentVolume we created in step 1 to access the templates
- Mount the templates at `/vllm/templates` in the container
- Configure the model to use the specified template for tool calling
You can verify the deployment with:
```bash
# Check the deployment status
kubectl get deployments

# Check the pods
kubectl get pods

# Check the logs
kubectl logs -f deployment/vllm-tool-llama3-8b-deployment-vllm
```
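
If the pod takes a while to become ready (downloading the model alone can take several minutes), you can wait on the rollout and then confirm the template file is mounted where the configuration expects it, reusing the deployment name from the logs command above:

```bash
# Wait for the deployment to finish rolling out
kubectl rollout status deployment/vllm-tool-llama3-8b-deployment-vllm

# Confirm the chat template is mounted inside the container
kubectl exec deploy/vllm-tool-llama3-8b-deployment-vllm -- ls /vllm/templates
```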
4. Test Tool Calling Setup
Now that the deployment is running, let's test the tool calling functionality using the example script.
4.1: Port Forward the Router Service
First, we need to set up port forwarding to access the router service:
```bash
# Get the service name
kubectl get svc

# Set up port forwarding to the router service
kubectl port-forward svc/vllm-tool-router-service 8000:80
```
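
With the port-forward active, a quick way to confirm the OpenAI-compatible API is reachable is to list the served models; the response should include meta-llama/Llama-3.1-8B-Instruct:

```bash
# Should return a JSON list of served models
curl http://localhost:8000/v1/models
```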
4.2: Run the Example Script
In a new terminal, run the example script to test tool calling:
```bash
# Navigate to the examples directory
cd src/examples

# Run the example script
python tool_calling_example.py
```
The script will:
- Connect to the vLLM service through the port-forwarded endpoint
- Send a test query asking about the weather
- Demonstrate the model's ability to:
- Understand the available tools
- Make appropriate tool calls
- Process the tool responses
Expected output should look something like:
```
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "celsius"}
Result: Getting the weather for San Francisco, CA in celsius...
```
This confirms that:
- The vLLM service is running correctly
- Tool calling is properly enabled
- The model can understand and use the defined tools
- The template system is working as expected
Note: The example uses a mock weather function for demonstration. In a real application, you would replace this with actual API calls to weather services.
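
For orientation, a minimal client along the lines of the example script might look like the sketch below. It assumes the port-forwarded endpoint at http://localhost:8000/v1 and defines a hypothetical get_weather tool; the actual script in src/examples may differ in details such as the tool schema and how tool results are fed back to the model.

```python
from openai import OpenAI

# Point the OpenAI client at the port-forwarded vLLM router (no real API key needed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# A hypothetical tool definition in the OpenAI function-calling schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

# Ask a question that should trigger a tool call
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decided to call a tool, print the call it made
for call in response.choices[0].message.tool_calls or []:
    print(f"Function called: {call.function.name}")
    print(f"Arguments: {call.function.arguments}")
```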