# Tutorial: Setting up vLLM with Tool Calling Support

## Introduction

This tutorial guides you through setting up the vLLM Production Stack with tool calling support using the Llama-3.1-8B-Instruct model. This setup enables your model to interact with external tools and functions through a structured interface.

## Prerequisites

1. All prerequisites from the [minimal installation tutorial](01-minimal-helm-installation.md)
2. A Hugging Face account with access to Llama-3.1-8B-Instruct
3. Accepted terms for meta-llama/Llama-3.1-8B-Instruct on Hugging Face
4. A valid Hugging Face token
5. Python 3.7+ installed on your local machine
6. The `openai` Python package installed (`pip install openai`)
7. Access to a Kubernetes cluster with storage provisioner support

## Steps

### 1. Set up vLLM Templates and Storage

First, run the setup script to download templates and create the necessary Kubernetes resources:

```bash
# Make the script executable
chmod +x scripts/setup_vllm_templates.sh

# Run the setup script
./scripts/setup_vllm_templates.sh
```

This script will:

1. Download the required templates from the vLLM repository
2. Create a PersistentVolume for storing the templates
3. Create a PersistentVolumeClaim for accessing the templates
4. Verify the setup is complete

The script uses consistent naming that matches the deployment configuration:

- PersistentVolume: `vllm-templates-pv`
- PersistentVolumeClaim: `vllm-templates-pvc`
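For orientation, the storage objects the script creates look roughly like the manifests below. This is only a sketch, not the script's exact output: the capacity, access mode, storage class, and host path are illustrative assumptions; only the resource names come from the list above. You do not need to apply these yourself — the script does it for you.

```yaml
# Illustrative sketch only -- the setup script's real manifests may differ.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-templates-pv            # name the deployment expects
spec:
  capacity:
    storage: 1Gi                     # assumed size
  accessModes:
    - ReadWriteOnce                  # assumed access mode
  storageClassName: standard         # assumed; must match your cluster's storage class
  hostPath:
    path: /data/vllm-templates       # assumed location of the downloaded templates
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-templates-pvc           # claim referenced by the deployment
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
  volumeName: vllm-templates-pv      # bind explicitly to the pre-created volume
```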
### 2. Set up Hugging Face Credentials

Create a Kubernetes secret with your Hugging Face token:

```bash
kubectl create secret generic huggingface-credentials \
  --from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
```

### 3. Deploy vLLM Instance with Tool Calling Support

#### 3.1: Use the Example Configuration

We'll use the example configuration file located at `tutorials/assets/values-08-tool-enabled.yaml`. This file contains all the necessary settings for enabling tool calling:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3-8b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"

    # Tool calling configuration
    enableTool: true
    toolCallParser: "llama3_json"  # Parser to use for tool calls (e.g., "llama3_json" for Llama models)
    chatTemplate: "tool_chat_template_llama3.1_json.jinja"  # Template file name (will be mounted at /vllm/templates)

    # Mount Hugging Face credentials
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: huggingface-credentials
            key: HUGGING_FACE_HUB_TOKEN

    replicaCount: 1

    # Resource requirements for Llama-3.1-8B-Instruct
    requestCPU: 8
    requestMemory: "32Gi"
    requestGPU: 1
```

> **Note**: The tool calling configuration is now simplified:
>
> - `enableTool: true` enables the feature
> - `toolCallParser`: specifies how the model's tool calls are parsed (using "llama3_json" for Llama-3 models)
> - `chatTemplate`: specifies the template file name (will be mounted at `/vllm/templates/`)
>
> The chat templates are managed through a PersistentVolume that we created in step 1, which provides several benefits:
>
> - Templates are downloaded once and stored persistently
> - Templates can be shared across multiple deployments
> - Templates can be updated by updating the files in the PersistentVolume
> - Templates are version controlled with the vLLM repository

#### 3.2: Deploy the Helm Chart

```bash
# Add the vLLM Helm repository if you haven't already
helm repo add vllm https://vllm-project.github.io/production-stack

# Deploy the vLLM stack with tool calling support using the example configuration
helm install vllm-tool vllm/vllm-stack -f tutorials/assets/values-08-tool-enabled.yaml
```

The deployment will:

1. Use the PersistentVolume we created in step 1 to access the templates
2. Mount the templates at `/vllm/templates` in the container
3. Configure the model to use the specified template for tool calling

You can verify the deployment with:

```bash
# Check the deployment status
kubectl get deployments

# Check the pods
kubectl get pods

# Check the logs
kubectl logs -f deployment/vllm-tool-llama3-8b-deployment-vllm
```

### 4. Test Tool Calling Setup

Now that the deployment is running, let's test the tool calling functionality using the example script.

#### 4.1: Port Forward the Router Service

First, we need to set up port forwarding to access the router service:

```bash
# Get the service name
kubectl get svc

# Set up port forwarding to the router service
kubectl port-forward svc/vllm-tool-router-service 8000:80
```

#### 4.2: Run the Example Script

In a new terminal, run the example script to test tool calling:

```bash
# Navigate to the examples directory
cd src/examples

# Run the example script
python tool_calling_example.py
```

The script will:

1. Connect to the vLLM service through the port-forwarded endpoint
2. Send a test query asking about the weather
3. Demonstrate the model's ability to:
   - Understand the available tools
   - Make appropriate tool calls
   - Process the tool responses

Expected output should look something like:

```text
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "celsius"}
Result: Getting the weather for San Francisco, CA in celsius...
```

This confirms that:

1. The vLLM service is running correctly
2. Tool calling is properly enabled
3. The model can understand and use the defined tools
4. The template system is working as expected
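You can also exercise tool calling directly against the OpenAI-compatible API through the same port-forward, without the Python example. The request below is a sketch: the model name is assumed to match the `modelURL` from the values file, and the `get_weather` tool schema is illustrative — check `GET /v1/models` on your deployment if the model name differs.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco in celsius?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
              "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```

A successful response includes a `tool_calls` entry naming `get_weather` with JSON arguments, mirroring the output shown above.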
> **Note**: The example uses a mock weather function for demonstration. In a real application, you would replace this with actual API calls to weather services.
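For reference, a client along these lines can be written with just the `openai` package. The sketch below is not the shipped `src/examples/tool_calling_example.py`; the tool schema, model name, and mock `get_weather` handler are illustrative assumptions, but the request/response flow matches what the example demonstrates.

```python
# Minimal tool-calling client sketch (assumes the router is port-forwarded to localhost:8000).
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Illustrative tool definition; the real example script may define its tools differently.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]


def get_weather(location: str, unit: str = "celsius") -> str:
    # Mock implementation; a real application would call a weather API here.
    return f"Getting the weather for {location} in {unit}..."


response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed; check GET /v1/models for your deployment
    messages=[{"role": "user", "content": "What is the weather in San Francisco in celsius?"}],
    tools=tools,
    tool_choice="auto",
)

# Assumes the model chose to call the tool; a direct text answer would leave tool_calls empty.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Function called: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
print(f"Result: {get_weather(**args)}")
```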