# Tutorial: Setting up vLLM with Tool Calling Support

## Introduction

This tutorial guides you through setting up the vLLM Production Stack with tool calling support using the Llama-3.1-8B-Instruct model. This setup enables your model to interact with external tools and functions through a structured interface.

## Prerequisites

1. All prerequisites from the [minimal installation tutorial](01-minimal-helm-installation.md)
2. A Hugging Face account with access to Llama-3.1-8B-Instruct
3. Accepted terms for meta-llama/Llama-3.1-8B-Instruct on Hugging Face
4. A valid Hugging Face token
5. Python 3.7+ installed on your local machine
6. The `openai` Python package installed (`pip install openai`)
7. Access to a Kubernetes cluster with storage provisioner support

## Steps

### 1. Set up vLLM Templates and Storage

First, run the setup script to download templates and create the necessary Kubernetes resources:

```bash
# Make the script executable
chmod +x scripts/setup_vllm_templates.sh

# Run the setup script
./scripts/setup_vllm_templates.sh
```

This script will:

1. Download the required templates from the vLLM repository
2. Create a PersistentVolume for storing the templates
3. Create a PersistentVolumeClaim for accessing the templates
4. Verify the setup is complete

The script uses consistent naming that matches the deployment configuration:

- PersistentVolume: `vllm-templates-pv`
- PersistentVolumeClaim: `vllm-templates-pvc`
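For orientation, the storage objects the script creates look roughly like the manifests below. This is only a sketch, not the script's exact output: the capacity, access mode, storage class, and host path are illustrative assumptions; only the resource names come from the list above. You do not need to apply these yourself — the script does it for you.

```yaml
# Illustrative sketch only -- the setup script's real manifests may differ.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-templates-pv            # name the deployment expects
spec:
  capacity:
    storage: 1Gi                     # assumed size
  accessModes:
    - ReadWriteOnce                  # assumed access mode
  storageClassName: standard         # assumed; must match your cluster's storage class
  hostPath:
    path: /data/vllm-templates       # assumed location of the downloaded templates
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-templates-pvc           # claim referenced by the deployment
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
  volumeName: vllm-templates-pv      # bind explicitly to the pre-created volume
```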
### 2. Set up Hugging Face Credentials

Create a Kubernetes secret with your Hugging Face token:

```bash
kubectl create secret generic huggingface-credentials \
  --from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
```

### 3. Deploy vLLM Instance with Tool Calling Support

#### 3.1: Use the Example Configuration

We'll use the example configuration file located at `tutorials/assets/values-08-tool-enabled.yaml`. This file contains all the necessary settings for enabling tool calling:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3-8b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"

    # Tool calling configuration
    enableTool: true
    toolCallParser: "llama3_json"  # Parser to use for tool calls (e.g., "llama3_json" for Llama models)
    chatTemplate: "tool_chat_template_llama3.1_json.jinja"  # Template file name (will be mounted at /vllm/templates)

    # Mount Hugging Face credentials
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: huggingface-credentials
            key: HUGGING_FACE_HUB_TOKEN

    replicaCount: 1

    # Resource requirements for Llama-3.1-8B-Instruct
    requestCPU: 8
    requestMemory: "32Gi"
    requestGPU: 1
```

> **Note**: The tool calling configuration is now simplified:
>
> - `enableTool: true` enables the feature
> - `toolCallParser`: specifies how the model's tool calls are parsed (using "llama3_json" for Llama-3 models)
> - `chatTemplate`: specifies the template file name (will be mounted at `/vllm/templates/`)
>
> The chat templates are managed through a PersistentVolume that we created in step 1, which provides several benefits:
>
> - Templates are downloaded once and stored persistently
> - Templates can be shared across multiple deployments
> - Templates can be updated by updating the files in the PersistentVolume
> - Templates are version controlled with the vLLM repository

#### 3.2: Deploy the Helm Chart

```bash
# Add the vLLM Helm repository if you haven't already
helm repo add vllm https://vllm-project.github.io/production-stack

# Deploy the vLLM stack with tool calling support using the example configuration
helm install vllm-tool vllm/vllm-stack -f tutorials/assets/values-08-tool-enabled.yaml
```

The deployment will:

1. Use the PersistentVolume we created in step 1 to access the templates
2. Mount the templates at `/vllm/templates` in the container
3. Configure the model to use the specified template for tool calling

You can verify the deployment with:

```bash
# Check the deployment status
kubectl get deployments

# Check the pods
kubectl get pods

# Check the logs
kubectl logs -f deployment/vllm-tool-llama3-8b-deployment-vllm
```

### 4. Test Tool Calling Setup

Now that the deployment is running, let's test the tool calling functionality using the example script.

#### 4.1: Port Forward the Router Service

First, we need to set up port forwarding to access the router service:

```bash
# Get the service name
kubectl get svc

# Set up port forwarding to the router service
kubectl port-forward svc/vllm-tool-router-service 8000:80
```

#### 4.2: Run the Example Script

In a new terminal, run the example script to test tool calling:

```bash
# Navigate to the examples directory
cd src/examples

# Run the example script
python tool_calling_example.py
```

The script will:

1. Connect to the vLLM service through the port-forwarded endpoint
2. Send a test query asking about the weather
3. Demonstrate the model's ability to:
   - Understand the available tools
   - Make appropriate tool calls
   - Process the tool responses

Expected output should look something like:

```text
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "celsius"}
Result: Getting the weather for San Francisco, CA in celsius...
```

This confirms that:

1. The vLLM service is running correctly
2. Tool calling is properly enabled
3. The model can understand and use the defined tools
4. The template system is working as expected
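You can also exercise tool calling directly against the OpenAI-compatible API through the same port-forward, without the Python example. The request below is a sketch: the model name is assumed to match the `modelURL` from the values file, and the `get_weather` tool schema is illustrative — check `GET /v1/models` on your deployment if the model name differs.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco in celsius?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
              "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```

A successful response includes a `tool_calls` entry naming `get_weather` with JSON arguments, mirroring the output shown above.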
> **Note**: The example uses a mock weather function for demonstration. In a real application, you would replace this with actual API calls to weather services.
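For reference, a client along these lines can be written with just the `openai` package. The sketch below is not the shipped `src/examples/tool_calling_example.py`; the tool schema, model name, and mock `get_weather` handler are illustrative assumptions, but the request/response flow matches what the example demonstrates.

```python
# Minimal tool-calling client sketch (assumes the router is port-forwarded to localhost:8000).
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Illustrative tool definition; the real example script may define its tools differently.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]


def get_weather(location: str, unit: str = "celsius") -> str:
    # Mock implementation; a real application would call a weather API here.
    return f"Getting the weather for {location} in {unit}..."


response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed; check GET /v1/models for your deployment
    messages=[{"role": "user", "content": "What is the weather in San Francisco in celsius?"}],
    tools=tools,
    tool_choice="auto",
)

# Assumes the model chose to call the tool; a direct text answer would leave tool_calls empty.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Function called: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
print(f"Result: {get_weather(**args)}")
```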