# Tutorial: Loading Model Weights from Persistent Volume
## Introduction

In this tutorial, you will learn how to load a model from a Persistent Volume (PV) in Kubernetes to optimize deployment performance. The steps include creating a PV, matching it using `pvcMatchLabels`, and deploying the Helm chart so that it utilizes the PV. You will also verify the setup by examining the contents of the mounted directory and observing the faster startup on reinstall.
## Table of Contents

- Prerequisites
- Step 1: Creating a Persistent Volume
- Step 2: Deploying with Helm Using the PV
- Step 3: Verifying the Deployment
## Prerequisites

- A running Kubernetes cluster with GPU support.
- Completion of the previous tutorials in this series.
- A basic understanding of Kubernetes PV and PVC concepts.
## Step 1: Creating a Persistent Volume

1. Locate the Persistent Volume manifest file at `tutorials/assets/pv-03.yaml`, which has the following content:

   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: test-vllm-pv
     labels:
       model: "llama3-pv"
   spec:
     capacity:
       storage: 50Gi
     accessModes:
       - ReadWriteOnce
     persistentVolumeReclaimPolicy: Retain
     storageClassName: standard
     hostPath:
       path: /data/llama3
   ```

   **Note:** You can change the path specified in the `hostPath` field to any valid directory on your Kubernetes node.

2. Apply the manifest:

   ```bash
   kubectl apply -f tutorials/assets/pv-03.yaml
   ```

3. Verify the PV is created:

   ```bash
   kubectl get pv
   ```

   Expected output:

   ```plaintext
   NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
   test-vllm-pv   50Gi       RWO            Retain           Available           standard       2m
   ```
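If you want to script this check, note that parsing the table output of `kubectl get pv` is brittle because the empty `CLAIM` column shifts the whitespace-separated fields; on a live cluster, `kubectl get pv test-vllm-pv -o jsonpath='{.status.phase}'` is more robust. A minimal sketch, run here against the sample output above rather than a live cluster:

```shell
# Extract the STATUS column for test-vllm-pv from sample `kubectl get pv`
# output (the tutorial's expected output, not a live cluster).
sample='NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
test-vllm-pv   50Gi   RWO   Retain   Available       standard   2m'

# With whitespace splitting, STATUS is field 5 on this row because the
# CLAIM column is empty.
status=$(printf '%s\n' "$sample" | awk '$1 == "test-vllm-pv" { print $5 }')
echo "$status"   # → Available
```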
## Step 2: Deploying with Helm Using the PV

1. Locate the example values file at `tutorials/assets/values-03-match-pv.yaml`, which has the following content:

   ```yaml
   servingEngineSpec:
     modelSpec:
     - name: "llama3"
       repository: "vllm/vllm-openai"
       tag: "latest"
       modelURL: "meta-llama/Llama-3.1-8B-Instruct"
       replicaCount: 1
       requestCPU: 10
       requestMemory: "16Gi"
       requestGPU: 1
       pvcStorage: "50Gi"
       pvcMatchLabels:
         model: "llama3-pv"
       vllmConfig:
         maxModelLen: 4096
       hf_token: <YOUR HF TOKEN>
   ```

   **Explanation:** The `pvcMatchLabels` field specifies the labels used to match an existing Persistent Volume. In this example, it ensures that the deployment uses the PV with the label `model: "llama3-pv"`. This provides a way to link a specific PV to your application.

   **Note:** Make sure to replace `<YOUR HF TOKEN>` with your actual Hugging Face token in the YAML.

2. Deploy the Helm chart:

   ```bash
   helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml
   ```

3. Verify the deployment:

   ```bash
   kubectl get pods
   ```

   Expected output:

   ```plaintext
   NAME                                    READY   STATUS    RESTARTS   AGE
   vllm-deployment-router-xxxx-xxxx        1/1     Running   0          1m
   vllm-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          1m
   ```
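Under the hood, `pvcMatchLabels` becomes a label `selector` on the PersistentVolumeClaim that the chart creates, which is what restricts binding to your PV. The following is an illustrative sketch of such a claim, not the chart's actual template (the claim name is hypothetical):

```yaml
# Illustrative PVC: the selector below only binds to PVs labeled
# model: "llama3-pv", such as test-vllm-pv from Step 1.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama3-storage-claim   # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi            # matches pvcStorage in the values file
  storageClassName: standard
  selector:
    matchLabels:
      model: "llama3-pv"
```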
## Step 3: Verifying the Deployment

1. Check the contents of the host directory:

   - If using a standard Kubernetes node:

     ```bash
     sudo ls /data/llama3
     ```

   - If using Minikube, access the Minikube VM and check the path:

     ```bash
     minikube ssh
     ls /data/llama3/hub
     ```

   Expected output: you should see the model files loaded into the directory:

   ```plaintext
   models--meta-llama--Llama-3.1-8B-Instruct  version.txt
   ```
2. Uninstall and reinstall the deployment to observe the faster startup:

   ```bash
   helm uninstall vllm
   kubectl delete -f tutorials/assets/pv-03.yaml && kubectl apply -f tutorials/assets/pv-03.yaml
   helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml
   ```
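The `models--meta-llama--Llama-3.1-8B-Instruct` directory name you verified above follows the Hugging Face hub cache convention: the `/` in the repository ID is replaced by `--` and the result is prefixed with `models--`. A small sketch of that mapping (the `hf_cache_dir` helper is illustrative, not part of any tool used here):

```shell
# Derive the cache directory name the HF hub layout uses for a model repo.
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

hf_cache_dir "meta-llama/Llama-3.1-8B-Instruct"
# → models--meta-llama--Llama-3.1-8B-Instruct
```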
## Explanation

- During the second installation, the serving engine starts faster because the model files are already present in the Persistent Volume and do not need to be downloaded again.
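The effect can be sketched as a simple cache check: the expensive download step runs only when the model directory is missing from the mounted volume. The paths and the `start_engine` function below are illustrative, not the chart's actual logic:

```shell
# Simulate the cold vs. warm start: the expensive download happens only
# when the cache directory is absent (as on the first install).
cache_root=$(mktemp -d)                   # stands in for the mounted PV
model_dir="$cache_root/models--meta-llama--Llama-3.1-8B-Instruct"

start_engine() {
  if [ -d "$model_dir" ]; then
    echo "cache hit: reusing weights from volume"
  else
    mkdir -p "$model_dir"                 # stands in for the slow download
    echo "cache miss: downloading weights"
  fi
}

start_engine   # first install  → cache miss: downloading weights
start_engine   # reinstall      → cache hit: reusing weights from volume
```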
## Conclusion

In this tutorial, you learned how to utilize a Persistent Volume to store model weights for a vLLM serving engine. This approach speeds up redeployments and demonstrates the benefits of Kubernetes storage resources. Continue exploring advanced configurations in future tutorials.