AIBrix Demo on Volcano Engine
This guide walks you through deploying and testing AIBrix with various model serving scenarios on Volcano Engine, including routing strategies, autoscaling, observability dashboards, and KV cache integration.
Installation
Install:
kubectl apply -f aibrix-dependency-v0.3.0.yaml --server-side
kubectl apply -f aibrix-core-v0.3.0.yaml
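Before moving on, you can confirm the control plane is healthy by waiting for all AIBrix pods to become ready (a quick check, assuming the components land in the default aibrix-system namespace):

# wait until every AIBrix control-plane pod reports Ready
kubectl get pods -n aibrix-system
kubectl wait --for=condition=Ready pods --all -n aibrix-system --timeout=300s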
Uninstall:
kubectl delete -f aibrix-core-v0.3.0.yaml
kubectl delete -f aibrix-dependency-v0.3.0.yaml
Expose the endpoint:
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
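The load balancer IP can take a minute or two to be assigned, in which case LB_IP comes back empty. A small retry loop (a sketch, reusing the same service name as above) avoids exporting an empty ENDPOINT:

# poll until the gateway service has an external IP
LB_IP=""
while [ -z "$LB_IP" ]; do
  LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -z "$LB_IP" ] && sleep 5
done
ENDPOINT="${LB_IP}:80"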
Model deployment
Deploy a model:
kubectl apply -f deepseek-8b-naive.yaml
Once the model is ready, list the registered models:
curl http://${ENDPOINT}/v1/models | jq
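To pull out just the model names, a jq filter over the OpenAI-style response works (assuming the standard .data[].id layout of /v1/models):

# list only the model IDs
curl -s http://${ENDPOINT}/v1/models | jq -r '.data[].id'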
Demo 1: Routing
Authentication issue - 401
This request omits the API key header, so the gateway returns 401 Unauthorized:
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq
Model issue - 400
This request uses the wrong model name deepseek-r1-distill-llama-10b (it should be deepseek-r1-distill-llama-8b), so the gateway returns 400 Bad Request:
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 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" \
-d '{
"model": "deepseek-r1-distill-llama-10b",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq
Different routing strategies
Run the request below three times; with the random strategy, the target-pod response header should show a different pod each time.
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 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" \
-H "routing-strategy: random" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq
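Rather than re-typing the request, a small loop like the sketch below prints the routed pod for each of the three calls (it discards the response body and greps the target-pod header shown above):

# fire three requests and show only the target-pod header
for i in 1 2 3; do
  curl -s -o /dev/null -D - http://${ENDPOINT}/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer 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" \
    -H "routing-strategy: random" \
    -d '{"model": "deepseek-r1-distill-llama-8b", "prompt": "San Francisco is a", "max_tokens": 128, "temperature": 0}' \
    | grep -i target-pod
done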
Prefix cache strategy demo
Run the Jupyter scripts and issue the same request three times. All three calls should land on the same target-pod, and the TTFT of the second and third calls should be noticeably lower than the first, since the prefix is already cached.
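If you prefer curl to the notebooks, the same effect can be approximated with a fixed prompt; with "stream": true, curl's time_starttransfer is a rough TTFT proxy. This is a sketch, and the prefix-cache strategy name is an assumption to verify against your AIBrix version:

# repeat an identical prompt; runs 2 and 3 should start answering sooner
for i in 1 2 3; do
  curl -s -o /dev/null -w "run $i: TTFT ~ %{time_starttransfer}s\n" http://${ENDPOINT}/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer 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" \
    -H "routing-strategy: prefix-cache" \
    -d '{"model": "deepseek-r1-distill-llama-8b", "prompt": "San Francisco is a", "max_tokens": 128, "temperature": 0, "stream": true}'
done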
Demo 2: Autoscaling
Autoscaling needs a client that generates sustained load. Use the vLLM benchmark client to drive the autoscaling experiment:
OPENAI_API_KEY=sk-*** python benchmark_serving.py \
--backend vllm \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--trust-remote-code \
--served-model-name deepseek-r1-distill-llama-8b \
--base-url http://14.103.140.67:80 \
--endpoint /v1/completions \
--num-prompts 10000 \
--request-rate 50 \
--metric-percentiles '50,90,95,99' \
--goodput ttft:1000 tpot:100 \
--max-concurrency 200 \
--random-input-len 4096 \
--random-output-len 200 \
--dataset-name sharegpt \
--dataset-path /Users/bytedance/Downloads/ShareGPT_V3_unfiltered_cleaned_split.json \
--ignore-eos
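While the benchmark runs, you can watch the replica count change from a second terminal. The label selector below is an assumption (it presumes deepseek-8b-naive.yaml sets the usual model.aibrix.ai/name label); adjust it to whatever your manifest actually uses:

# watch model pods being added/removed as the autoscaler reacts
kubectl get pods -l model.aibrix.ai/name=deepseek-r1-distill-llama-8b -w
# if you applied an AIBrix PodAutoscaler, watch its status too
kubectl get podautoscalers -A -w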
Demo 3: Dashboard
kubectl port-forward svc/grafana-service 3000:3000 -n kube-system
Note: configure your Prometheus data source to use VMP basic auth.
Import all Grafana dashboards from here.
Demo 4: DeepSeek-R1
SSH into the host and set up the NVMe drive first:
lsblk
sudo mkfs.ext4 /dev/nvme0n1
sudo mkdir -p /mnt/nvme0
sudo mount /dev/nvme0n1 /mnt/nvme0
sudo mkdir -p /mnt/nvme0/models
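The NVMe volume is where the model weights live, and staging them before the deployment keeps pod startup fast. One way to do it (a sketch, assuming you pull from Hugging Face; the TOS tooling shown in Demo 5 works as well):

# stage DeepSeek-R1 weights onto the NVMe volume
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /mnt/nvme0/models/DeepSeek-R1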
Bring up the distributed inference:
kubectl apply -f deepseek-r1.yaml
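The 671B model runs as a multi-node group, so weight loading takes a while; a wide listing shows how the replicas spread across nodes (assuming the pod names contain deepseek-r1):

# check that all members of the inference group are scheduled and Running
kubectl get pods -o wide | grep deepseek-r1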
Run the vLLM benchmark against the server:
python benchmark_serving.py \
--backend vllm \
--model deepseek-ai/DeepSeek-R1 \
--served-model-name deepseek-r1-671b \
--base-url http://115.190.25.67:80 \
--endpoint /v1/completions \
--num-prompts 100 \
--request-rate 1 \
--metric-percentiles '50,90,95,99' \
--goodput ttft:5000 tpot:50 \
--max-concurrency 200 \
--random-input-len 2000 \
--random-output-len 200 \
--dataset-name random \
--ignore-eos \
--seed 61
Demo 5: KV cache scenarios
Use the base image:
aibrix-cn-beijing.cr.volces.com/aibrix/vllm-openai:aibrix-kvcache-v0.8.5-20250510
Edit the Dockerfile to reinstall the updated infinistore wheel:
FROM aibrix-cn-beijing.cr.volces.com/aibrix/vllm-openai:aibrix-kvcache-v0.8.5-20250510
# required: disable pip's progress bar to keep build logs readable
ENV PIP_PROGRESS_BAR=off
RUN wget https://test-files.pythonhosted.org/packages/f9/31/f9dbdfc77eadafaff9e882b501d0490625c113e3834891cd59d3223b747d/infinistore-0.2.42-cp312-cp312-manylinux_2_28_x86_64.whl
RUN pip3 install infinistore-0.2.42-cp312-cp312-manylinux_2_28_x86_64.whl --index-url=https://mirrors.ivolces.com/pypi/simple/
Alternatively, download the wheel from TOS:
./tosutil cp tos://aibrix-artifact-testing/artifacts/infinistore-0.2.42-cp312-cp312-manylinux_2_28_x86_64.whl .
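Then build and push the patched image so your engine manifests can reference it. The tag below is a placeholder, not an official image; substitute your own registry path:

# build and publish the image with the updated infinistore wheel
docker build -t aibrix-cn-beijing.cr.volces.com/aibrix/vllm-openai:aibrix-kvcache-v0.8.5-patched .
docker push aibrix-cn-beijing.cr.volces.com/aibrix/vllm-openai:aibrix-kvcache-v0.8.5-patched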
Launch the kv cache service:
kubectl apply -f kvcache.yaml
Bring up the inference engine in either direct or cluster mode and connect it to the KV cache servers.
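A quick sanity check that the engine actually attached to the cache: confirm the cache pods are running, then scan the engine logs for KV cache connector messages. Both names below are assumptions based on kvcache.yaml and the earlier model deployment; adjust them to your manifests:

# cache pods up?
kubectl get pods -o wide | grep -i kvcache
# engine logs mention the kv cache connector?
kubectl logs deploy/deepseek-r1-distill-llama-8b | grep -i kvcache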