# Mocked vLLM application
## Usage options
- You should follow the AIBrix README.md to deploy AIBrix first. After that, you can deploy the mocked app as a Deployment.
- You can launch `app.py` directly in your dev environment without containers (see the sketch after this list).
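For the second option, a minimal smoke test against a locally launched `app.py` might look like the sketch below. This assumes the mock listens on `http://localhost:8000` and accepts the default bearer token shown later in this README; adjust both if your local setup differs.

```python
# Minimal local smoke test for a directly launched app.py (no containers).
# Assumptions: the mock serves the OpenAI-compatible API on localhost:8000
# and accepts the default bearer token from this README.
import requests

BASE_URL = "http://localhost:8000"  # assumed local port

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer test-key-1234567890",
    },
    json={
        "model": "llama2-7b",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```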
## Mocked vLLM Basic Deployment
### Deploy the mocked app
1. Build the mocked base model image:

```bash
docker build -t aibrix/vllm-mock:nightly -f Dockerfile .
```

1.b (Optional) Load the container image into your local cluster.

Note: If you are using Docker Desktop on Mac, Kubernetes shares the local image repository with Docker, so the following command is not necessary. Only kind users need this step.

```bash
kind load docker-image aibrix/vllm-mock:nightly
```

2. Deploy the mocked model image:

```bash
kubectl create -k config/mock

# you can run the following command to delete the deployment
kubectl delete -k config/mock
```
### Deploy the simulator app
Alternatively, vidur is integrated for high-fidelity vLLM simulation.

0. Configure a HuggingFace token for the model tokenizer by changing `huggingface_token` in `config.json`:

```json
{
    "huggingface_token": "your huggingface token"
}
```
1. Build the simulator base model image:

```bash
docker build -t aibrix/vllm-simulator:nightly --build-arg SIMULATION=a100 -f Dockerfile .
```

1.b (Optional) Load the container image into your local cluster.

Note: If you are using Docker Desktop on Mac, Kubernetes shares the local image repository with Docker, so the following command is not necessary. Only kind users need this step.

```bash
kind load docker-image aibrix/vllm-simulator:nightly
```

2. Deploy the simulator model image:

```bash
kubectl create -k config/simulator

# you can run the following command to delete the deployment
kubectl delete -k config/simulator
```
### Test the model invocation
1. Get the service endpoint.

You have two options to expose the service:

```bash
# Option 1: Port forward the envoy service
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8000:80 &

# Option 2: Port forward the model service
kubectl -n default port-forward svc/llama2-7b 8000:8000 &
```

The default Bearer token is `test-key-1234567890`, defined in `api-key-patch.yaml`.
2. Test the model invocation:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test-key-1234567890" \
    -d '{
        "model": "llama2-7b",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
    }'
```
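Since the mock exposes an OpenAI-compatible API, the same request can also be issued from Python. The snippet below is a sketch using the official `openai` client (assumes `pip install openai`); the endpoint and token match the port-forward step above.

```python
# Same chat-completion request via the OpenAI Python client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # port-forwarded endpoint from the step above
    api_key="test-key-1234567890",        # default bearer token from api-key-patch.yaml
)

completion = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
print(completion.choices[0].message.content)
```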
## Mocked vLLM Features
- Metrics
- OpenAI-compatible interface
## How to test AIBrix features
### Gateway rpm/tpm configs

```bash
# note: it is not mandatory to create a user to access the gateway API
kubectl -n aibrix-system port-forward svc/aibrix-metadata-service 8090:8090 &

curl http://localhost:8090/CreateUser \
    -H "Content-Type: application/json" \
    -d '{"name": "your-user-name","rpm": 100,"tpm": 1000}'
```
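The same user can be created from Python with `requests`, mirroring the curl call above (assumes the metadata service is port-forwarded to `localhost:8090` as shown):

```python
# Create a gateway user via the port-forwarded metadata service
# (same endpoint and payload as the curl command above).
import requests

resp = requests.post(
    "http://localhost:8090/CreateUser",
    headers={"Content-Type": "application/json"},
    json={"name": "your-user-name", "rpm": 100, "tpm": 1000},
    timeout=10,
)
resp.raise_for_status()
print(resp.text)
```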
Send a test request:

```bash
curl -v http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test-key-1234567890" \
    -d '{
        "model": "llama2-7b",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
    }'
```
### Routing Strategy

Valid options: `random`, `least-latency`, `throughput`.

```bash
curl -v http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test-key-1234567890" \
    -H "routing-strategy: random" \
    -d '{
        "model": "llama2-7b",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
    }'
```
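To compare strategies quickly, a small loop can replay the same request once per `routing-strategy` header. This is only a sketch with `requests`; it checks that each request succeeds and prints the HTTP status.

```python
# Replay the same request once per routing strategy and report the HTTP status.
import requests

STRATEGIES = ["random", "least-latency", "throughput"]
payload = {
    "model": "llama2-7b",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7,
}

for strategy in STRATEGIES:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer test-key-1234567890",
            "routing-strategy": strategy,
        },
        json=payload,
        timeout=10,
    )
    print(f"{strategy}: HTTP {resp.status_code}")
```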
### Metrics

To make metrics easier to test, the metric values returned by this mocked vLLM deployment are inversely proportional to the replica count.

First, scale the deployment to 1 replica. The total metric value observed is 100:
```bash
kubectl scale deployment llama2-7b --replicas=1

curl http://localhost:8000/metrics
```

Sample output:

```text
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="llama2-7b"} 100.0
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="llama2-7b"} 100.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="llama2-7b"} 100.0
```
Then scale the deployment to 5 replicas. The total metric value becomes 100 / 5 = 20, which is useful for testing autoscaling:

```bash
kubectl scale deployment llama2-7b --replicas=5

curl http://localhost:8000/metrics
```

Sample output:

```text
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="llama2-7b"} 20.0
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="llama2-7b"} 20.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="llama2-7b"} 20.0
```
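To verify the inverse-proportional behavior programmatically, the `/metrics` output can be parsed with the `prometheus_client` text parser. This is a sketch; it assumes `pip install prometheus-client requests` and the port-forward from earlier.

```python
# Fetch /metrics and print the samples for a few vLLM metrics, so the
# 100 -> 20 drop after scaling to 5 replicas is easy to verify.
import requests
from prometheus_client.parser import text_string_to_metric_families

WATCHED = {
    "vllm:request_success_total",
    "vllm:avg_prompt_throughput_toks_per_s",
    "vllm:avg_generation_throughput_toks_per_s",
}

text = requests.get("http://localhost:8000/metrics", timeout=10).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCHED:
            print(sample.name, sample.labels, sample.value)
```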
### Update / Override Metrics

```bash
# check metrics
curl -X GET http://localhost:8000/metrics

# override metrics
curl -X POST http://localhost:8000/set_metrics -H "Content-Type: application/json" -d '{"gpu_cache_usage_perc": 75.0}'
```
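A quick round trip from Python can confirm that an override took effect: POST the new value, then re-read `/metrics` and look for the overridden metric (a sketch with `requests`, matching the curl calls above):

```python
# Override a metric on the mock and read /metrics back to confirm the new value appears.
import requests

BASE_URL = "http://localhost:8000"

# POST the override (same payload as the curl above).
resp = requests.post(
    f"{BASE_URL}/set_metrics",
    headers={"Content-Type": "application/json"},
    json={"gpu_cache_usage_perc": 75.0},
    timeout=10,
)
resp.raise_for_status()

# Scan the raw exposition text for the overridden metric.
metrics_text = requests.get(f"{BASE_URL}/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if "gpu_cache_usage_perc" in line and not line.startswith("#"):
        print(line)
```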