## Test client locally

> Note: All commands below should be run from the benchmark home directory (i.e., `/aibrix/benchmarks/`).

Start the vLLM server:

```shell
export API_KEY=${API_KEY}
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port "8000" \
    --model /root/models/deepseek-llm-7b-chat \
    --trust-remote-code \
    --max-model-len "4096" \
    --api-key ${API_KEY} \
    --enable-chunked-prefill
```
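
Before running the client, you can sanity-check that the server is up by listing the served models through the OpenAI-compatible API. A minimal sketch using `requests`, with the port and API key taken from the launch command above:

```python
import os
import requests

# Query the vLLM OpenAI-compatible server started above for its model list.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```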

Use a sample workload (generated by the workload generator) with the client. Turn on `--streaming` to collect fine-grained metrics such as time to first token (TTFT) and time per output token (TPOT):

```shell
export API_KEY=${API_KEY}
python -m client.client \
    --workload-path "output/workload/constant/workload.jsonl" \
    --endpoint "http://localhost:8000" \
    --model /root/models/deepseek-llm-7b-chat \
    --api-key ${API_KEY} \
    --output-token-limit 128 \
    --streaming \
    --output-file-path output.jsonl
```

The per-request results are written as JSON lines to the file given by `--output-file-path` (here, `output.jsonl`).
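
If you want to inspect the collected metrics directly, each line of `output.jsonl` is one JSON record. A quick sketch that lists the recorded field names (the exact fields depend on the client version, so none are assumed here):

```python
import json

# Print the field names of the first few result records in output.jsonl.
with open("output.jsonl") as f:
    for i, line in enumerate(f):
        if i == 3:
            break
        print(sorted(json.loads(line).keys()))
```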

Run analysis on the collected metrics. For the streaming client, you can specify a goodput target in the form `metric:threshold`, where the metric is one of `e2e`, `ttft`, or `tpot`, like the following:

```shell
python -m client.analyze \
    --trace output.jsonl \
    --output output \
    --goodput-target tpot:0.5
```
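
As a cross-check of the goodput semantics, goodput is the fraction of requests whose chosen metric meets the threshold. A minimal sketch, assuming each record in `output.jsonl` exposes a per-request TPOT under a `tpot` key with the threshold in seconds (both are assumptions; the actual field name and units come from the client):

```python
import json

TPOT_TARGET = 0.5  # assumed seconds, matching --goodput-target tpot:0.5

met = total = 0
with open("output.jsonl") as f:
    for line in f:
        tpot = json.loads(line).get("tpot")  # assumed field name
        if tpot is not None:
            total += 1
            met += tpot <= TPOT_TARGET
if total:
    print(f"goodput: {met}/{total} = {met / total:.2%}")
```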

By default, the client treats each timestamp in the workload as milliseconds (ms). These timestamps can be scaled with `--time-scale`. For example, if the original workload dispatches one parallel batch of requests at times 0, 1000, 2000, ... (i.e., one dispatch every second), then `--time-scale 0.1` scales the intervals between timestamps by 0.1, so requests are dispatched at times 0, 100, 200, ... (i.e., one dispatch every 100 ms).
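
To make the scaling arithmetic concrete, here is a sketch of how a replay loop could turn workload timestamps into dispatch times. This is not the client's actual implementation, and the `timestamp` field name is an assumption about the workload-generator output:

```python
import asyncio
import json

TIME_SCALE = 0.1  # same semantics as --time-scale 0.1

async def replay(workload_path: str) -> None:
    with open(workload_path) as f:
        entries = [json.loads(line) for line in f]
    loop = asyncio.get_running_loop()
    start = loop.time()
    for entry in entries:
        # A 1000 ms workload timestamp with TIME_SCALE=0.1 becomes a 100 ms offset.
        target_s = entry["timestamp"] / 1000.0 * TIME_SCALE
        delay = target_s - (loop.time() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        print(f"dispatch at scaled t = {target_s:.3f}s")

asyncio.run(replay("output/workload/constant/workload.jsonl"))
```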