Installation

vLLM powered by OpenVINO supports all LLM models from the vLLM supported models list and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs (see the list of supported GPUs).

[!NOTE] There are no pre-built wheels or images for this device, so you must build vLLM from source.

Requirements

  • OS: Linux
  • Instruction set architecture (ISA) requirement: at least AVX2.

Set up using Python

Pre-built wheels

Currently, there are no pre-built OpenVINO wheels.

Build wheel from source

First, install Python and ensure you have the latest pip. For example, on Ubuntu 22.04, you can run:

sudo apt-get update -y
sudo apt-get install python3-pip
pip install --upgrade pip

Second, clone the vllm-openvino repository, which provides the prerequisites for the vLLM OpenVINO backend installation:

git clone https://github.com/vllm-project/vllm-openvino.git
cd vllm-openvino

Finally, install vLLM with OpenVINO backend:

VLLM_TARGET_DEVICE="empty" PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python -m pip install -v .
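
After installation, a quick sanity check is to import the package and print its version; a minimal sketch, assuming the build installs the standard vllm module:

python3 -c "import vllm; print(vllm.__version__)"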

[!NOTE] On x86, triton is installed as a vLLM dependency, but it does not work correctly with the OpenVINO backend. Uninstall it via python3 -m pip uninstall -y triton.

[!NOTE] To use the vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html.

Set up using Docker

Pre-built images

Currently, there are no pre-built OpenVINO images.

Build image from source

docker build -t vllm-openvino-env .
docker run -it --rm vllm-openvino-env
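
To use an Intel GPU from inside the container, the host render devices typically need to be passed through to Docker; a minimal sketch, assuming the GPU is exposed under /dev/dri (exact device nodes and group permissions vary per system):

docker run -it --rm --device /dev/dri vllm-openvino-env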

Extra information

Supported features

The OpenVINO vLLM backend supports the following advanced vLLM features:

  • Prefix caching (--enable-prefix-caching)
  • Chunked prefill (--enable-chunked-prefill)

[!NOTE] Simultaneous usage of both --enable-prefix-caching and --enable-chunked-prefill is not yet implemented.

[!NOTE] --enable-chunked-prefill is broken on openvino==2025.2; to use this feature, update openvino to a nightly 2025.3 release or downgrade to openvino==2025.1.
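
For example, prefix caching can be enabled when launching the OpenAI-compatible server; a minimal sketch, assuming the standard vLLM server entrypoint is available in this build:

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --enable-prefix-caching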

Performance tips

vLLM OpenVINO backend environment variables

  • VLLM_OPENVINO_DEVICE to specify which device to utilize for inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g., VLLM_OPENVINO_DEVICE=GPU.1). If the value is not specified, the CPU device is used by default.
  • VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON to enable U8 weight compression during the model loading stage. By default, compression is turned off. You can also export a model with different compression techniques using optimum-cli and pass the exported folder as <model_id> (see the sketch after this list).
  • VLLM_USE_V1 to enable the V1 vLLM API, e.g., VLLM_USE_V1=1.
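
As an example of the optimum-cli route mentioned above, a model can be exported with compressed weights and the resulting folder passed to vLLM as <model_id>; a hedged sketch, assuming an optimum-intel installation with OpenVINO support (the output directory name is arbitrary and the available --weight-format values depend on your optimum-intel version):

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 ./llama-2-7b-chat-hf-int4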

CPU performance tips

The CPU backend uses the following environment variables to control its behavior:

  • VLLM_OPENVINO_KVCACHE_SPACE to specify the KV cache size (e.g., VLLM_OPENVINO_KVCACHE_SPACE=40 means 40 GB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the memory management pattern of users.
  • VLLM_OPENVINO_KV_CACHE_PRECISION=u8 to control the KV cache precision. By default, u8 precision is used.

To improve TPOT (time per output token) / TTFT (time to first token) latency, you can use vLLM's chunked prefill feature (--enable-chunked-prefill). Based on experiments, the recommended batch size is 256 (--max-num-batched-tokens).

The best known OpenVINO configuration for CPU is:

$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256

GPU performance tips

The GPU device implements automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking the gpu_memory_utilization option into account). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache via the VLLM_OPENVINO_KVCACHE_SPACE environment variable (e.g., VLLM_OPENVINO_KVCACHE_SPACE=8 means 8 GB of space for the KV cache).

Additionally, the GPU device supports VLLM_OPENVINO_KV_CACHE_PRECISION (e.g., i8 or fp16) to control the KV cache precision (the default value is device-specific).

Currently, the best GPU performance is achieved with the default vLLM execution parameters for models with quantized weights (8-bit and 4-bit integer data types are supported) and preemption-mode=swap.

The best known OpenVINO configuration for GPU is:

$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_KV_CACHE_PRECISION=i8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json

Limitations

  • LoRA serving is not supported.
  • Only LLM models are currently supported. LLaVA and encoder-decoder models are not currently enabled in the vLLM OpenVINO integration.
  • Tensor and pipeline parallelism are not currently enabled in the vLLM OpenVINO integration.