vllm/docs/features/quantization/bitblas.md

62 lines
1.7 KiB
Markdown

---
title: BitBLAS
---
[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
!!! note
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
Below are the steps to utilize BitBLAS with vLLM.
```bash
pip install bitblas>=0.1.0
```
vLLM reads the model's config file and supports pre-quantized checkpoints.
You can find pre-quantized models on:
- [Hugging Face (BitBLAS)](https://huggingface.co/models?search=bitblas)
- [Hugging Face (GPTQ)](https://huggingface.co/models?search=gptq)
Usually, these repositories have a `quantize_config.json` file that includes a `quantization_config` section.
## Read bitblas format checkpoint
```python
from vllm import LLM
import torch
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitblas"
)
```
## Read gptq format checkpoint
??? Code
```python
from vllm import LLM
import torch
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
llm = LLM(
model=model_id,
dtype=torch.float16,
trust_remote_code=True,
quantization="bitblas",
max_model_len=1024
)
```