[Add] Blog post on transformers backend integration with vLLM (#50)

* add transformers backend blog post

Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>

OK

Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>

* Apply suggestions from code review

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>

* Update _posts/2025-04-11-transformers-backend.md

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>

OK

Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>

* Update _posts/2025-04-11-transformers-backend.md

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

---------

Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Aritra Roy Gosthipaty 2025-04-16 17:01:00 +05:30 committed by GitHub
parent 5e19ec38f9
commit dcfdf596c1
2 changed files with 163 additions and 0 deletions


@@ -0,0 +1,163 @@
---
layout: post
title: "Transformers backend integration in vLLM"
author: "The Hugging Face Team"
image: /assets/figures/transformers-backend/transformers-backend.png
thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
share-img: /assets/figures/transformers-backend/transformers-backend.png
---
The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
offers a flexible, unified interface to a vast ecosystem of model architectures. From research to
fine-tuning on custom datasets, transformers is the go-to toolkit for all of it.

But when it comes to *deploying* these models at scale, inference speed and efficiency often take
center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput
inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.

A recent addition to the vLLM codebase makes it possible to run models with transformers as a backend,
so vLLM can apply its throughput and latency optimizations on top of existing transformers architectures.

In this post, we'll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Transformers and vLLM: Inference in Action

Let's start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
these libraries stack up.

**Infer with transformers**

The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:
```py
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("The future of AI is")
print(result[0]["generated_text"])
```
This approach is perfect for prototyping or small-scale tasks, but it's not optimized for high-volume
inference or low-latency deployment.

**Infer with vLLM**

vLLM takes a different track, prioritizing efficiency with features like `PagedAttention`
(a memory-efficient attention mechanism) and continuous batching. Here's the same task in vLLM:
```py
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("The future of AI is", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```
vLLM's inference is noticeably faster and more resource-efficient, especially under load: because
PagedAttention keeps the KV cache compact and requests are batched continuously, a single GPU can
serve many concurrent requests with lower memory overhead.
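
To see that batching at work, here is a minimal sketch (the extra prompts are made up for illustration)
that sends several prompts through a single `generate` call and lets the engine schedule them together:
```py
from vllm import LLM, SamplingParams

# Illustrative prompts; swap in your own data.
prompts = [
    "The future of AI is",
    "The capital of France is",
    "PagedAttention works by",
]

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)

# One generate() call submits all prompts at once, so vLLM batches them
# instead of processing them one by one.
outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```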

## vLLM's Deployment Superpower: OpenAI Compatibility

Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for
external services. Launch a server:
```bash
vllm serve meta-llama/Llama-3.2-1B
```
Then query it with curl:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```
Or use Python's OpenAI client:
```py
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0
)
print("Completion result:", completion.choices[0].text)
```
This compatibility lets you swap a hosted endpoint for a self-hosted one without changing client code, cutting costs and keeping you in control while scaling inference locally with vLLM's optimizations.
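
The same server also exposes the chat endpoint. As a small sketch, assuming you serve an
instruct-tuned checkpoint such as `meta-llama/Llama-3.2-1B-Instruct` (the base model served above is
not tuned for chat), you could query `/v1/chat/completions` like this:
```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Assumes `vllm serve meta-llama/Llama-3.2-1B-Instruct` is running; any chat
# model you have served works the same way.
chat = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What makes vLLM fast?"},
    ],
    max_tokens=64,
)
print("Chat result:", chat.choices[0].message.content)
```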

## Why do we need the transformers backend?

The transformers library is optimized for contributions and the
[addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new
model to vLLM, on the other hand, is a little
[more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html).

In the **ideal world**, we would be able to use a new model in vLLM as soon as it is added to
transformers. With the integration of the transformers backend, we take a step towards that ideal world.

Here is the [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#remote-code)
on how to make your transformers model compatible with vLLM so that the integration kicks in.
We followed it and made `modeling_gpt2.py` compatible! You can follow the
changes in this [transformers pull request](https://github.com/huggingface/transformers/pull/36934).

For a model that is already in transformers (and compatible with vLLM), this is all we would need to do:
```py
llm = LLM(model="new-transformers-model", model_impl="transformers")
```

> [!NOTE]
> It is not strictly necessary to pass the `model_impl` parameter. vLLM falls back to the transformers
> implementation on its own if the model is not natively supported in vLLM.

Or for a custom model from the Hugging Face Hub:
```py
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```

This backend acts as a **bridge**, marrying transformers' plug-and-play flexibility with vLLM's
inference prowess. You get the best of both worlds: rapid prototyping with transformers
and optimized deployment with vLLM.

## Case Study: Helium

[Kyutai Team's Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet natively supported by vLLM. You might still want to run optimized inference on the model with vLLM, and this is where the transformers backend shines.

Let's see this in action:
```bash
vllm serve kyutai/helium-1-preview-2b --model-impl transformers
```
Query it with the OpenAI API:
```py
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="kyutai/helium-1-preview-2b", prompt="What is AI?")
print("Completion result:", completion)
```
Here, vLLM efficiently processes inputs, leveraging the transformers backend to load
`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in transformers,
vLLM delivers lower latency and better resource utilization.
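
For offline batch inference, the same model can also be loaded directly through the `LLM` class,
mirroring the earlier snippets (a small sketch; the prompts are illustrative):
```py
from vllm import LLM, SamplingParams

# model_impl="transformers" tells vLLM to run the model through the transformers backend.
llm = LLM(model="kyutai/helium-1-preview-2b", model_impl="transformers")
params = SamplingParams(max_tokens=32)

outputs = llm.generate(["What is AI?", "Why is the sky blue?"], sampling_params=params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```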

By pairing Transformers' model ecosystem with vLLM's inference optimizations, you unlock a workflow
that's both flexible and scalable. Whether you're prototyping a new model, deploying a custom
creation, or scaling a multimodal app, this combination accelerates your path from research to production.

(Binary image file added, 54 KiB, not shown.)