---
layout: post
title: "Transformers backend integration in vLLM"
author: "The Hugging Face Team"
image: /assets/figures/transformers-backend/transformers-backend.png
thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
share-img: /assets/figures/transformers-backend/transformers-backend.png
---

The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
offers a flexible, unified interface to a vast ecosystem of model architectures. From research to
fine-tuning on custom datasets, transformers is the go-to toolkit for it all.

But when it comes to *deploying* these models at scale, inference speed and efficiency often take
center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput
inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.

A recent addition to the vLLM codebase enables leveraging transformers as a backend to run models.
vLLM will therefore optimize throughput/latency on top of existing transformers architectures.
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
these libraries stack up.

**Infer with transformers**

The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("The future of AI is")

print(result[0]["generated_text"])
```

This approach is perfect for prototyping or small-scale tasks, but it’s not optimized for high-volume
inference or low-latency deployment.

**Infer with vLLM**

vLLM takes a different track, prioritizing efficiency with features like `PagedAttention`
(a memory-efficient attention mechanism) and dynamic batching. Here’s the same task in vLLM:

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("The future of AI is", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```

vLLM’s inference is noticeably faster and more resource-efficient, especially under load.
For example, it can handle thousands of requests per second with lower GPU memory usage.
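
To see that behaviour at small scale, you can hand `generate` a whole list of prompts and let vLLM batch and schedule them together. A minimal sketch (the prompts and batch size here are arbitrary):

```py
from vllm import LLM, SamplingParams

# A batch of prompts: vLLM schedules them together rather than one at a time,
# which is where its throughput advantage under load comes from.
prompts = [f"Prompt {i}: The future of AI is" for i in range(256)]
params = SamplingParams(max_tokens=20)

llm = LLM(model="meta-llama/Llama-3.2-1B")
outputs = llm.generate(prompts, sampling_params=params)

# Print the first few generations as a sanity check.
for output in outputs[:3]:
    print(output.outputs[0].text)
```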

## vLLM’s Deployment Superpower: OpenAI Compatibility

Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for
external services. Launch a server:

```bash
vllm serve meta-llama/Llama-3.2-1B
```

Then query it with curl:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```

Or use Python’s OpenAI client:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0
)
print("Completion result:", completion.choices[0].text)
```

This compatibility slashes costs and boosts control, letting you scale inference locally with vLLM’s optimizations.
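
The same server also exposes the chat completions route. As a sketch of how little changes, assume you serve an instruction-tuned checkpoint that ships a chat template (for example `meta-llama/Llama-3.2-1B-Instruct`); the standard OpenAI chat client code then works unchanged:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Assumes the server was launched with an instruction-tuned model, e.g.:
#   vllm serve meta-llama/Llama-3.2-1B-Instruct
chat = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Describe San Francisco in one sentence."}],
    max_tokens=30,
)
print("Chat result:", chat.choices[0].message.content)
```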

## Why do we need the transformers backend?

The transformers library is optimized for contributions and
[addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new
model to vLLM, on the other hand, is a little
[more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html).

In an **ideal world**, we would be able to use the new model in vLLM as soon as it is added to
transformers. With the integration of the transformers backend, we step towards that ideal world.

Here is the [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#remote-code)
on how to make your transformers model compatible with vLLM for the integration to kick in.
We followed this and made `modeling_gpt2.py` compatible with the integration! You can follow the
changes in this [transformers pull request](https://github.com/huggingface/transformers/pull/36934).
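
In essence, the modeling code needs to route its attention computation through transformers’ pluggable attention interface, and the model class needs to opt in. Below is a rough, illustrative sketch of that pattern, not the actual `modeling_gpt2.py` changes: the class names are made up, and it assumes a recent transformers release that exposes `ALL_ATTENTION_FUNCTIONS` (see the documentation and pull request above for the real details).

```py
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

    def forward(self, query, key, value, attention_mask=None, **kwargs):
        # Dispatch through the configurable attention interface (sdpa,
        # flash_attention_2, ...) instead of hard-coding one implementation;
        # this is what lets vLLM swap in its own attention backend.
        # (Real modeling files also keep an eager fallback for "eager".)
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, _ = attention_interface(
            self, query, key, value, attention_mask, **kwargs
        )
        return attn_output


class MyModel(PreTrainedModel):
    # Opt-in flag signalling that the attention implementation of this model
    # can be swapped from the outside.
    _supports_attention_backend = True
```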

For a model that is already in transformers (and compatible with vLLM), this is all we need to do:

```py
llm = LLM(model="new-transformers-model", model_impl="transformers")
```

> [!NOTE]
> It is not strictly necessary to pass the `model_impl` parameter. vLLM switches to the
> transformers implementation on its own if the model is not natively supported in vLLM.

Or for a custom model from the Hugging Face Hub:

```py
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```

This backend acts as a **bridge**, marrying transformers’ plug-and-play flexibility with vLLM’s
inference prowess. You get the best of both worlds: rapid prototyping with transformers
and optimized deployment with vLLM.

## Case Study: Helium

[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet natively supported by vLLM. You might want to run optimized inference on the model with vLLM, and this is where the transformers backend shines.

Let’s see this in action:

```bash
vllm serve kyutai/helium-1-preview-2b --model-impl transformers
```

Query it with the OpenAI API:

```py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(model="kyutai/helium-1-preview-2b", prompt="What is AI?")
print("Completion result:", completion)
```

Here, vLLM efficiently processes inputs, leveraging the transformers backend to load
`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in transformers,
vLLM delivers lower latency and better resource utilization.

By pairing transformers’ model ecosystem with vLLM’s inference optimizations, you unlock a workflow
that’s both flexible and scalable. Whether you’re prototyping a new model, deploying a custom
creation, or scaling a multimodal app, this combination accelerates your path from research to production.