Add some hard-coded change in html to markdown
This commit is contained in:
parent
ce90fa1339
commit
321025b5d7
|
@ -5,7 +5,7 @@ author: "vLLM Team"
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
||||||
We would like to share two updates to the vLLM community.
|
We would like to share two updates to the vLLM community.
|
||||||
|
|
||||||
|
|
||||||
### Future of vLLM is Open
|
### Future of vLLM is Open
|
||||||
|
@ -18,7 +18,10 @@ We would like to share two updates to the vLLM community.
|
||||||
|
|
||||||
We are excited to see vLLM is becoming the standard for LLM inference and serving. In the recent [Meta Llama 3.1 announcement](https://ai.meta.com/blog/meta-llama-3-1/), 8 out of 10 official partners for real time inference run vLLM as the serving engine for the Llama 3.1 models. We have also heard anecdotally that vLLM is being used in many of the AI features in our daily life.
|
We are excited to see vLLM is becoming the standard for LLM inference and serving. In the recent [Meta Llama 3.1 announcement](https://ai.meta.com/blog/meta-llama-3-1/), 8 out of 10 official partners for real time inference run vLLM as the serving engine for the Llama 3.1 models. We have also heard anecdotally that vLLM is being used in many of the AI features in our daily life.
|
||||||
|
|
||||||
We believe vLLM’s success comes from the power of the strong open source community. vLLM is actively maintained by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, IBM, Neural Magic, Roblox, and others. To this extent, we want to ensure the ownership and governance of the project is open and transparent as well.
|
We believe vLLM’s success comes from the power of the strong open source community. vLLM is actively maintai
|
||||||
|
ned by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, Databricks, IBM, Neural Magic, Roblox,
|
||||||
|
Snowflake, and others. To this extent, we want to ensure the ownership and governance of the project is open an
|
||||||
|
d transparent as well.
|
||||||
|
|
||||||
We are excited to announce that vLLM has [started the incubation process into LF AI & Data Foundation](https://lfaidata.foundation/blog/2024/07/17/lf-ai-data-foundation-mid-year-review-significant-growth-in-the-first-half-of-2024/?hss_channel=tw-976478457881247745). This means no one party will have exclusive control over the future of vLLM. The license and trademark will be irrevocably open. You can trust vLLM is here to stay and will be actively maintained and improved going forward.
|
We are excited to announce that vLLM has [started the incubation process into LF AI & Data Foundation](https://lfaidata.foundation/blog/2024/07/17/lf-ai-data-foundation-mid-year-review-significant-growth-in-the-first-half-of-2024/?hss_channel=tw-976478457881247745). This means no one party will have exclusive control over the future of vLLM. The license and trademark will be irrevocably open. You can trust vLLM is here to stay and will be actively maintained and improved going forward.
|
||||||
|
|
||||||
|
@ -27,24 +30,24 @@ We are excited to announce that vLLM has [started the incubation process into LF
|
||||||
|
|
||||||
The vLLM contributors are doubling down to ensure vLLM is a fastest and easiest-to-use LLM inference and serving engine.
|
The vLLM contributors are doubling down to ensure vLLM is a fastest and easiest-to-use LLM inference and serving engine.
|
||||||
|
|
||||||
To recall our roadmap, we focus vLLM on six objectives: wide model coverage, broad hardware support, top performance, production-ready, thriving open source community, and extensible architecture.
|
To recall our roadmap, we focus vLLM on six objectives: wide model coverage, broad hardware support, top performance, production-ready, thriving open source community, and extensible architecture.
|
||||||
|
|
||||||
In our objective for performance optimization, we have made the following progress to date:
|
In our objective for performance optimization, we have made the following progress to date:
|
||||||
|
|
||||||
|
|
||||||
* Publication of benchmarks
|
* Publication of benchmarks
|
||||||
* Published per-commit performance tracker at [perf.vllm.ai](https://perf.vllm.ai) on our public benchmarks. The goal of this is to track performance enhancement and regressions.
|
* Published per-commit performance tracker at [perf.vllm.ai](https://perf.vllm.ai) on our public benchmarks. The goal of this is to track performance enhancement and regressions.
|
||||||
* Published reproducible benchmark ([docs](https://docs.vllm.ai/en/latest/performance_benchmark/benchmarks.html)) of vLLM compared to LMDeploy, TGI, and TensorRT-LLM. The goal is to identify gaps in performance and close them.
|
* Published reproducible benchmark ([docs](https://docs.vllm.ai/en/latest/performance_benchmark/benchmarks.html)) of vLLM compared to LMDeploy, TGI, and TensorRT-LLM. The goal is to identify gaps in performance and close them.
|
||||||
* Development and integration of highly optimized kernels
|
* Development and integration of highly optimized kernels
|
||||||
* Integrated FlashAttention2 with PagedAttention, and [FlashInfer](https://github.com/flashinfer-ai/flashinfer). We plan to integrate [FlashAttention3](https://github.com/vllm-project/vllm/issues/6348).
|
* Integrated FlashAttention2 with PagedAttention, and [FlashInfer](https://github.com/flashinfer-ai/flashinfer). We plan to integrate [FlashAttention3](https://github.com/vllm-project/vllm/issues/6348).
|
||||||
* Integrating [Flux](https://arxiv.org/abs/2406.06858v1) which overlaps computation and collective communication.
|
* Integrating [Flux](https://arxiv.org/abs/2406.06858v1) which overlaps computation and collective communication.
|
||||||
* Developed state of the art kernels for quantized inference, including INT8 and FP8 activation quantization (via cutlass) and INT4, INT8, and FP8 weight-only quantization for GPTQ and AWQ (via marlin).
|
* Developed state of the art kernels for quantized inference, including INT8 and FP8 activation quantization (via cutlass) and INT4, INT8, and FP8 weight-only quantization for GPTQ and AWQ (via marlin).
|
||||||
* Started several work streams to lower critical overhead
|
* Started several work streams to lower critical overhead
|
||||||
* We identified vLLM’s synchronous and blocking scheduler is a key bottleneck for models running on fast GPUs (H100s). We are working on making the schedule asynchronous and plan steps ahead of time.
|
* We identified vLLM’s synchronous and blocking scheduler is a key bottleneck for models running on fast GPUs (H100s). We are working on making the schedule asynchronous and plan steps ahead of time.
|
||||||
* We identified vLLM’s OpenAI-compatible API frontend has higher than desired overhead. [We are working on isolating it from the critical path of scheduler and model inference. ](https://github.com/vllm-project/vllm/issues/6797)
|
* We identified vLLM’s OpenAI-compatible API frontend has higher than desired overhead. [We are working on isolating it from the critical path of scheduler and model inference. ](https://github.com/vllm-project/vllm/issues/6797)
|
||||||
* We identified vLLM’s input preparation, and output processing scale suboptimally with the data size. Many of the operations can be vectorized and enhanced by moving them off the critical path.
|
* We identified vLLM’s input preparation, and output processing scale suboptimally with the data size. Many of the operations can be vectorized and enhanced by moving them off the critical path.
|
||||||
|
|
||||||
We will continue to update the community in vLLM’s progress in closing the performance gap. You can track our overall progress [here](https://github.com/vllm-project/vllm/issues/6801). Please continue to suggest new ideas and contribute with your improvements!
|
We will continue to update the community in vLLM’s progress in closing the performance gap. You can track our overall progress [here](https://github.com/vllm-project/vllm/issues/6801). Please continue to suggest new ideas and contribute with your improvements!
|
||||||
|
|
||||||
|
|
||||||
### More Resources
|
### More Resources
|
||||||
|
|
Loading…
Reference in New Issue