Add some hard-coded change in html to markdown

This commit is contained in:
Zhuohan Li 2024-09-05 00:02:30 -07:00
parent ce90fa1339
commit 321025b5d7
1 changed file with 10 additions and 7 deletions


@@ -5,7 +5,7 @@ author: "vLLM Team"
---
We would like to share two updates to the vLLM community.
### Future of vLLM is Open
@@ -18,7 +18,10 @@ We would like to share two updates to the vLLM community.
We are excited to see vLLM is becoming the standard for LLM inference and serving. In the recent [Meta Llama 3.1 announcement](https://ai.meta.com/blog/meta-llama-3-1/), 8 out of 10 official partners for real-time inference run vLLM as the serving engine for the Llama 3.1 models. We have also heard anecdotally that vLLM is being used in many of the AI features in our daily lives.
- We believe vLLM's success comes from the power of the strong open source community. vLLM is actively maintained by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, IBM, Neural Magic, Roblox, and others. To this extent, we want to ensure the ownership and governance of the project is open and transparent as well.
+ We believe vLLM's success comes from the power of the strong open source community. vLLM is actively maintained by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, Databricks, IBM, Neural Magic, Roblox, Snowflake, and others. To this extent, we want to ensure the ownership and governance of the project is open and transparent as well.
We are excited to announce that vLLM has [started the incubation process into LF AI & Data Foundation](https://lfaidata.foundation/blog/2024/07/17/lf-ai-data-foundation-mid-year-review-significant-growth-in-the-first-half-of-2024/?hss_channel=tw-976478457881247745). This means no one party will have exclusive control over the future of vLLM. The license and trademark will be irrevocably open. You can trust vLLM is here to stay and will be actively maintained and improved going forward.
@@ -27,24 +30,24 @@ We are excited to announce that vLLM has [started the incubation process into LF
The vLLM contributors are doubling down to ensure vLLM is the fastest and easiest-to-use LLM inference and serving engine.
To recall our roadmap, we focus vLLM on six objectives: wide model coverage, broad hardware support, top performance, production-ready, thriving open source community, and extensible architecture.
Toward our objective of performance optimization, we have made the following progress to date:
* Publication of benchmarks
  * Published a per-commit performance tracker at [perf.vllm.ai](https://perf.vllm.ai) on our public benchmarks. The goal is to track performance enhancements and regressions.
  * Published a reproducible benchmark ([docs](https://docs.vllm.ai/en/latest/performance_benchmark/benchmarks.html)) comparing vLLM to LMDeploy, TGI, and TensorRT-LLM. The goal is to identify gaps in performance and close them.
* Development and integration of highly optimized kernels
  * Integrated FlashAttention2 with PagedAttention and [FlashInfer](https://github.com/flashinfer-ai/flashinfer). We plan to integrate [FlashAttention3](https://github.com/vllm-project/vllm/issues/6348).
  * Integrating [Flux](https://arxiv.org/abs/2406.06858v1), which overlaps computation and collective communication.
  * Developed state-of-the-art kernels for quantized inference, including INT8 and FP8 activation quantization (via cutlass) and INT4, INT8, and FP8 weight-only quantization for GPTQ and AWQ (via marlin); a usage sketch follows this list.
* Started several work streams to lower critical overhead
  * We identified that vLLM's synchronous and blocking scheduler is a key bottleneck for models running on fast GPUs (H100s). We are working on making the scheduler asynchronous so it can plan steps ahead of time.
  * We identified that vLLM's OpenAI-compatible API frontend has higher-than-desired overhead. [We are working on isolating it from the critical path of the scheduler and model inference](https://github.com/vllm-project/vllm/issues/6797); a minimal request sketch also follows this list.
  * We identified that vLLM's input preparation and output processing scale suboptimally with the data size. Many of these operations can be vectorized and moved off the critical path.
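To make the kernel and quantization items above more concrete, here is a minimal sketch of how a user typically opts into an AWQ-quantized model and an alternative attention backend through vLLM's offline `LLM` API. The model name is a placeholder, and the `VLLM_ATTENTION_BACKEND` variable and accepted quantization values are assumptions that may differ across vLLM versions; treat this as a sketch, not a definitive recipe.

```python
import os

# Optionally select the FlashInfer attention backend before vLLM initializes
# (assumed environment variable; accepted values may differ across versions).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# `quantization="awq"` routes weight-only quantized layers to the optimized kernels.
# The model name is a placeholder for any AWQ-quantized checkpoint.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The future of LLM inference is"], sampling_params)
print(outputs[0].outputs[0].text)
```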
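For context on the OpenAI-compatible API frontend mentioned in the last group of items, here is a small sketch of how that frontend is typically exercised once a server is running. The launch command, port, and model name are assumptions and may differ by version and deployment.

```python
import requests

# Assumes a server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct
# The port (8000) and model name below are assumptions.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "vLLM is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```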
We will continue to update the community on vLLM's progress in closing the performance gap. You can track our overall progress [here](https://github.com/vllm-project/vllm/issues/6801). Please continue to suggest new ideas and contribute your improvements!
### More Resources