Feb 24, 2025
We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now part of Red Hat) will be hosting the event on March 11, 2025, at the Google offices in Cambridge, Massachusetts.
This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!
Bi-weekly vLLM Office Hours
Upcoming
Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT

Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 enhances flexibility, scalability, and performance while maintaining seamless compatibility. We'll take a deep dive into the key design improvements, state-of-the-art performance gains, and our roadmap for making V1 the default engine.
vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT
Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.
Recent Recordings
Multimodal LLMs With vLLM V1 | Slides
Distributed Inference with vLLM | Slides | Blog
Blogs
Introducing vLLM Inference Provider in Llama Stack
Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with a broad set of Service Providers supplying their implementations.
Introducing Compressed Granite 3.1: Powerful Performance in a Small Package
Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.
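If you want to kick the tires, here is a minimal sketch of loading one of the compressed checkpoints with vLLM's offline Python API. The model ID and sampling settings below are illustrative assumptions; grab the actual checkpoint names from the Hugging Face collection linked in the blog.

```python
# Minimal sketch: serving a compressed Granite 3.1 checkpoint with vLLM's
# offline API. The model ID is an assumption -- substitute the published
# quantized checkpoint name from the Hugging Face collection.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16")  # hypothetical ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of model compression."], params)
print(outputs[0].outputs[0].text)
```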
How Well Do Quantized Models Handle Long-Context Tasks?
We evaluated quantized Llama 3.1 models at sequence lengths up to 128K and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.
Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM
The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.
Multimodal Model Quantization Support Through LLM Compressor
LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.
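To give a feel for the workflow, here is a hedged sketch of a data-free FP8 dynamic quantization pass with LLM Compressor's oneshot API on a vision-language model. The model ID, ignore patterns, and output directory are assumptions for illustration; the release notes and examples cover the exact recipes for each supported multimodal architecture.

```python
# Hedged sketch of data-free FP8 dynamic quantization with LLM Compressor.
# Model ID and ignore patterns are assumptions; see the v0.4.0 examples for
# the recipes published for each multimodal architecture.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # example multimodal model (assumption)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 dynamic quantization needs no calibration data; skip the language-model
# head and the vision tower so only the decoder Linear layers are quantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen2-VL-2B-Instruct-FP8-Dynamic"  # hypothetical output path
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

The resulting checkpoint can then be loaded directly in vLLM, the same way as any other compressed model.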
Research From Our Labs 🧪
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here
Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here
Activation-Informed Merging of Large Language Models | Read Here
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here
Events
- vLLM x Meta Meetup | February 27, 2025 | Register Here
- East Coast vLLM Meetup | March 11, 2025 | Register Here
Stay engaged with the vLLM community
vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.