Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

Upcoming Events

2025

  • 03.27 – vLLM Office Hours #22: Introduction to vLLM v1
  • 04.10 – vLLM Office Hours #23
  • 04.24 – vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs

March 27, 2025 - 2:00PM ET / 11:00AM PT

vLLM Office Hours #22: Introduction to vLLM v1

Michael Goin
Principal Software Engineer at Red Hat and vLLM Core Committer
Join us to learn about new developments in vLLM v1, including updates to the scheduler, memory manager, model runner, API server, and more.
April 10, 2025 - 2:00PM ET / 11:00AM PT

vLLM Office Hours #23

Michael Goin
Principal Software Engineer at Red Hat and vLLM Core Committer
Join us to learn what's new in vLLM and to ask your questions. The special topic for this session is TBD and will be chosen based on upcoming project developments and community interest.
April 24, 2025 - 2:00PM ET / 11:00AM PT

vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs

Chengji Yao
Staff Software Engineer at Google
Join us to learn about the latest developments around fast and efficient vLLM deployments on TPUs.

Previous vLLM Office Hours Recordings

vLLM Office Hours #21 - vLLM Production Stack Deep Dive - March 6, 2025

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.

Session slides: https://docs.google.com/presentation/d/1sE4IVpgPv4gGMJqv6iXJYOyd0Qm4PH__/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - DeepSeek and vLLM - February 27, 2025

In this session, we brought five vLLM core committers together to share DeepSeek’s Open Source Week releases and their integration with vLLM, alongside what’s new in vLLM v0.7.2 and v0.7.3. We dove into key advancements: MLA support for better throughput, Multi-Token Prediction for faster inference, 256 Experts for massive MoE models, handling 671B-parameter models too big for a single H100 node, and FP8 Block Quantization for efficiency. These features push the limits of scalable, resource-efficient AI.

Session slides: https://docs.google.com/presentation/d/1h2Y7YbnbhuXrCh9rkQ33ZcC5MyB65oGK/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours #19 - Multimodal LLMs With vLLM v1 - February 6, 2025

Join us for a recap of our vLLM Office Hours session where we dove deep into the exciting new multimodal capabilities in vLLM v1!

We started with an update on vLLM v0.7.0 and v0.7.1, highlighting the latest features and improvements, including recent DeepSeek model support, a call for v1 testing, and much more.

Then, we jumped into the core topic: Multimodal LLMs with vLLM v1. We explored the major architectural changes that enable multimodality, walked through the input processing pipeline, and explained the intricacies of the new caching system designed to handle multimodal data. We also shared valuable learnings and benchmark results showcasing the performance of vLLM v1's multimodal features. Finally, we offered a sneak peek into the future roadmap, discussing upcoming developments like multimodal output (image generation) support.
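
As a rough illustration of the user-facing side of what was covered, here is a minimal sketch of passing an image alongside a text prompt through vLLM's offline API. The model name, image path, and prompt template are illustrative assumptions; each model expects its own image placeholder tokens.

    # Minimal sketch: image + text inference with vLLM's offline API.
    # Model name, image path, and prompt template are illustrative.
    from PIL import Image
    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
    image = Image.open("example.jpg")

    outputs = llm.generate({
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image},
    })
    print(outputs[0].outputs[0].text)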

Timestamps:

-New vLLM Blogs Worth Checking Out: 9:24
-What's New in vLLM v0.7.0: 10:41
-What's New in vLLM v0.7.1: 14:55
-Multimodal LLMs in vLLM v1: 19:55
-Overview of Large Multimodal Models: 21:23
-What Did We Learn About Multimodal Models in vLLM v0: 22:22
-vLLM v1 - Encoder Cache & Encoder-Aware Scheduler: 26:05
-vLLM v1 - Prefix Caching: 30:18
-vLLM v1 - Multimodal Data Processing: 32:20
-vLLM v1 - Optimized Engine Loop: 33:40
-vLLM v1 - Multimodal Feature Caching: 34:35
-Benchmarks - Online Serving: 37:34
-Benchmarks - Offline Inference: 38:41
-Future Work: 40:41
-Open Discussion and Audience Q&A: 44:06
-Get Involved With the vLLM Community: 59:16

Session slides: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

#vLLM #Multimodal #LLM #AI #MachineLearning #vLLMOfficeHours #vLLMv1

vLLM Office Hours - Distributed Inference with vLLM - January 23, 2025

In this session, we explored the motivation for distributed inference, delving into vLLM architecture and GPU parallelism to enhance performance. We discussed the challenges of serving large models, introduced the concept of tensor parallelism, and examined the benefits and trade-offs of leveraging multiple GPUs for inference. We also highlighted profiling tools for analyzing kernel performance and overhead, along with the potential challenges of adopting a disaggregated approach with separate nodes for prefill and decoding.
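
As a rough companion to the tensor parallelism discussion, here is a minimal sketch using vLLM's offline Python API; the model checkpoint and GPU count are illustrative assumptions, and the same setting is exposed on the server side as --tensor-parallel-size.

    # Minimal sketch: sharding a large model across multiple GPUs with
    # tensor parallelism. Model name and GPU count are illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example checkpoint
        tensor_parallel_size=4,                     # split weights across 4 GPUs
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
    print(outputs[0].outputs[0].text)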

During the open discussion, we addressed various community questions, including practical applications of tensor parallelism in real-world scenarios, the impact of distributed inference on latency and throughput, and strategies for optimizing multi-GPU setups.

Session slides: https://docs.google.com/presentation/d/10o1olgyQ3UH1AMQ_uln7ptXNahZRFdhZ/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - vLLM Project Update and Open Discussion - January 09, 2025

In this session, we shared the latest updates in vLLM v0.6.6, including exciting new features such as Prefix Caching for Vision Language Models and support for macOS with Apple Silicon (M1 and newer). We also previewed the vLLM Roadmap for Q1 2025, highlighting upcoming advancements to accelerate LLM inference and enhance cross-platform compatibility.
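
For reference, automatic prefix caching is a single switch in vLLM's offline API. The sketch below assumes an example vision-language model and two prompts that share a common prefix, so cached KV blocks can be reused across requests.

    # Minimal sketch: enabling automatic prefix caching (model name is
    # illustrative; per the v0.6.6 notes above, this also covers VLMs).
    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen2-VL-7B-Instruct",  # example model; any supported model works
        enable_prefix_caching=True,         # reuse KV blocks shared across prompts
    )

    shared = "You are a helpful assistant. Answer concisely.\n"
    outputs = llm.generate([shared + "What is vLLM?", shared + "What is a KV cache?"])
    for out in outputs:
        print(out.outputs[0].text)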

During the open discussion, we tackled several community questions. These included when bind_tools support for the LangChain API will be available in the vLLM integration, whether DeepSeek FP8 quantization is truly blockwise (2D) or 1D groupwise, and plans for expert-parallel optimizations within Mixture of Experts (MoE). Participants also asked how vLLM relates to other frameworks such as Unsloth, Hugging Face, and Georgi Gerganov's llama.cpp, and whether there is a map of the landscape.

Session slides: https://docs.google.com/presentation/d/1Uic6jQZRUS9l7TuoNeaBrjeLwAGa98xs/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024

In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.

Highlights:

[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussion and Q&A

Audience Q&A included discussions on:

-Support for older GPU architectures like Pascal and Volta (V100)
-OpenAI API-compliance for tool calling in vLLM
-Deployment recipes for production-ready solutions
-Structured outputs and their role in vLLM’s evolution
-And more...

URL Pointers:
View the slides: https://docs.google.com/presentation/d/1Z78ljqPIg7_KZ7ZAqKO4VDjKG-ytbkbZ/
Register for upcoming vLLM Office Hours: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

In this session, we explored Machete, Neural Magic's newest innovation in mixed-input GEMM kernel design for NVIDIA Hopper GPUs. Built on top of advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute-bound and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robust support for mixed-input scenarios. Machete supports w4a16 and w8a16 compressed-tensors models, GPTQ models, and more.
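
From the user's side, these kernels come into play when a compatible quantized checkpoint is loaded on Hopper hardware. A minimal sketch follows, assuming an example w4a16 checkpoint name.

    # Minimal sketch: loading a w4a16 compressed-tensors checkpoint in vLLM
    # on an H100, the setting where mixed-input GEMM kernels such as Machete
    # apply. The model name is an assumed example.
    from vllm import LLM

    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
    print(llm.generate(["What is a mixed-input GEMM?"])[0].outputs[0].text)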

Session slides: https://docs.google.com/presentation/d/1y1OJDFWir5WxbTH0NwYlAxPxwgfcb4oF/

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM instances improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundaries of distributed inference efficiency.

The Q&A session included topics such as the practical gains of improving KV cache transmission and its impact on throughput. We explored comparisons between vLLM's implementation and other approaches like NCCL and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs of compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and challenges in combining disaggregated prefill with automatic prefix caching.

Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs

Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

In this session, we dive deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discuss the importance of tools and functions in open-source AI and provide insights into the challenges and solutions around OpenAI-style tools in vLLM.

During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from a delta stream, and specific application examples where partial visibility into JSON arguments proves advantageous. Additional questions covered plans for supporting OpenAI’s "strict" field in tool definitions for structured output, best practices for tool-calling formats in model fine-tuning, and the choice of OpenAI's chat completions API as a standard over the assistant’s API for tool selection.
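
Since vLLM exposes an OpenAI-compatible server, tool calling is exercised through the standard chat completions client. The sketch below is illustrative: it assumes a locally running vLLM server started with tool calling enabled (for example via its auto tool choice and tool-call parser options), and a hypothetical get_weather tool schema.

    # Minimal sketch: OpenAI-style tool calling against a vLLM server.
    # Assumes a server is already running at localhost:8000 with tool calling
    # enabled; the model name and get_weather tool are illustrative.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever the server is serving
        messages=[{"role": "user", "content": "What's the weather in Boston right now?"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)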

Session slides: https://docs.google.com/presentation/d/1LSEiycGVR9Cnz0FFkMrcoAzxW9bQQhp3/edit#slide=id.p1

Stay connected and join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

In this session of our bi-weekly vLLM office hours, we explored the updates in the vLLM v0.6.3 release, featuring experimental full-graph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1/3.2 and InternLM2.5, and Reward LM support for Qwen2.5-Math-RM-72B.

During the special topic deep dive, we were joined by Mistral AI research engineer Patrick von Platen, who shared insights into Mistral's architecture choices and how to efficiently deploy Mistral's models on vLLM.

During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.

Session slides: https://docs.google.com/presentation/d/1fF4ZlnAFXDeKHBGzkJsCeXLkarvlbNRx

Explore and join our bi-weekly vLLM office hours every other Thursday: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its different types, performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.
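
As a rough companion to the discussion, here is a minimal sketch of draft-model speculative decoding with vLLM's offline API, using engine arguments as they existed around the v0.6.x releases; argument names may differ in later versions, and the target and draft model choices are illustrative.

    # Minimal sketch: draft-model speculative decoding in vLLM (~v0.6.x-era
    # arguments; model names are illustrative).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",             # target model
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # small draft model
        num_speculative_tokens=5,                              # draft tokens proposed per step
    )

    out = llm.generate(["Summarize speculative decoding in two sentences."],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)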

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/