Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. A typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

Upcoming Events

Distributed Inference With vLLM
January 23, 2025 - 2:00PM ET / 11:00AM PT

Murali Andoorveedu
Software Development Engineer II
Nick Hill
Senior Principal Software Engineer, AI Engineering at Red Hat
Join our upcoming vLLM Office Hours as we dive into distributed inference with vLLM. We'll explore common pitfalls, practical implementation strategies, and steps to get started, with insights tailored to real-world challenges like those discussed here (https://github.com/vllm-project/vllm/discussions/10118). Whether you're optimizing for large-scale deployments or exploring distributed setups, this session is packed with actionable guidance.
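
To get a feel for the topic ahead of the session, here is a minimal sketch of distributed serving with vLLM; the model name and GPU counts are placeholders, not recommendations from the session:

    from vllm import LLM, SamplingParams

    # Shard the model's weights across 4 GPUs on one node with tensor parallelism.
    # Equivalent server command: vllm serve <model> --tensor-parallel-size 4
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
        tensor_parallel_size=4,                     # GPUs per node
        # pipeline_parallel_size=2,                 # optionally split layers across nodes
    )
    outputs = llm.generate(["What is distributed inference?"],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)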

Previous vLLM Office Hours Recordings

vLLM Office Hours - vLLM Project Update and Open Discussion - January 09, 2025

In this session, we shared the latest updates in vLLM v0.6.6, including exciting new features such as Prefix Caching for Vision Language Models and support for macOS with Apple Silicon (M1 and newer). We also previewed the vLLM Roadmap for Q1 2025, highlighting upcoming advancements to accelerate LLM inference and enhance cross-platform compatibility.
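
As a rough sketch of enabling prefix caching (the session's news is that this now extends to vision language models; the model name and prompts here are placeholders):

    from vllm import LLM, SamplingParams

    # Prefix caching reuses KV cache blocks for prompts that share a prefix,
    # such as a long system prompt repeated across many requests.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              enable_prefix_caching=True)

    shared_prefix = "You are a meticulous support agent. Follow the policy below. " * 20
    prompts = [shared_prefix + q for q in
               ["How do I reset my password?", "How do I close my account?"]]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=32))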

During the open discussion, we tackled several community questions, including when bind_tools support will land in the LangChain vLLM integration, whether DeepSeek FP8 quantization is truly blockwise (2D) or 1D groupwise, and plans for expert-parallel optimizations within Mixture of Experts (MoE) models. Participants also asked how vLLM relates to other frameworks like Unsloth, Hugging Face, and Georgi Gerganov's llama.cpp, and whether there is a map of the landscape.

Session slides: https://docs.google.com/presentation/d/1Uic6jQZRUS9l7TuoNeaBrjeLwAGa98xs/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024

In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.
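
For readers curious about the structured outputs feature, here is a hedged sketch using vLLM's OpenAI-compatible server; the endpoint, port, and model name are assumptions:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1",  # assumed local vLLM server
                    api_key="EMPTY")

    # guided_json is a vLLM extension that constrains decoding to match a JSON schema.
    schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
        "required": ["name", "year"],
    }
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        messages=[{"role": "user",
                   "content": "Name one LLM serving project and its launch year as JSON."}],
        extra_body={"guided_json": schema},
    )
    print(resp.choices[0].message.content)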

Highlights:

[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussion and Q&A

Audience Q&A included discussions on:

- Support for older GPU architectures like Pascal and V100s
- OpenAI API compliance for tool calling in vLLM
- Deployment recipes for production-ready solutions
- Structured outputs and their role in vLLM’s evolution
- And more...

Session slides: https://docs.google.com/presentation/d/1Z78ljqPIg7_KZ7ZAqKO4VDjKG-ytbkbZ/
Register for upcoming vLLM Office Hours: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

In this session, we explored Machete, Neural Magic's newest mixed-input GEMM kernel for NVIDIA Hopper GPUs. Built on advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute-bound and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robust support for mixed-input scenarios. Machete supports w4a16 and w8a16 compressed-tensors models, GPTQ models, and more.
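
For context, here is a minimal sketch of serving a w4a16 model; the checkpoint name is illustrative, and vLLM selects the kernel (such as Machete on Hopper) automatically:

    from vllm import LLM, SamplingParams

    # Load a 4-bit-weight / 16-bit-activation checkpoint; on Hopper GPUs the
    # mixed-input GEMMs can be dispatched to the Machete kernel.
    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16")  # illustrative
    out = llm.generate(["Explain mixed-input GEMM in one sentence."],
                       SamplingParams(max_tokens=48))
    print(out[0].outputs[0].text)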

Session slides: https://docs.google.com/presentation/d/1y1OJDFWir5WxbTH0NwYlAxPxwgfcb4oF/

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM instances improves performance through faster KV cache delivery and the composition of multiple KV caches. These advancements are designed to push the boundaries of distributed inference efficiency.
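
As a heavily hedged sketch of the prefill side of a disaggregated setup: the class, field, and connector names below follow vLLM's experimental disaggregated prefill example from around this time and may have changed since; a matching decode instance with kv_role="kv_consumer" would run in a separate process:

    from vllm import LLM
    from vllm.config import KVTransferConfig

    # This instance produces KV caches during prefill and ships them to a
    # paired decode instance (assumed names; check your vLLM version's example).
    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'
    )
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              kv_transfer_config=ktc)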

The Q&A session included topics such as the practical gains of improving KV cache transmission and its impact on throughput. We explored comparisons between vLLM's implementation and other approaches like NCCL and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs of compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and challenges in combining disaggregated prefill with automatic prefix caching.

Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs

Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

In this session, we dove deep into the implementation of state-of-the-art (SOTA) tool calling in vLLM. We discussed the importance of tools and functions in open-source AI and provided insights into the challenges and solutions around OpenAI-style tools in vLLM.
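
As a sketch of what OpenAI-style tool calling looks like against a vLLM server (the model, endpoint, and get_weather tool are illustrative; the parser flag depends on the model family):

    # Server side, with automatic tool choice enabled:
    #   vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
    #       --enable-auto-tool-choice --tool-call-parser hermes

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    resp = client.chat.completions.create(
        model="NousResearch/Hermes-2-Pro-Llama-3-8B",
        messages=[{"role": "user", "content": "What's the weather in Boston?"}],
        tools=tools,
        tool_choice="auto",
    )
    print(resp.choices[0].message.tool_calls)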

During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from a delta stream, and specific applications where partial visibility into JSON arguments proves advantageous. Additional questions covered plans for supporting OpenAI’s "strict" field in tool definitions for structured output, best practices for tool-calling formats in model fine-tuning, and the choice of OpenAI's chat completions API as a standard over the Assistants API for tool selection.

Session slides: https://docs.google.com/presentation/d/1LSEiycGVR9Cnz0FFkMrcoAzxW9bQQhp3/edit#slide=id.p1

Stay connected and join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental full-graph torch.compile support, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1, Llama 3.2, and InternLM2.5, and reward model support for Qwen2.5-Math-RM-72B.

During our special topic deep dives, we were joined by Mistral AI’s research engineer, Patrick von Platen, who shared insights into Mistral’s architecture choices and how to efficiently deploy Mistral's models on vLLM.
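
As a minimal sketch of deploying a Mistral model on vLLM (the checkpoint is a placeholder; tokenizer_mode="mistral" opts into Mistral's own tokenizer via mistral-common rather than the Hugging Face tokenizer):

    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407",  # placeholder model
              tokenizer_mode="mistral")
    out = llm.generate(["Summarize vLLM in one sentence."],
                       SamplingParams(max_tokens=40))
    print(out[0].outputs[0].text)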

During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.

Session slides: https://docs.google.com/presentation/d/1fF4ZlnAFXDeKHBGzkJsCeXLkarvlbNRx

Explore and join our bi-weekly vLLM office hours every other Thursday: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explored the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for the API server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joined us to discuss speculative decoding in vLLM. She provided insights into what speculative decoding is, its different types, its performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.
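
For reference, a minimal sketch of draft-model speculative decoding; the argument names match vLLM's engine arguments of this era (newer releases restructure them), and both model names are placeholders:

    from vllm import LLM, SamplingParams

    # A small draft model proposes several tokens per step; the target model
    # verifies them in one forward pass without changing the output distribution.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",             # target (placeholder)
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # draft (placeholder)
        num_speculative_tokens=5,                              # proposals per step
    )
    out = llm.generate(["What is speculative decoding?"],
                       SamplingParams(max_tokens=64))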

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

In this session of Neural Magic's bi-weekly vLLM office hours, we covered the latest updates in vLLM v0.6.0 and v0.6.1, including vision language model support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delved into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that delivered a 2.7x throughput improvement and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM.
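
As a minimal sketch of the FP8 path (the model name is a placeholder; quantization="fp8" quantizes weights on the fly, while pre-quantized FP8 or INT8 checkpoints are picked up automatically):

    from vllm import LLM, SamplingParams

    # On supported GPUs, the scaled FP8 GEMMs run through vLLM's
    # CUTLASS-based kernels.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              quantization="fp8")
    out = llm.generate(["Why quantize to FP8?"], SamplingParams(max_tokens=48))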

During the Q&A, we tackled a variety of audience questions around hardware diversity, different quantization methods, pros and cons of using torch.compile in vLLM, deployment strategies for multiple copies of vLLM using a custom Docker entrypoint script, and more.

Session slides: https://docs.google.com/presentation/d/184uArSlJTwuoS1SOTT8jNSUE8ojJdzHh

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline.

During the Q&A, we tackled a variety of audience questions around performance data, running vLLM on Red Hat Enterprise Linux 9 for ONNX and GGUF models, supporting LoRA and Prompt Adapters, vLLM’s roadmap for supporting FP8 KV cache, and much more.

Check out the session slides here: https://docs.google.com/presentation/d/141DSi37KlLbDIoSjODrO_6C01xrzCCKm

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

In this session, we brought on Roger Wang, a vLLM Committer and software engineer on the ML Platform team at Roblox, to discuss the development of transformer-based multimodal model support in vLLM. Roger shared insights on effectively using vision language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM.

Additionally, we touched on the vLLM v0.5.4 release, including model support for Nemotron, InternVL2, BLIP-2, H2O Danube3-4b, MiniCPM-V, and many performance improvements for throughput use cases.
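
For readers new to multimodal inference in vLLM, here is a minimal sketch (the model, image path, and LLaVA-style prompt format are illustrative; prompt formats are model-specific):

    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # illustrative VLM
    image = Image.open("example.jpg")            # placeholder local image

    # The image rides alongside the prompt via multi_modal_data;
    # <image> marks where the vision tokens are inserted.
    out = llm.generate(
        {"prompt": "USER: <image>\nWhat is in this picture? ASSISTANT:",
         "multi_modal_data": {"image": image}},
        SamplingParams(max_tokens=64),
    )
    print(out[0].outputs[0].text)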

Video Timestamps:
00:00 - 01:14 - Intro to Multimodal Models in vLLM
01:14 - 02:47 - About Roger Wang and the Team Building Multimodal Support in vLLM
02:47 - 04:19 - Overview of Large Multimodal Models
04:19 - 14:40 - Milestones and Examples of Multimodal Models in vLLM
14:40 - 16:52 - Supported Multimodal Models and Roadmap
16:52 - 22:57 - Multimodal Llama 3.1
22:57 - 31:56 - What's New in vLLM v0.5.4
31:56 - 47:42 - Open Q&A and vLLM Community Discussion
47:42 - 50:03 - How to Get Involved with the vLLM Community

Check out the session slides here: https://docs.google.com/presentation/d/1Uq9m17PMn8NYZOCXlNZBnHMvmZG0am8j

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

In this session, we brought on model compression expert Eldar Kurtić to discuss model quantization for efficient vLLM inference. Eldar shared the why, when, and how of quantizing LLMs for efficient inference, and introduced llm-compressor, a new library for optimizing LLMs for accurate inference in vLLM.
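
As a hedged sketch of one-shot quantization with llm-compressor (names follow the library's quickstart of this era; the model, dataset, and scheme are placeholders):

    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.transformers import oneshot

    # Apply GPTQ-style W4A16 quantization in one shot, calibrating on a
    # small dataset, then save a checkpoint that vLLM can serve directly.
    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
        dataset="open_platypus",                    # placeholder calibration set
        recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
        output_dir="Llama-3.1-8B-Instruct-W4A16",
        max_seq_length=2048,
        num_calibration_samples=512,
    )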

Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features.

Check out the session slides here: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/