Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. A typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

Upcoming Events

Distributed Inference With vLLM
January 23, 2025 - 2:00PM ET / 11:00AM PT

Murali Andoorveedu
Software Development Engineer II
Nick Hill
Senior Principal Software Engineer, AI Engineering at Red Hat
Join our upcoming vLLM Office Hours as we dive into distributed inference with vLLM. We'll explore common pitfalls, practical implementation strategies, and steps to get started, with insights tailored to real-world challenges like those discussed here (https://github.com/vllm-project/vllm/discussions/10118). Whether you're optimizing for large-scale deployments or exploring distributed setups, this session is packed with actionable guidance.
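
To get a feel for the topic ahead of the session, here is a minimal sketch of distributed serving with vLLM; the model name and GPU counts are placeholders, not recommendations from the session:

    from vllm import LLM, SamplingParams

    # Shard the model's weights across 4 GPUs on one node with tensor parallelism.
    # Equivalent server command: vllm serve <model> --tensor-parallel-size 4
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
        tensor_parallel_size=4,                     # GPUs per node
        # pipeline_parallel_size=2,                 # optionally split layers across nodes
    )
    outputs = llm.generate(["What is distributed inference?"],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)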

Previous vLLM Office Hours Recordings

vLLM Office Hours - vLLM Project Update and Open Discussion - January 09, 2025

In this session, we shared the latest updates in vLLM v0.6.6, including exciting new features such as Prefix Caching for Vision Language Models and support for macOS with Apple Silicon (M1 and newer). We also previewed the vLLM Roadmap for Q1 2025, highlighting upcoming advancements to accelerate LLM inference and enhance cross-platform compatibility.
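
As a rough sketch of enabling prefix caching (the session's news is that this now extends to vision language models; the model name and prompts here are placeholders):

    from vllm import LLM, SamplingParams

    # Prefix caching reuses KV cache blocks for prompts that share a prefix,
    # such as a long system prompt repeated across many requests.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              enable_prefix_caching=True)

    shared_prefix = "You are a meticulous support agent. Follow the policy below. " * 20
    prompts = [shared_prefix + q for q in
               ["How do I reset my password?", "How do I close my account?"]]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=32))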

During the open discussion, we tackled several community questions, including when bind_tools support will land in the LangChain vLLM integration, whether DeepSeek FP8 quantization is truly blockwise (2D) or 1D groupwise, and plans for expert-parallel optimizations within Mixture of Experts (MoE) models. Participants also asked how vLLM relates to other frameworks like Unsloth, Hugging Face, and Georgi Gerganov's llama.cpp, and whether there is a map of the landscape.

Session slides: https://docs.google.com/presentation/d/1Uic6jQZRUS9l7TuoNeaBrjeLwAGa98xs/

Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024

In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.
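
For readers curious about the structured outputs feature, here is a hedged sketch using vLLM's OpenAI-compatible server; the endpoint, port, and model name are assumptions:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1",  # assumed local vLLM server
                    api_key="EMPTY")

    # guided_json is a vLLM extension that constrains decoding to match a JSON schema.
    schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
        "required": ["name", "year"],
    }
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        messages=[{"role": "user",
                   "content": "Name one LLM serving project and its launch year as JSON."}],
        extra_body={"guided_json": schema},
    )
    print(resp.choices[0].message.content)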

Highlights:

[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussion and Q&A

Audience Q&A included discussions on:

- Support for older GPU architectures like Pascal and V100s
- OpenAI API compliance for tool calling in vLLM
- Deployment recipes for production-ready solutions
- Structured outputs and their role in vLLM’s evolution
- And more...

Session slides: https://docs.google.com/presentation/d/1Z78ljqPIg7_KZ7ZAqKO4VDjKG-ytbkbZ/
Register for upcoming vLLM Office Hours: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

In this session, we explored Machete, Neural Magic's newest mixed-input GEMM kernel for NVIDIA Hopper GPUs. Built on advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute-bound and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robust support for mixed-input scenarios. Machete supports w4a16 and w8a16 compressed-tensors models, GPTQ models, and more.
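
For context, here is a minimal sketch of serving a w4a16 model; the checkpoint name is illustrative, and vLLM selects the kernel (such as Machete on Hopper) automatically:

    from vllm import LLM, SamplingParams

    # Load a 4-bit-weight / 16-bit-activation checkpoint; on Hopper GPUs the
    # mixed-input GEMMs can be dispatched to the Machete kernel.
    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16")  # illustrative
    out = llm.generate(["Explain mixed-input GEMM in one sentence."],
                       SamplingParams(max_tokens=48))
    print(out[0].outputs[0].text)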

Session slides: https://docs.google.com/presentation/d/1y1OJDFWir5WxbTH0NwYlAxPxwgfcb4oF/

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM instances improves performance through faster KV cache delivery and the composition of multiple KV caches. These advancements are designed to push the boundaries of distributed inference efficiency.
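
As a heavily hedged sketch of the prefill side of a disaggregated setup: the class, field, and connector names below follow vLLM's experimental disaggregated prefill example from around this time and may have changed since; a matching decode instance with kv_role="kv_consumer" would run in a separate process:

    from vllm import LLM
    from vllm.config import KVTransferConfig

    # This instance produces KV caches during prefill and ships them to a
    # paired decode instance (assumed names; check your vLLM version's example).
    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'
    )
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              kv_transfer_config=ktc)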

The Q&A session included topics such as the practical gains of improving KV cache transmission and its impact on throughput. We explored comparisons between vLLM's implementation and other approaches like NCCL and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs of compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and challenges in combining disaggregated prefill with automatic prefix caching.

Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs

Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

In this session, we dove deep into the implementation of state-of-the-art (SOTA) tool calling in vLLM. We discussed the importance of tools and functions in open-source AI and provided insights into the challenges and solutions around OpenAI-style tools in vLLM.
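
As a sketch of what OpenAI-style tool calling looks like against a vLLM server (the model, endpoint, and get_weather tool are illustrative; the parser flag depends on the model family):

    # Server side, with automatic tool choice enabled:
    #   vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
    #       --enable-auto-tool-choice --tool-call-parser hermes

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    resp = client.chat.completions.create(
        model="NousResearch/Hermes-2-Pro-Llama-3-8B",
        messages=[{"role": "user", "content": "What's the weather in Boston?"}],
        tools=tools,
        tool_choice="auto",
    )
    print(resp.choices[0].message.tool_calls)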

During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from a delta stream, and specific applications where partial visibility into JSON arguments proves advantageous. Additional questions covered plans for supporting OpenAI’s "strict" field in tool definitions for structured output, best practices for tool-calling formats in model fine-tuning, and the choice of OpenAI's chat completions API as a standard over the Assistants API for tool selection.

Session slides: https://docs.google.com/presentation/d/1LSEiycGVR9Cnz0FFkMrcoAzxW9bQQhp3/edit#slide=id.p1

Stay connected and join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental full-graph torch.compile support, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1, Llama 3.2, and InternLM2.5, and reward model support for Qwen2.5-Math-RM-72B.

During our special topic deep dives, we were joined by Mistral AI’s research engineer, Patrick von Platen, who shared insights into Mistral’s architecture choices and how to efficiently deploy Mistral's models on vLLM.
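
As a minimal sketch of deploying a Mistral model on vLLM (the checkpoint is a placeholder; tokenizer_mode="mistral" opts into Mistral's own tokenizer via mistral-common rather than the Hugging Face tokenizer):

    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407",  # placeholder model
              tokenizer_mode="mistral")
    out = llm.generate(["Summarize vLLM in one sentence."],
                       SamplingParams(max_tokens=40))
    print(out[0].outputs[0].text)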

During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.

Session slides: https://docs.google.com/presentation/d/1fF4ZlnAFXDeKHBGzkJsCeXLkarvlbNRx

Explore and join our bi-weekly vLLM office hours every other Thursday: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explored the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for the API server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joined us to discuss speculative decoding in vLLM. She provided insights into what speculative decoding is, its different types, its performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.
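
For reference, a minimal sketch of draft-model speculative decoding; the argument names match vLLM's engine arguments of this era (newer releases restructure them), and both model names are placeholders:

    from vllm import LLM, SamplingParams

    # A small draft model proposes several tokens per step; the target model
    # verifies them in one forward pass without changing the output distribution.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",             # target (placeholder)
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # draft (placeholder)
        num_speculative_tokens=5,                              # proposals per step
    )
    out = llm.generate(["What is speculative decoding?"],
                       SamplingParams(max_tokens=64))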

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

In this session of Neural Magic's bi-weekly vLLM office hours, we covered the latest updates in vLLM v0.6.0 and v0.6.1, including vision language model support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delved into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that delivered a 2.7x throughput improvement and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM.
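
As a minimal sketch of the FP8 path (the model name is a placeholder; quantization="fp8" quantizes weights on the fly, while pre-quantized FP8 or INT8 checkpoints are picked up automatically):

    from vllm import LLM, SamplingParams

    # On supported GPUs, the scaled FP8 GEMMs run through vLLM's
    # CUTLASS-based kernels.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
              quantization="fp8")
    out = llm.generate(["Why quantize to FP8?"], SamplingParams(max_tokens=48))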

During the Q&A, we tackled a variety of audience questions around hardware diversity, different quantization methods, pros and cons of using torch.compile in vLLM, deployment strategies for multiple copies of vLLM using a custom Docker entrypoint script, and more.

Session slides: https://docs.google.com/presentation/d/184uArSlJTwuoS1SOTT8jNSUE8ojJdzHh

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline.

During the Q&A, we tackled a variety of audience questions around performance data, running vLLM on Red Hat Enterprise Linux 9 for ONNX and GGUF models, supporting LoRA and Prompt Adapters, vLLM’s roadmap for supporting FP8 KV cache, and much more.

Check out the session slides here: https://docs.google.com/presentation/d/141DSi37KlLbDIoSjODrO_6C01xrzCCKm

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

In this session, we brought on Roger Wang, a vLLM Committer and software engineer on the ML Platform team at Roblox, to discuss the development of transformer-based multimodal model support in vLLM. Roger shared insights on effectively using vision language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM.

Additionally, we touched on the vLLM v0.5.4 release, including model support for Nemotron, InternVL2, BLIP-2, H2O Danube3-4b, MiniCPM-V, and many performance improvements for throughput use cases.
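
For readers new to multimodal inference in vLLM, here is a minimal sketch (the model, image path, and LLaVA-style prompt format are illustrative; prompt formats are model-specific):

    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # illustrative VLM
    image = Image.open("example.jpg")            # placeholder local image

    # The image rides alongside the prompt via multi_modal_data;
    # <image> marks where the vision tokens are inserted.
    out = llm.generate(
        {"prompt": "USER: <image>\nWhat is in this picture? ASSISTANT:",
         "multi_modal_data": {"image": image}},
        SamplingParams(max_tokens=64),
    )
    print(out[0].outputs[0].text)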

Video Timestamps:
00:00 - 01:14 - Intro to Multimodal Models in vLLM
01:14 - 02:47 - About Roger Wang and the Team Building Multimodal Support in vLLM
02:47 - 04:19 - Overview of Large Multimodal Models
04:19 - 14:40 - Milestones and Examples of Multimodal Models in vLLM
14:40 - 16:52 - Supported Multimodal Models and Roadmap
16:52 - 22:57 - Multimodal Llama 3.1
22:57 - 31:56 - What's New in vLLM v0.5.4
31:56 - 47:42 - Open Q&A and vLLM Community Discussion
47:42 - 50:03 - How to Get Involved with the vLLM Community

Check out the session slides here: https://docs.google.com/presentation/d/1Uq9m17PMn8NYZOCXlNZBnHMvmZG0am8j

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

In this session, we brought on model compression expert Eldar Kurtić to discuss model quantization for efficient vLLM inference. Eldar shared the why, when, and how of quantizing LLMs for efficient inference, and introduced llm-compressor, a new library for optimizing LLMs for accurate inference in vLLM.
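
As a hedged sketch of one-shot quantization with llm-compressor (names follow the library's quickstart of this era; the model, dataset, and scheme are placeholders):

    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.transformers import oneshot

    # Apply GPTQ-style W4A16 quantization in one shot, calibrating on a
    # small dataset, then save a checkpoint that vLLM can serve directly.
    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
        dataset="open_platypus",                    # placeholder calibration set
        recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
        output_dir="Llama-3.1-8B-Instruct-W4A16",
        max_seq_length=2048,
        num_calibration_samples=512,
    )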

Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features.

Check out the session slides here: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/