Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

Upcoming Events

2025
January
01.09
vLLM Project Update & Open Discussion
01.23
Distributed Inference With vLLM
January 9, 2025 - 2:00PM ET / 11:00AM PT

vLLM Project Update & Open Discussion

Michael Goin
Engineering Lead, Neural Magic
Join us to hear the latest vLLM project update, ask questions, and interact with the community.
January 23, 2025 - 2:00PM ET / 11:00AM PT

Distributed Inference With vLLM

Murali Andoorveedu
Software Development Engineer II
Nick Hill
Senior Principal Software Engineer, AI Engineering at Red Hat
Join our upcoming vLLM Office Hours as we dive into distributed inference with vLLM. We'll explore common pitfalls, practical implementation strategies, and steps to get started, with insights tailored to real-world challenges like those discussed here (https://github.com/vllm-project/vllm/discussions/10118). Whether you're optimizing for large-scale deployments or exploring distributed setups, this session is packed with actionable guidance.

Previous vLLM Office Hours Recordings

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

In this session, we explored Machete, Neural Magic's newest innovation in mixed-input GEMM kernel design for NVIDIA Hopper GPUs. Built on top of advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute- and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robust support for mixed-input scenarios. Machete supports w4a16 and w8a16 compressed-tensors models, GPTQ models, and more.

Session slides: https://docs.google.com/presentation/d/1y1OJDFWir5WxbTH0NwYlAxPxwgfcb4oF/

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundaries of distributed inference efficiency.

The Q&A session included topics such as the practical gains of improving KV cache transmission and its impact on throughput. We explored comparisons between vLLM's implementation and other approaches like NCCL and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs of compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and challenges in combining disaggregated prefill with automatic prefix caching.

Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs

Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0
...

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

In this session, we dove deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discussed the importance of tools and functions in open-source AI and provided insights into the challenges and solutions around OpenAI-style tools in vLLM.

During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from a delta stream, and specific application examples where partial visibility into JSON arguments proves advantageous. Additional questions covered plans for supporting OpenAI’s "strict" field in tool definitions for structured output, best practices for tool-calling formats in model fine-tuning, and the choice of OpenAI's chat completions API as a standard over the assistant’s API for tool selection.

Session slides: https://docs.google.com/presentation/d/1LSEiycGVR9Cnz0FFkMrcoAzxW9bQQhp3/edit#slide=id.p1

Stay connected and join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0
...

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental fullgraph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1 and 3.2 and InternLM2.5, and Reward LM support for Qwen2.5-Math-RM-72B.

During our special topic deep dives, we were joined by Mistral AI’s research engineer, Patrick von Platen, who shared insights into Mistral’s architecture choices and how to efficiently deploy Mistral's models on vLLM.

During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.

Session slides: https://docs.google.com/presentation/d/1fF4ZlnAFXDeKHBGzkJsCeXLkarvlbNRx

Explore and join our bi-weekly vLLM office hours every other Thursday: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explored the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for the API Server, and beam search externalization. Following these updates, Lily Liu, vLLM committer and PhD student at UC Berkeley, joined us to discuss speculative decoding in vLLM. She provided insights into what speculative decoding is, its different types, its performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

In this session of Neural Magic's bi-weekly vLLM office hours, we covered the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delved into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM.

During the Q&A, we tackled a variety of audience questions around hardware diversity, different quantization methods, pros and cons of using torch.compile in vLLM, deployment strategies for multiple copies of vLLM using a custom Docker entrypoint script, and more.

Session slides: https://docs.google.com/presentation/d/184uArSlJTwuoS1SOTT8jNSUE8ojJdzHh

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline.

During the Q&A, we tackled a variety of audience questions around performance data, running vLLM on Red Hat 9 for ONNX and GGUF models, supporting LoRA and Prompt Adapters, vLLM's roadmap for supporting FP8 KV cache, and much more.

Check out the session slides here: https://docs.google.com/presentation/d/141DSi37KlLbDIoSjODrO_6C01xrzCCKm

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer, ML Platform at Roblox, to discuss the development of supporting transformer-based multimodal models on vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM.

Additionally, we touched on the vLLM v0.5.4 release, including model support for Nemotron, InternVL2, BLIP-2, H2O Danube3-4b, MiniCPM-V, and many performance improvements for throughput use cases.

Video Timestamps:
00:00 - 01:14 - Intro to Multimodal Models in vLLM
01:14 - 02:47 - About Roger Wang and the Team Building Multimodal Support in vLLM
02:47 - 04:19 - Overview of Large Multimodal Models
04:19 - 14:40 - Milestones and Examples of Multimodal Models in vLLM
14:40 - 16:52 - Supported Multimodal Models and Roadmap
16:52 - 22:57 - Multimodal Llama 3.1
22:57 - 31:56 - What's New in vLLM v0.5.4
31:56 - 47:42 - Open Q&A and vLLM Community Discussion
47:42 - 50:03 - How to Get Involved with the vLLM Community

Check out the session slides here: https://docs.google.com/presentation/d/1Uq9m17PMn8NYZOCXlNZBnHMvmZG0am8j

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how to quantize LLMs for efficient inference. He introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM.

Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features.

Check out the session slides here: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/
...

vLLM Office Hours - June 20, 2024

Happy one-year anniversary, vLLM! In this session, we covered what's new in vLLM v0.5.5, including FP8 weights and activations, speculative decoding, and OpenAI Vision API support. We dug deeper into various topics, including new quantization kernels, GPU architecture compatibility, embeddings in the OpenAI API, optimization tips for GPTQ configurations, and handling concurrent requests in the API server. For more details, you can access the session slides here: https://docs.google.com/presentation/d/1BAGbJ-aGYrAMUugReF758u5JUT9EAJLn

Sign up for bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/
...

vLLM and Neural Magic Office Hours - June 5, 2024

As one of the top contributors to the vLLM project, Neural Magic teams up with the vLLM team from UC Berkeley every two weeks to host open office hours. Check out our June 5, 2024 session, where we answered some great questions from participants.

We kicked off the session with a quick recap of vLLM and how Neural Magic can support enterprises today in successfully integrating vLLM into their AI strategy. You'll hear answers to audience questions about post-training quantization, maximizing GPU usage for 70B LLMs, differences between vLLM and Hugging Face TGI, cache management, tensor parallelism, and more. You can see the session slides here: https://docs.google.com/presentation/d/1B50uCXzAarawDDizElNzi2o55fkgJZSm/edit#slide=id.p1

Do you have questions about vLLM that you'd like addressed directly by the experts? Join our next vLLM office hours and post your questions here: https://neuralmagic.com/community-office-hours/
...