Bi-Weekly vLLM Office Hours

Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Typical office hours agenda:

20-minute vLLM update
20-minute special guest topic; see below for details 👇
20-minute open discussion, feedback loop, and Q&A

Upcoming Events

2024

October

10.30

SOTA Tool-Calling Implementation in vLLM

November

11.14

The Impact of Disaggregated Prefill and KV Cache Storage in vLLM

December

12.05

Deep Dive into Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs

12.19

vLLM Project Update: 2024 Retrospective and 2025 Roadmap

October 30, 2024 - 2:00PM ET / 11:00AM PT

SOTA Tool-Calling Implementation in vLLM

Kyle Mistele

Head of Product and Engineering at Zelus Labs

Tool-calling is now supported in vLLM. Anyone can now self-host AI models with OpenAI-compatible tool calling using vLLM with AI models created by Nous Research, Mistral AI, and Qwen AI. Kyle Mistele was instrumental in this achievement, so join us to learn what it took and how you can use tool-calling across your use cases.

November 14, 2024 - 2:00PM ET / 11:00AM PT

The Impact of Disaggregated Prefill and KV Cache Storage in vLLM

Kuntai Du

vLLM Committer and Ph.D. Student at the University of Chicago

In this session, we'll explore how disaggregated prefill and KV cache storage enhance vLLM's inference performance, focusing on key optimizations introduced by recent updates. We'll cover vLLM's new storage architecture, which improves scalability, efficiency, and memory utilization, and dive into its impact on inference throughput and latency.

December 5, 2024 - 2:00PM ET / 11:00AM PT

Deep Dive into Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs

Lucas Wilkinson

Principal Engineer (HPC) at Neural Magic

Join us for a deep dive into Machete, the next-gen mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs. We’ll cover how Machete boosts LLM performance by leveraging memory-bound optimizations, pre-shuffling techniques, and upconversion routines to deliver up to 42% faster throughput on large models.

December 19, 2024 - 2:00PM ET / 11:00AM PT

vLLM Project Update: 2024 Retrospective and 2025 Roadmap

Michael Goin

vLLM Committer and Engineering Lead at Neural Magic

Join us for a special end-of-year vLLM office hours where we’ll reflect on the most exciting achievements of 2024 and give a sneak peek at what’s coming in 2025. Don’t miss this opportunity to look back at how vLLM has evolved and get early insights into next year’s roadmap!

Previous vLLM Office Hours Recordings

In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its different types, performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

1:4:28

YouTube Video UExYczFGSWdIUDlNdXdCRVNUR0dsdllqSnhjQnU3YkxKSC45NDk1REZENzhEMzU5MDQz

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its different types, performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/ ...

In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/