Join Us Every Other Week For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

Upcoming Events

December 2024

  • 12.05: Deep Dive into Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
  • 12.19: vLLM Project Update: 2024 Retrospective and 2025 Roadmap

December 5, 2024 - 2:00PM ET / 11:00AM PT

Deep Dive into Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs

Lucas Wilkinson
Principal Engineer (HPC) at Neural Magic
Join us for a deep dive into Machete, the next-gen mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs. We’ll cover how Machete boosts LLM performance by leveraging memory-bound optimizations, pre-shuffling techniques, and upconversion routines to deliver up to 42% faster throughput on large models.

December 19, 2024 - 2:00PM ET / 11:00AM PT

vLLM Project Update: 2024 Retrospective and 2025 Roadmap

Simon Mo
vLLM Project Maintainer
Join us for a special end-of-year vLLM office hours where we’ll reflect on the most exciting achievements of 2024 and give a sneak peek at what’s coming in 2025. Don’t miss this opportunity to look back at how vLLM has evolved and get early insights into next year’s roadmap!

Previous vLLM Office Hours Recordings

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM instances improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundaries of distributed inference efficiency.

The Q&A session included topics such as the practical gains of improving KV cache transmission and its impact on throughput. We explored comparisons between vLLM's implementation and other approaches like NCCL and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs of compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and challenges in combining disaggregated prefill with automatic prefix caching.

Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs

Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

In this session, we dive deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discuss the importance of tools and functions in open-source AI and provide insights into the challenges and solutions around OpenAI-style tools in vLLM.

During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from a delta stream, and specific application examples where partial visibility into JSON arguments proves advantageous. Additional questions covered plans for supporting OpenAI’s "strict" field in tool definitions for structured output, best practices for tool-calling formats in model fine-tuning, and the choice of OpenAI's Chat Completions API over the Assistants API as the standard for tool selection.

Session slides: https://docs.google.com/presentation/d/1LSEiycGVR9Cnz0FFkMrcoAzxW9bQQhp3/edit#slide=id.p1

Stay connected and join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental fullgraph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D; tool-use support for Llama 3.1+3.2 and InternLM2.5; and Reward LM support for Qwen2.5-Math-RM-72B.

During our special-topic deep dive, we were joined by Mistral AI’s research engineer, Patrick von Platen, who shared insights into Mistral’s architecture choices and how to efficiently deploy Mistral's models on vLLM.

During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.

Session slides: https://docs.google.com/presentation/d/1fF4ZlnAFXDeKHBGzkJsCeXLkarvlbNRx

Explore and join our bi-weekly vLLM office hours every other Thursday: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its different types, performance benefits in vLLM, research ideas surrounding it, and how to apply it effectively within vLLM.

Session slides: https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and a 5x reduction in latency.

Session slides: https://docs.google.com/presentation/d/1vgt63f5Jl2HHrtHbNY5m9Vpgfi2RjaKC

Join our next vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 5, 2024

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM.

During the Q&A, we tackled a variety of audience questions around hardware diversity, different quantization methods, pros and cons of using torch.compile in vLLM, deployment strategies for multiple copies of vLLM using a custom Docker entrypoint script, and more.

Session slides: https://docs.google.com/presentation/d/184uArSlJTwuoS1SOTT8jNSUE8ojJdzHh

Explore and join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline.

During the Q&A, we tackled a variety of audience questions around performance data, running vLLM on Red Hat 9 for ONNX and GGUF models, supporting LoRA and Prompt Adapters, vLLM’s roadmap for supporting FP8 KV cache, and much more.

Check out the session slides here: https://docs.google.com/presentation/d/141DSi37KlLbDIoSjODrO_6C01xrzCCKm

Join our bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer, ML Platform at Roblox, to discuss the development of support for transformer-based multimodal models in vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM.

Additionally, we touched on the vLLM v0.5.4 release, including model support for Nemotron, InternVL2, BLIP-2, H2O Danube3-4b, MiniCPM-V, and many performance improvements for throughput use cases.

Video Timestamps:
00:00 - 01:14 - Intro to Multimodal Models in vLLM
01:14 - 02:47 - About Roger Wang and the Team Building Multimodal Support in vLLM
02:47 - 04:19 - Overview of Large Multimodal Models
04:19 - 14:40 - Milestones and Examples of Multimodal Models in vLLM
14:40 - 16:52 - Supported Multimodal Models and Roadmap
16:52 - 22:57 - Multimodal Llama 3.1
22:57 - 31:56 - What's New in vLLM v0.5.4
31:56 - 47:42 - Open Q&A and vLLM Community Discussion
47:42 - 50:03 - How to Get Involved with the vLLM Community

Check out the session slides here: https://docs.google.com/presentation/d/1Uq9m17PMn8NYZOCXlNZBnHMvmZG0am8j

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how to quantize LLMs for efficient inference. He introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM.

Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features.

Check out the session slides here: https://docs.google.com/presentation/d/1BhJmAP6ma2IuboExWB3USE12bjf4f5UW

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/

vLLM and Neural Magic Office Hours - June 5, 2024

As one of the top contributors to the vLLM project, Neural Magic teams up with the vLLM team from UC Berkeley every two weeks to host open office hours. Check out our June 5, 2024 session, where we answered some great questions from participants.

We kicked off our June 5th session with a quick recap on vLLM and how Neural Magic can support enterprises today to successfully integrate vLLM as a part of their AI strategy. You'll hear answers to audience questions about post-training quantization, maximizing GPU usage for 70B LLMs, differences between vLLM and Hugging Face TGI, cache management, tensor parallelism, and more. You can see the session slides here: https://docs.google.com/presentation/d/1B50uCXzAarawDDizElNzi2o55fkgJZSm/edit#slide=id.p1

Do you have questions about vLLM that you'd like addressed directly by the experts? Join our next vLLM office hours and post your questions here: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - June 20, 2024

Happy one-year anniversary, vLLM! In this session, we covered what's new in vLLM v0.5.0, including FP8 weights and activations, speculative decoding, and OpenAI Vision API support. We dug deeper into various topics, including new quantization kernels, GPU architecture compatibility, embeddings in the OpenAI API, optimization tips for GPTQ configurations, and handling concurrent requests in the API server. For more details, you can access the session slides here: https://docs.google.com/presentation/d/1BAGbJ-aGYrAMUugReF758u5JUT9EAJLn

Sign up for bi-weekly vLLM office hours: https://neuralmagic.com/community-office-hours/

vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024

In this session, we brought on vLLM Committers from Anyscale for a deep dive into FP8 quantization. They discussed why FP8 is important and how to get started with FP8 in vLLM, and shared quality and performance results of FP8 quantization.

We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2.

For more details, check out the session slides here: https://docs.google.com/presentation/d/1rPRibjxqqJR-qV-CVq0q0Z-KStHw5aaK

Join our bi-weekly vLLM office hours to stay current with vLLM, ask questions, meet the community, and give feedback: https://neuralmagic.com/community-office-hours/