vLLM Open Office Hours

As one of the top contributors to the vLLM project, Neural Magic is excited to partner with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly open office hours. Come with questions to learn more about the vLLM project and how Neural Magic can help you bring the power of open-source LLMs and vLLM to your enterprise. All office hours are moderated by vLLM Committer and Neural Magic's Engineering Lead, Michael Goin. We strive to bring a variety of topics and speakers to each session; see below for special topics and guests.

July 25, 2024

Special Topic: Model Optimizations for Fast and Efficient vLLM Inference

Speaker: Eldar Kurtic, Sr. ML Research Engineer, Neural Magic

Date: July 25, 2024

Time: 2:00 PM ET; 11:00 AM PT


August 8, 2024

Special Topic: Multimodal Models in vLLM

Speaker: Roger Wang, vLLM Committer and Sr. ML Engineer at Roblox

Date: August 8, 2024

Time: 2:00 PM ET; 11:00 AM PT


Previous vLLM Office Hours Recordings

July 9, 2024 Session

In this session, we brought on vLLM Committers from Anyscale for a deep dive into FP8 quantization. They discussed why FP8 is important and how to get started with FP8 in vLLM, and they shared quality and performance results of FP8 quantization. We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2. You can see the session slides here.

June 20, 2024 Session

We covered what's new in vLLM v0.5.0, including FP8 weights and activations, speculative decoding, and OpenAI Vision API support. We dug deeper into various topics, including new quantization kernels, GPU architecture compatibility, embeddings in the OpenAI API, optimization tips for GPTQ configurations, and handling concurrent requests in the API server. You can see the session slides here.

June 5, 2024 Session

We started our June 5th session with a quick recap of vLLM and how Neural Magic can support enterprises today in successfully integrating vLLM as part of their AI strategy. We then answered audience questions about post-training quantization, maximizing GPU usage for 70B LLMs, differences between vLLM and Hugging Face TGI, cache management, tensor parallelism, and more. You can see the session slides here.