Join Us Every Other Thursday For

vLLM Office Hours

As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments that can accelerate your inference. A typical office hours agenda:

  • 20-minute vLLM update
  • 20-minute special guest topic; see below for details 👇
  • 20-minute open discussion, feedback loop, and Q&A

September 19, 2024

Guest Topic: Advanced Techniques for Maximizing vLLM Performance

Guest Speaker: Robert Shaw, vLLM Committer and Sr. Director of Engineering at Neural Magic

Date: Thursday, September 19, 2024

Time: 2:00 PM EDT / 11:00 AM PDT

Sign Up


October 3, 2024

Guest Topic: Speculative Decoding in vLLM

Guest Speaker: Lily Liu, vLLM Committer and PhD Student at UC Berkeley

Date: Thursday, October 3, 2024

Time: 2:00 PM EDT / 11:00 AM PDT

Sign Up


Previous vLLM Office Hours Recordings

September 5, 2024 Session

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x increase in throughput and a 5x reduction in latency. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM. During the Q&A, we tackled a variety of audience questions around hardware diversity, different quantization methods, the pros and cons of using torch.compile in vLLM, deployment strategies for running multiple copies of vLLM with a custom Docker entrypoint script, and more. You can see the session slides here.
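For context on what the CUTLASS-backed kernels look like from the user side, here is a minimal sketch of running a pre-quantized W8A8 (INT8) checkpoint with vLLM's offline API. The checkpoint name is illustrative; any compressed-tensors INT8 model would exercise the same code path:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: vLLM reads the quantization scheme from the checkpoint's
# config and dispatches to the CUTLASS INT8 kernels on supported GPUs.
# The model name below is illustrative.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain INT8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```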

August 21, 2024 Session

In this session, we were joined by Woosuk Kwon, co-creator of vLLM, for a deep dive into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline. During the Q&A, we tackled a variety of audience questions around performance data, running vLLM on Red Hat 9 for ONNX and GGUF models, support for LoRA and Prompt Adapters, vLLM's roadmap for FP8 KV cache support, and much more. You can see the session slides here.

August 8, 2024 Session

In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer on the ML Platform team at Roblox, to discuss the work behind supporting transformer-based multimodal models in vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM. Additionally, we touched on the vLLM v0.5.4 release, including model support for Nemotron, InternVL2, BLIP-2, H2O Danube3-4b, MiniCPM-V, and many performance improvements for throughput use cases. You can see the session slides here.
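As a rough illustration of the vision-language workflow discussed in the session, here is a minimal sketch using vLLM's multi_modal_data input. The model name, image path, and prompt template are illustrative; each model defines its own image placeholder tokens:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Minimal sketch: single-image inference with a vision-language model.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")  # illustrative local image

# LLaVA-1.5 uses an "<image>" placeholder in its prompt template.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```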

July 25, 2024 Session

In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar covered the why, when, and how of quantizing LLMs for efficient inference, and introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM. Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support for Llama 3.1, Mistral-Nemo, and Chameleon. We also provided an update on AWQ Marlin and CPU offloading features. You can see the session slides here.
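For a flavor of what llm-compressor usage looks like, here is a minimal one-shot FP8 quantization sketch; the exact API surface may vary between releases, so treat the names and arguments as illustrative:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Minimal sketch: quantize all Linear layers to FP8 with dynamic activation
# scales, keeping the output head in higher precision for accuracy.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base model
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",
)
```

The saved checkpoint can then be loaded directly by vLLM, which picks up the quantization config from the model files.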

July 9, 2024 Session

In this session, we brought on vLLM Committers from Anyscale for a deep dive into FP8 quantization. They discussed why FP8 matters, how to get started with FP8 in vLLM, and shared quality and performance results for FP8 quantization. We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2. You can see the session slides here.
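If you want to try FP8 without producing a quantized checkpoint first, vLLM can also quantize weights to FP8 at load time. A minimal sketch follows; the model name is illustrative, and native FP8 compute requires a recent GPU such as Hopper or Ada Lovelace:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: dynamic FP8 quantization at load time. Weights are cast
# to FP8 on the fly, roughly halving the memory footprint of the weights.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
outputs = llm.generate(
    ["Summarize FP8 in one sentence."],
    SamplingParams(max_tokens=48),
)
print(outputs[0].outputs[0].text)
```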

June 20, 2024 Session

We covered what's new in vLLM v0.5.0, including FP8 weights and activations, speculative decoding, and OpenAI Vision API support. We dug deeper into various topics, including new quantization kernels, GPU architecture compatibility, embeddings in the OpenAI API, optimization tips for GPTQ configurations, and handling concurrent requests in the API server. You can see the session slides here.
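As a sketch of the OpenAI Vision API support mentioned above, here is how a client might query a vision-capable model served by vLLM's OpenAI-compatible server. The server URL, model name, and image URL are illustrative:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-1.5-7b-hf
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.jpg"}},  # illustrative URL
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```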

June 5, 2024 Session

We started our June 5th session with a quick recap of vLLM and how Neural Magic can help enterprises successfully integrate vLLM into their AI strategy today. You'll hear answers to audience questions about post-training quantization, maximizing GPU usage for 70B LLMs, differences between vLLM and Hugging Face TGI, cache management, tensor parallelism, and more. You can see the session slides here.
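On the tensor parallelism question, the short answer is that vLLM shards a model across GPUs with a single argument. A minimal sketch for a 70B model on four GPUs (model name and GPU count are illustrative):

```python
from vllm import LLM

# Minimal sketch: shard a 70B model across 4 GPUs with tensor parallelism.
# gpu_memory_utilization sets the fraction of each GPU's memory vLLM may
# reserve for weights and KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)
```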