Join Us Every Other Week For
vLLM Office Hours
As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference.
Typical office hours agenda:
- 20-minute vLLM update
- 20-minute special guest topic; see below for details 👇
- 20-minute open discussion, feedback loop, and Q&A
vLLM Office Hours - Distributed Inference with vLLM - January 23, 2025
In this session, we explored the motivation for distributed inference, delving into vLLM architecture and GPU parallelism to enhance performance. We discussed the challenges of serving large models, introduced the concept of tensor parallelism, and examined the benefits and trade-offs of leveraging multiple GPUs for inference. We also highlighted profiling tools for analyzing kernel performance and overhead, along with the potential challenges of adopting a disaggregated approach with separate nodes for prefill and decoding.
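For readers new to tensor parallelism, the toy sketch below may help make the idea concrete. It is not vLLM code and is not taken from the session: plain Python lists stand in for GPU tensors, and the helper names (`matmul`, `split_columns`, `column_parallel_linear`) are hypothetical. It assumes the simplest case, where a linear layer's weight matrix is split column-wise so that each "GPU" computes one slice of the output and the slices are then gathered.

```python
# Toy illustration of tensor (column) parallelism for a single linear layer.
# Plain Python lists stand in for GPU tensors; each shard plays the role of one GPU.
# All names here are hypothetical and exist only for this sketch.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, num_shards):
    """Split the weight matrix column-wise into num_shards equal slices."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]

def column_parallel_linear(x, w, num_shards):
    """Each shard computes x @ w_shard; joining the column slices mimics an all-gather."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]                      # two input rows, hidden size 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # 2 x 4 weight matrix

# The sharded result matches the unsharded matrix product.
assert column_parallel_linear(x, w, num_shards=2) == matmul(x, w)
```

Each shard holds only a fraction of the weights, which is what lets a model too large for one GPU be served across several; the cost is the extra communication step needed to gather the partial outputs. In vLLM itself, the degree of tensor parallelism is selected with the `tensor_parallel_size` option rather than by hand-written code like this.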
During the open discussion, we addressed various community questions, including practical applications of tensor parallelism in real-world scenarios, the impact of distributed inference on latency and throughput, and strategies for optimizing multi-GPU setups.
Session slides: https://docs.google.com/presentation/d/10o1olgyQ3UH1AMQ_uln7ptXNahZRFdhZ/
Join our bi-weekly vLLM Office Hours to learn about the latest features and updates: https://hubs.li/Q02Y5Pbh0 ...