Join Us Every Other Week For
vLLM Office Hours
As a leading contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Typical office hours agenda:
- 20-minute vLLM update
- 20-minute special guest topic; see below for details 👇
- 20-minute open discussion, feedback loop, and Q&A
vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024
In this session of our bi-weekly vLLM office hours, we explored how disaggregated prefill and KV cache storage in vLLM can enhance distributed inference. We discussed the initial PR on disaggregated prefill and how sharing KV caches across vLLM instances improves performance through faster KV cache delivery and the composition of multiple KV caches. These advancements aim to push the boundaries of distributed inference efficiency.
The Q&A session covered topics such as the practical gains from improving KV cache transmission and its impact on throughput. We compared vLLM's KV cache transfer implementation with approaches built directly on NCCL, and addressed questions on KV cache buffer reuse, hardware configurations, and the trade-offs between compression and memory allocation. Other Q&A highlights included the influence of disaggregation on selective prefill logic, the potential for semantic caching improvements, and the challenges of combining disaggregated prefill with automatic prefix caching.
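To make the setup concrete, here is a minimal sketch of what a disaggregated prefill pair can look like with vLLM's offline API. It loosely follows the experimental example that accompanied the initial disaggregated prefill PR; the `KVTransferConfig` fields, the `PyNcclConnector` name, and the `kv_producer`/`kv_consumer` roles are assumptions based on that experimental code and may differ in other vLLM versions. In practice the two roles run as separate processes on separate GPUs.

```python
# Sketch: run one side of a disaggregated prefill pair.
# Usage (two separate processes/GPUs):
#   python disagg_sketch.py producer   # prefill node, pushes KV caches
#   python disagg_sketch.py consumer   # decode node, receives KV caches
import sys

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # experimental at the time of this session

role = sys.argv[1]  # "producer" or "consumer"

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model, any vLLM-supported model works
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",         # NCCL-based KV cache transport backend
        kv_role=f"kv_{role}",                   # kv_producer (prefill) or kv_consumer (decode)
        kv_rank=0 if role == "producer" else 1,
        kv_parallel_size=2,                     # one producer + one consumer
    ),
)

prompts = ["Explain disaggregated prefill in one sentence."]

if role == "producer":
    # Prefill-only: generate a single token so the KV cache is built
    # and shipped to the consumer instead of being decoded locally.
    llm.generate(prompts, SamplingParams(max_tokens=1))
else:
    # Decode reuses the received KV cache rather than recomputing prefill.
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
    print(outputs[0].outputs[0].text)
```

The connector abstraction is what makes the "faster delivery" discussion above possible: the KV cache transport can be swapped or tuned independently of the prefill and decode engines.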
Session slides: https://docs.google.com/presentation/d/18nDT1InJAfTvotv5bVAPWuGJFglJTsDs
Join our bi-weekly vLLM Office Hours to learn about the latest updates: https://hubs.li/Q02Y5Pbh0