Welcome to Neural Magic's monthly vLLM newsletter! We are excited to announce that we have agreed to be acquired by Red Hat. Joining forces with the industry's open source leader will enable us to bring our cutting-edge AI model optimization and accelerated inference technology to a worldwide audience of enterprises adopting open LLM capabilities.
Keep scrolling for exciting vLLM updates and opportunities to engage with the community!
Bi-Weekly vLLM Office Hours
Recent Recordings
vLLM Project Update: 2024 Retrospective and 2025 Roadmap | Watch Now
Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs | Watch Now
Disaggregated Prefill and KV Cache Storage in vLLM | Watch Now
SOTA Tool-Calling Implementation in vLLM | Watch Now
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets.
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference.
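To make the idea concrete, here is a minimal NumPy sketch of symmetric round-to-nearest INT8 weight quantization, one of the simplest schemes in the family the post evaluates. The function names and the per-tensor scaling choice are illustrative, not taken from any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Per-tensor scale maps the largest magnitude onto the int8 range [-127, 127].
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Round-to-nearest bounds the reconstruction error by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing `q` instead of `w` cuts weight memory by 4x versus fp32 (2x versus fp16), which is where the cost and latency savings come from.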
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks.
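The NumPy sketch below illustrates the mixed-input idea only: weights are quantized to 4-bit integers with per-output-channel scales, while activations stay in fp16, and the weights are upconverted before the matmul. A real kernel like Machete fuses that upconversion into the GEMM itself; the helper names here are hypothetical.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    # Per-output-channel scale maps magnitudes onto the int4 range [-7, 7].
    scale = np.abs(w).max(axis=0) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # packed as 4-bit on GPU
    return q, scale

def mixed_input_matmul(x_fp16: np.ndarray, w_q: np.ndarray, scale: np.ndarray):
    # Upconvert int4 weights to fp16, apply scales, then run a standard GEMM.
    w_fp16 = w_q.astype(np.float16) * scale.astype(np.float16)
    return x_fp16 @ w_fp16

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4)).astype(np.float32)   # full-precision weights
x = rng.standard_normal((2, 8)).astype(np.float16)   # fp16 activations
q, s = quantize_int4(w)
y = mixed_input_matmul(x, q, s)
y_ref = x.astype(np.float32) @ w
# Quantization error stays small relative to the exact fp32 product.
assert np.max(np.abs(y.astype(np.float32) - y_ref)) < 2.0
```

Keeping activations in fp16 preserves accuracy on the dynamic-range-sensitive side, while the 4-bit weights shrink memory traffic, which is what makes the fused upconversion worth optimizing.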