Optimizing vLLM for DeepSeek-R1
Introduction: DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and…
Quantized DeepSeek-R1 Models: Deployment-Ready Reasoning Models
The 4-bit Breakdown. Quantized Reasoning Models: In recent research, including "We Ran Over Half a Million Evaluations on Quantized LLMs" and "How Well Do Quantized Models Handle Long-Context Tasks?", we’ve shown that quantized large language models (LLMs) rival their full-precision counterparts in accuracy across diverse benchmarks, covering academic, real-world, and long-context evaluations, while…
Driving Enhanced Support for Multimodal LLMs With vLLM V1
This blog recaps the February 6th vLLM Office Hours, where our host Michael Goin was joined by Roger Wang, a vLLM committer from Roblox, to discuss the new multimodal capabilities in vLLM V1.
vLLM V1: Accelerating Multimodal Inference for Large Language Models: In the AI space, efficient inference isn’t just about speed; it’s about flexibility,…
Multimodal Model Quantization Support Through LLM Compressor
The Compressed Summary. Productized Model Compression: LLM Compressor is an open-source library that productizes the latest research in model compression, enabling generation of compressed models with minimal effort. The LLM Compressor framework allows users to apply state-of-the-art quantization, sparsity, and general compression techniques to improve generative AI models' efficiency, scalability, and performance…
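As a rough illustration of the workflow LLM Compressor productizes, the sketch below applies one-shot W4A16 (GPTQ-style) weight quantization to a model. The model name, calibration dataset, and settings are placeholders, and the exact import paths and arguments may differ between library versions.

```python
# Minimal sketch of one-shot weight quantization with LLM Compressor.
# Model, dataset, and calibration settings are illustrative placeholders;
# import paths may vary between llmcompressor versions.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize Linear layers to 4-bit weights (W4A16), leaving lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    dataset="open_platypus",                    # placeholder calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint is saved in a compressed format that vLLM can load directly for inference.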
Distributed Inference With vLLM
Introduction: This blog recaps our January 23rd vLLM Office Hours, where our host Michael Goin was joined by Murali Andoorveedu, a vLLM committer from CentML, to discuss how distributed inference works within vLLM. Murali’s company, CentML, simplifies model deployment with open-source compilers, profiling tools, and benchmarking capabilities to enhance efficiency. Now part of Red Hat,…
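To make the distributed-inference topic concrete, here is a minimal sketch of tensor-parallel inference using vLLM's offline API; the model name and GPU count are placeholders chosen purely for illustration.

```python
# Minimal sketch of tensor-parallel inference with vLLM's offline API.
# Model name and GPU counts are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,                     # shard weights across 4 GPUs on one node
    # pipeline_parallel_size=2,                 # optionally add pipeline parallelism across nodes
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The same parallelism settings are available when serving an OpenAI-compatible endpoint with the `vllm serve` CLI.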
Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM
A Compressed Summary. Introduction: The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. Kudos to Neural Magic’s team at Red Hat for their hard work, specifically Lucas Wilkinson, Tyler Smith, Robert Shaw, and Michael Goin. These improvements…
How Well Do Quantized Models Handle Long-Context Tasks?
The 4-bit Summary. Pushing the Limits of Accurate Quantization: In our recent research blog, "We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found," we demonstrated that quantized large language models (LLMs) can rival their full-precision counterparts in accuracy across diverse benchmarks, covering academic and real-world evaluations. However, the community raised…
Introducing Compressed Granite 3.1: Powerful Performance in a Small Package
A Compressed Summary. Smaller, Faster Granite for All: Neural Magic is excited to join Red Hat, combining our expertise in AI optimization and inference with Red Hat’s legacy of open-source innovation. Together, we’re paving the way for more efficient, scalable, and accessible AI solutions tailored to the needs of developers and enterprises across the hybrid…
vLLM V1 Alpha: A Major Upgrade to vLLM's Core Architecture
Originally posted at: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves state-of-the-art performance and…
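At the time of the alpha, V1 was an opt-in that could be enabled by setting the VLLM_USE_V1 environment variable without changing existing code. A minimal sketch of trying it out, with an illustrative model name, might look like this:

```python
# Sketch: opting into the V1 engine during the alpha release.
# The alpha exposed V1 behind the VLLM_USE_V1 environment variable;
# the existing LLM API stays the same. Model name is a placeholder.
import os

os.environ["VLLM_USE_V1"] = "1"  # enable the V1 engine (alpha opt-in flag)

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
out = llm.generate(["Hello, V1!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```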
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
A Sparse Summary. Introducing Sparse Llama 3.1 8B: Large language models (LLMs) are approaching the limits of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets. These dense architectures remain large, costly, and resource-intensive, making it challenging…