Explore Our Latest Insights

Introducing Compressed Granite 3.1: Powerful Performance in a Small Package
A Compressed Summary: Smaller, Faster Granite for All. Neural Magic is excited to join Red Hat, combining our expertise in AI optimization and inference with Red Hat’s legacy of open-source innovation. Together, we’re paving the way for more effici...
01.30.2025

2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
A Sparse Summary: Introducing Sparse Llama 3.1 8B. Large language models (LLMs) are approaching the limits of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squ...
11.25.2024
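
For context on the 2:4 pattern this post refers to: in every contiguous group of four weights, two are zero, a constraint that Ampere-class GPUs can exploit with sparse Tensor Cores. The sketch below is a minimal, one-shot magnitude-pruning illustration of that constraint in PyTorch; it is not the Sparse Llama training recipe, and the `prune_2_4` helper is a hypothetical name.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in each contiguous group of 4."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # Keep the 2 largest-magnitude slots in each group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_24 = prune_2_4(w)
# Every group of 4 now has at most 2 nonzeros: the 2:4 pattern.
assert (w_24.reshape(8, -1, 4) != 0).sum(dim=-1).max() <= 2
```

One-shot pruning like this typically costs accuracy, which is why recipes such as Sparse Llama's pair the sparsity pattern with retraining.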

We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accura...
10.17.2024
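
As background for the result being summarized, this is what a symmetric per-tensor int8 weight round-trip looks like: the accuracy question is exactly how much the `w - w_hat` error matters downstream. A toy sketch, not the evaluation harness from the post; `quantize_int8` is an illustrative helper.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: one fp scale for the whole tensor."""
    scale = w.abs().max() / 127.0
    q = (w / scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
w_hat = q.to(torch.float32) * scale            # dequantize
print("max abs error:", (w - w_hat).abs().max().item())
print("bytes: fp32", w.numel() * 4, "-> int8", q.numel())
```

The 4x memory saving is immediate; whether the rounding error shows up in benchmark scores is what large-scale evaluations like these are designed to answer.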

Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which uses 4-bit quantized weights and 16-bit activat...
10.14.2024
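
To make the w4a16 arithmetic concrete, here is a plain NumPy reference of what a mixed-input GEMM computes: unsigned 4-bit weights are unpacked, shifted by a zero point, scaled per output column, and multiplied with 16-bit activations. A fused kernel like Machete performs the unpacking and dequantization inside the GEMM loop rather than materializing `w`; the helper names and the per-column scale/zero-point layout here are illustrative assumptions.

```python
import numpy as np

def unpack_uint4(packed: np.ndarray) -> np.ndarray:
    """Unpack two unsigned 4-bit values from each uint8 byte (low nibble first)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(packed.shape[0], -1)

def w4a16_gemm(x_fp16, packed_w, scales, zeros):
    # Dequantize: (uint4 - zero_point) * per-column scale, in fp16.
    w = (unpack_uint4(packed_w).astype(np.float16) - zeros) * scales
    return x_fp16 @ w                      # the multiply runs on 16-bit values

k, n = 8, 6                                 # tiny shapes for illustration
packed = np.random.randint(0, 256, size=(k, n // 2), dtype=np.uint8)
scales = np.full(n, 0.1, dtype=np.float16)  # one scale per output column
zeros = np.full(n, 8, dtype=np.float16)     # midpoint zero for unsigned 4-bit
x = np.random.randn(2, k).astype(np.float16)
print(w4a16_gemm(x, packed, scales, zeros).shape)   # (2, 6)
```

The reference makes the memory story visible: weights move through the memory hierarchy at 4 bits each and are only widened to 16-bit at the point of the multiply.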