Explore Our Latest Insights
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
A Sparse Summary: Introducing Sparse Llama 3.1 8B. Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing... (A brief sketch of the 2:4 sparsity pattern follows this entry.)
11.25.2024
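For readers unfamiliar with the 2:4 pattern named in the title above, the sketch below shows one common way to impose it with magnitude pruning: in every contiguous group of four weights, the two smallest-magnitude values are zeroed. The NumPy helper `prune_2_of_4` is illustrative only and is not Neural Magic's pruning code.

```python
# Illustrative magnitude-based 2:4 structured pruning (not the method from the post).
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4.

    `weights` is a 2-D matrix whose last dimension is divisible by 4, as the
    2:4 sparsity pattern accelerated on NVIDIA GPUs requires.
    """
    rows, cols = weights.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)              # (rows, groups, 4)
    order = np.argsort(np.abs(groups), axis=-1)               # ascending by magnitude
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., :2], False, axis=-1)   # drop the 2 smallest
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    w = np.random.randn(2, 8).astype(np.float32)
    sparse_w = prune_2_of_4(w)
    print(sparse_w)  # exactly 2 of every 4 consecutive weights are zero
```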
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accuracy... (A minimal quantize-and-dequantize sketch follows this entry.)
10.17.2024
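As a quick illustration of the quantization discussed above, the following sketch performs a symmetric per-tensor int8 round trip and reports the reconstruction error. The helper names are hypothetical; this is not the evaluation harness used in the post.

```python
# Illustrative symmetric per-tensor int8 quantization round trip.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4096).astype(np.float32)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max abs error:", np.abs(w - w_hat).max())  # small but nonzero
```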
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which uses 4-bit quantized weights and 16-bit activations... (An illustrative w4a16 sketch follows this entry.)
10.14.2024
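To make the w4a16 scheme concrete, the sketch below emulates a mixed-input GEMM in NumPy: weights are quantized to 4-bit integers with per-column scales and zero points, dequantized to fp16, and multiplied with fp16 activations. Machete fuses this dequantization into a Hopper GPU kernel; the code here only demonstrates the arithmetic, and all function names are illustrative.

```python
# Illustrative emulation of a w4a16 mixed-input matmul (dequantize, then GEMM in fp16).
import numpy as np

def quantize_w4(w: np.ndarray):
    """Per-column asymmetric 4-bit quantization (codes in 0..15)."""
    w_min, w_max = w.min(axis=0), w.max(axis=0)
    scale = (w_max - w_min) / 15.0
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.float16)

def w4a16_matmul(x_fp16: np.ndarray, q, scale, zero) -> np.ndarray:
    """Dequantize int4 weight codes to fp16, then multiply with fp16 activations."""
    w_fp16 = (q.astype(np.float16) - zero) * scale
    return x_fp16 @ w_fp16

if __name__ == "__main__":
    w = np.random.randn(64, 32).astype(np.float32)   # full-precision weights
    x = np.random.randn(8, 64).astype(np.float16)    # fp16 activations
    q, scale, zero = quantize_w4(w)
    y = w4a16_matmul(x, q, scale, zero)
    print(y.shape, y.dtype)  # (8, 32) float16
```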