Explore Our Latest Insights
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which uses 4-bit quantized weights and 16-bit activations.
10.14.2024
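To make the w4a16 idea above concrete, here is a minimal NumPy sketch of the scheme: weights are rounded to a 4-bit integer range with a single per-tensor scale, while activations stay in 16-bit floating point. The helper names (`quantize_w4`, `mixed_input_matmul`) are illustrative, not part of Machete or any library; real kernels use per-group scales and fused dequantization on the GPU.

```python
import numpy as np

def quantize_w4(weights: np.ndarray):
    """Symmetric 4-bit quantization: map weights to integer codes in
    [-8, 7] with one per-tensor scale (a simplification; GPTQ/AWQ use
    per-group scales)."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def mixed_input_matmul(x_fp16: np.ndarray, q: np.ndarray, scale: float):
    """Dequantize the 4-bit weight codes back to fp16, then multiply
    with the fp16 activations (the 'mixed-input' GEMM, done naively)."""
    w_fp16 = q.astype(np.float16) * np.float16(scale)
    return x_fp16 @ w_fp16

# Toy usage: an 8-token batch through a 64x64 quantized weight matrix.
w = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(8, 64).astype(np.float16)
q, s = quantize_w4(w)
y = mixed_input_matmul(x, q, s)
```

The storage win is the point: each weight occupies 4 bits instead of 16, while activations keep full 16-bit precision, so accuracy loss is concentrated in the weights alone.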
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accuracy.
10.17.2024
LLM Compressor is Here: Faster Inference with vLLM
Announcing LLM Compressor: We are excited to announce LLM Compressor, a unified library for creating compressed models for faster inference with vLLM. Neural Magic's research team has successfully utilized it to create our latest compressed models.
08.14.2024