Explore Our Latest Insights
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which uses 4-bit quantized weights and 16-bit activations.
10.14.2024
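To make the w4a16 idea above concrete, here is a minimal NumPy sketch of the scheme: weights are rounded to a 4-bit integer range with a single per-tensor scale, while activations stay in 16-bit floating point. The helper names (`quantize_w4`, `mixed_input_matmul`) are illustrative, not part of Machete or any library; real kernels use per-group scales and fused dequantization on the GPU.

```python
import numpy as np

def quantize_w4(weights: np.ndarray):
    """Symmetric 4-bit quantization: map weights to integer codes in
    [-8, 7] with one per-tensor scale (a simplification; GPTQ/AWQ use
    per-group scales)."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def mixed_input_matmul(x_fp16: np.ndarray, q: np.ndarray, scale: float):
    """Dequantize the 4-bit weight codes back to fp16, then multiply
    with the fp16 activations (the 'mixed-input' GEMM, done naively)."""
    w_fp16 = q.astype(np.float16) * np.float16(scale)
    return x_fp16 @ w_fp16

# Toy usage: an 8-token batch through a 64x64 quantized weight matrix.
w = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(8, 64).astype(np.float16)
q, s = quantize_w4(w)
y = mixed_input_matmul(x, q, s)
```

The storage win is the point: each weight occupies 4 bits instead of 16, while activations keep full 16-bit precision, so accuracy loss is concentrated in the weights alone.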
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accuracy.
10.17.2024
LLM Compressor is Here: Faster Inference with vLLM
Announcing LLM Compressor: We are excited to announce LLM Compressor, a unified library for creating compressed models for faster inference with vLLM. Neural Magic's research team has successfully utilized it to create our latest compressed models.
08.14.2024