Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs

Mixed-input quantization is a technique that processes weights and activations at different precisions in a neural network. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which pairs 4-bit quantized weights with 16-bit activations (float16 or bfloat16). This approach primarily aims to reduce the GPU memory required for model execution. In most Large Language Model…
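As a rough sketch of what w4a16 inference looks like in practice, the snippet below loads a 4-bit-weight checkpoint with vLLM, which keeps activations in 16-bit and, on Hopper GPUs, can route the mixed-input GEMMs to kernels like Machete. The model ID is a placeholder, not a real published checkpoint.

```python
from vllm import LLM, SamplingParams

# Load a checkpoint with 4-bit weights (w4a16). vLLM reads the
# quantization scheme from the checkpoint config, keeps activations
# in float16/bfloat16, and dequantizes weights inside the GEMM.
# Placeholder model ID: substitute any GPTQ- or AWQ-quantized model.
llm = LLM(model="your-org/Llama-3.1-8B-Instruct-w4a16")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does w4a16 quantization do?"], params)
print(outputs[0].outputs[0].text)
```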
LLM Compressor is Here: Faster Inference with vLLM

We are excited to announce LLM Compressor, a unified library for creating compressed models for faster inference with vLLM. Neural Magic's research team has already used it to create our latest compressed models, including fully quantized and accurate versions of Llama 3.1, and with that, we are excited to open up the…
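For a sense of the workflow, here is a minimal sketch using LLM Compressor's one-shot quantization API to produce a w4a16 model. The model and dataset names are illustrative, and import paths and argument names may differ across library versions.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Recipe: GPTQ-quantize all Linear layers to 4-bit weights with
# 16-bit activations (w4a16), leaving the output head untouched.
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# One-shot (post-training) quantization over a small calibration set;
# the result is saved in a vLLM-loadable format.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting directory can then be passed as the model argument to vLLM, as in the earlier snippet.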
vLLM Brings FP8 Inference to the Open-Source Community

vLLM, a leading open-source LLM serving engine, has taken a significant step forward in its 0.5 release by adding FP8 quantization support on NVIDIA GPUs. This format promises to dramatically improve LLM serving efficiency without sacrificing model quality. The implementation of FP8 support is the result of…
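As an illustration, the snippet below asks vLLM to quantize a standard 16-bit checkpoint to FP8 at load time via the quantization="fp8" option. The model ID is illustrative, and FP8 execution assumes a GPU with hardware FP8 support (e.g., Hopper or Ada Lovelace).

```python
from vllm import LLM, SamplingParams

# Quantize the weights of a 16-bit model to FP8 on the fly at load
# time; activation scales are computed dynamically, so no calibration
# data is needed. Model ID is illustrative.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

outputs = llm.generate(
    ["Summarize FP8 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=48),
)
print(outputs[0].outputs[0].text)
```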