LLM Compressor is Here: Faster Inference with vLLM

We are excited to announce LLM Compressor, a unified library for creating compressed models for faster inference with vLLM. Neural Magic's research team has already used it to create our latest compressed models, including fully quantized yet accurate versions of Llama 3.1, and with that, we are excited to open up the…
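The full post walks through one-shot quantization. As a minimal sketch, assuming the module paths and the FP8_DYNAMIC scheme name from the initial llm-compressor release (they may have changed in later versions), producing an FP8 Llama 3.1 checkpoint looks roughly like this:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the model in its original precision.
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# One-shot recipe: quantize every Linear layer to FP8 with dynamic
# activation scales, leaving the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
model.save_pretrained("Meta-Llama-3.1-8B-Instruct-FP8-Dynamic", save_compressed=True)
```

The saved directory can then be passed straight to vLLM as a model path.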

vLLM Brings FP8 Inference to the Open-Source Community

vLLM, a leading open-source LLM serving engine, now supports FP8 on NVIDIA GPUs: its recent 0.5 release incorporates FP8 quantization support. This cutting-edge format promises to dramatically improve LLM deployment efficiency without sacrificing model quality. The implementation of FP8 support is the result of…
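In vLLM 0.5, FP8 can be enabled with the `quantization="fp8"` flag, which quantizes weights at load time (offline-quantized checkpoints are also supported). A short sketch, with an illustrative model choice:

```python
from vllm import LLM, SamplingParams

# quantization="fp8" asks vLLM (0.5+) to quantize the weights to FP8
# on the fly; FP8 checkpoints produced offline also load directly.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

outputs = llm.generate(
    ["What does FP8 change about LLM serving?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```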

Neural Magic Product Release Update - Q1 2024

Major product news: Neural Magic announces GPU support for LLM inference! Over the past several months, our team has been focused on expanding our capabilities to enable LLM inference on GPUs. A few weeks ago, we announced nm-vllm, our fork of vLLM focused on incorporating the latest LLM optimizations like…
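For flavor, here is a rough sketch of running a weight-sparse model with the nm-vllm fork (`pip install nm-vllm`). The `sparsity` argument and the checkpoint stub are assumptions drawn from its early documentation, not a definitive API reference:

```python
from vllm import LLM, SamplingParams  # nm-vllm reuses the vLLM entry points

llm = LLM(
    model="nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",  # illustrative sparse checkpoint
    sparsity="sparse_w16a16",  # exploit ~50% weight sparsity at inference
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```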

Bringing the Neural Magic to GPUs

Announcing community support for GPU inference serving. Over the past five years, Neural Magic has focused on accelerating inference of deep learning models on CPUs. To achieve this, we did two things: … Many of the techniques we used to make CPU inference more efficient can also help GPUs in their processing of LLMs…

Neural Magic 1.6 Product Release

For the last several months, we've been quite busy building out features across our libraries to enable large language model (LLM) inference on CPUs. We upgraded SparseML to support LLMs and generative models through transformers training, sparsification, and export pipelines. DeepSparse, Neural Magic's inference runtime, has also been enhanced for performant LLM inference…
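As a quick sketch of the DeepSparse side, text generation runs through the `TextGeneration` pipeline; the SparseZoo stub below is illustrative, and any exported text-generation model path works in its place:

```python
from deepsparse import TextGeneration

# Pruned + quantized MPT-7B from the SparseZoo (stub is illustrative).
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

result = pipeline(prompt="Summarize weight sparsity in one sentence.", max_new_tokens=64)
print(result.generations[0].text)
```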

Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse

This year has been an exceptionally exciting one for open-source large language models (LLMs). Just 11 months ago, proprietary models like GPT-3 were the only reasonable choice for companies building generative AI applications. Now, there is a thriving ecosystem of high-quality open-source models, like Meta's Llama family. In February, Meta released the…

Run a Medical Chatbot on CPUs With Sparse LLMs and DeepSparse

The AI space is abuzz with large language models (LLMs), but using them locally is a challenge due to their enormous size. Organizations that want to use these models for applications such as question answering must either invest in expensive cloud infrastructure or use closed-source models. By using closed-source models, companies also give up their…

Navigating the Nuances of Text Generation: How to Control LLM Outputs With DeepSparse

In the burgeoning field of AI, large language models (LLMs) dominate the headlines, powering applications ranging from writing assistance to conversational AI. The popularity of these models is driven by their ability to generate text that is not only coherent but also contextually relevant. Default LLM inference pipelines operate by choosing the next…
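The knobs the post is about are the usual decoding controls: greedy decoding versus sampling shaped by temperature, top-k, and top-p. A hedged sketch with DeepSparse's `TextGeneration` pipeline, assuming the generation kwargs mirror Hugging Face's GenerationConfig names:

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# Greedy decoding: always take the single most probable next token.
deterministic = pipeline(prompt="The future of AI is", max_new_tokens=32)

# Sampling: temperature, top_k, and top_p reshape the next-token
# distribution before a token is drawn (kwarg names are assumptions
# mirroring Hugging Face's GenerationConfig).
creative = pipeline(
    prompt="The future of AI is",
    do_sample=True,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    max_new_tokens=32,
)
print(deterministic.generations[0].text)
print(creative.generations[0].text)
```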

Integrating DeepSparse With OpenAI’s API for Fast Local LLMs

Since OpenAI's introduction of ChatGPT, developers have widely embraced the OpenAI API as the go-to way to make requests to language models. In response to the growing demand within open-source communities for more accessible and cost-effective alternatives, users have started to explore integrating DeepSparse with OpenAI's API…
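The appeal is that existing OpenAI client code keeps working against a local server. A client-side sketch under stated assumptions: the DeepSparse server is already running with its OpenAI-compatible integration, and the port, route, and model id below are illustrative rather than documented defaults:

```python
from openai import OpenAI

# Point the OpenAI SDK at the local DeepSparse server instead of api.openai.com.
# Port, /v1 route, and model id are assumptions; check the DeepSparse docs.
client = OpenAI(base_url="http://localhost:5543/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf:neuralmagic/mpt-7b-chat-pruned50-quant",  # illustrative model id
    messages=[{"role": "user", "content": "Why run LLMs locally on CPUs?"}],
)
print(response.choices[0].message.content)
```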

Building Sparse LLM Applications on CPUs With LangChain and DeepSparse

LangChain is one of the most exciting tools in generative AI, with many interesting design paradigms for building large language model (LLM) applications. However, developers who use LangChain have had to choose between expensive APIs and cumbersome GPUs to power the LLMs in their chains. With Neural Magic, developers can accelerate their models on CPU hardware to…
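A minimal sketch of the combination, using LangChain's DeepSparse LLM wrapper (in recent releases it lives in langchain-community; older releases import it from langchain.llms) with an illustrative SparseZoo stub:

```python
from langchain_community.llms import DeepSparse
from langchain_core.prompts import PromptTemplate

# CPU-backed LLM for the chain; the model stub is illustrative.
llm = DeepSparse(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# Drop it into an ordinary chain, as with any other LangChain LLM.
prompt = PromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm
print(chain.invoke({"question": "What is weight pruning?"}))
```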