2:4 Sparse Llama: Smaller Models for Efficient GPU Inference

Nov 25, 2024

Author(s)

Eldar Kurtić

Senior Machine Learning Research Engineer

Alexandre Marques

Manager of Machine Learning Research

Mark Kurtz

CTO, Neural Magic

Dan Alistarh

Principal Research Scientist

Shubhra Pandit

Senior Machine Learning Researcher

A Sparse Summary

Sparse Foundation Model: The first sparse, highly accurate foundation model built on top of Meta’s Llama 3.1 8B with 98% recovery on Open LLM Leaderboard v1 and full recovery across fine-tuning tasks, including math, coding, and chat.
Hardware-Accelerated Sparsity: Features a 2:4 sparsity pattern designed for NVIDIA Ampere GPUs and newer, delivering up to 30% higher throughput and 20% lower latency from sparsity alone with vLLM.
Quantization Compatible: Fully integrates with advanced 4-bit quantization methods like GPTQ and efficient Sparse-Marlin inference kernels, enabling faster inference anywhere from 1.2x to 3.0x depending on the hardware and scenario.

Introducing Sparse Llama 3.1 8B

Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets. These dense architectures remain large, costly, and resource-intensive, making it challenging and expensive to scale AI. Neural Magic is doubling down on this challenge with sparse LLMs—reducing the model size by removing unneeded connections while retaining accuracy. Sparse models, though underexplored in the LLM space due to the high compute demands of pretraining, offer an increasingly promising dimension in model compression and efficiency.

Sparse-Llama-3.1-8B-2of4 is our next step in this commitment—a 50% pruned version of Meta's open-source Llama 3.1 8B. Built with a GPU-friendly 2:4 sparsity structure, it removes two of every four parameters while preserving accuracy. Designed as a versatile foundation model for fine-tuning and instruction alignment, Sparse Llama is optimized for both speed and efficiency. Its quantization-friendly architecture enables faster, cheaper inference with roughly half the connections of its dense counterpart.

Research Background

Sparse Llama 3.1 originates from years of prior research, building on previous breakthroughs with SparseGPT, SquareHead Knowledge Distillation, and Sparse Llama 2. These contributions laid the groundwork for our state-of-the-art sparse training approach, tailored to the latest generation of LLMs. Leveraging SparseGPT developed in collaboration with ISTA, we efficiently removed redundant connections, while SquareHead’s layerwise knowledge distillation and Sparse Llama 2’s foundational training recipes provided the basis for sparsity optimization.

Working with the latest LLMs requires more than applying existing techniques. These models, pushed to the edge of training scaling laws, are highly sensitive to sparsity. We iteratively refined our methods to overcome this, starting with meticulously curating publicly available datasets. By sourcing and filtering only the highest-quality and most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens—drastically cutting the environmental impact of further training while preserving performance.

This curated dataset and advancements in our pruning and sparse training recipes allowed training to converge in just 26 hours on 32 H100 GPUs, demonstrating the efficiency and scalability of our approach while delivering a model optimized for real-world deployments.

Performance Snapshot

Sparse Llama 3.1 demonstrates exceptional performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios, showcasing the versatility and efficiency of 2:4 sparsity.

Few-Shot Benchmarks: It achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1 and 97.3% recovery on the more challenging Mosaic Eval Gauntlet (Figure 1, right), maintaining competitive performance with dense models.
Fine-Tuning Results (Figure 1, left): The most exciting results emerged during fine-tuning across math (GSM8K), code (Evol-CodeAlpaca), and conversational AI (Ultrachat-200K) tasks. Despite minimal regression in the few-shot benchmarks, it achieved full accuracy recovery and, in some cases, outperformed its dense counterparts. These results underscore Sparse Llama's robustness and adaptability across domains. As a proof of concept, these fine-tuned models are additionally publicly available.
Sparse-Quantized Inference (Figure 2): Using 4-bit post-training quantization combined with 2:4 sparsity delivered impressive inference speedups with vLLM 0.6.4.post1 and minimal effects on accuracy for most cases. The sparse-quantized models achieved 3.0x speedup on A5000 and A6000 GPUs, and 2.1x on A100 GPUs in single-stream latency, with 1.1x to 1.2x of the gains attributed to sparsity alone. Throughput scenarios showed 1.2x to 1.8x improvement, even when quantization alone had minimal impact.

For a detailed breakdown of metrics and benchmarks, visit the Full Performance Details section.

Get Started

Explore the Sparse Llama base model and our fine-tuned versions today on Neural Magic’s Hugging Face organization. With open-sourced weights, evaluations, and benchmarks, we intend to empower the community to experiment and build on our foundation.

Are you looking to improve the performance of your AI deployments, reduce deployment costs, or enable better scaling? Contact us—we'd love to help you work with your LLMs!

Stay tuned for more. The future of LLMs is open source, and with sparsity, we're excited to continue pushing the boundaries of what's possible for efficient, scalable, and performant AI.

Full Performance Details

Base Model Evaluations

We evaluated Sparse Llama on two widely used benchmarks to establish its baseline performance:

Open LLM Leaderboard v1: This benchmark measures capabilities across domains such as grade school mathematics (GSM8K), world knowledge (MMLU, ARC-Challenge), and language understanding (WinoGrande, HellaSwag). Sparse Llama achieved 98.4% accuracy recovery, performing nearly on par with its dense counterpart, as shown in Table 1.

Mosaic Eval Gauntlet v0.3: A comprehensive benchmark covering reasoning, problem-solving, and reading comprehension tasks. Sparse Llama demonstrated robust performance, achieving 97.3% accuracy recovery, even on more challenging datasets, as reported in Table 2.

Sparse Fine-Tuning Evaluations

To evaluate Sparse Llama’s adaptability, we fine-tuned both sparse and dense versions across three domains using the same amount of hyperparameter tuning:

Mathematical Reasoning (GSM8K): Strict-match accuracy in 0-shot setting via lm-evaluation-harness.
Coding (Evol-CodeAlpaca): Pass@1 on HumanEval and HumanEval+ via EvalPlus.
Conversational AI (Ultrachat-200K): Win rate on AlpacaEval, following the setup of Sparse Llama 2.

Sparse Llama consistently achieved full accuracy recovery during fine-tuning and even outperformed the dense baseline on some tasks, demonstrating its versatility. The results are detailed in Table 3.

Quantization and Sparsity

To further optimize Sparse Llama, we applied 4-bit post-training quantization using the W4A16 scheme. This approach preserves the 2:4 sparsity pattern, where weights are quantized to 4-bit integers, and activations remain at 16-bit precision.

Sparse quantized versions maintained accuracy with minimal degradation compared to unquantized models while enabling significant compression and inference performance gains. See Table 4 for detailed accuracy comparisons across tasks.

Inference

Sparse Llama’s inference performance was benchmarked using vLLM, the high-performance inference engine, and compared to dense variants across several real-world use cases—ranging from code completion to large summarization—using version 0.6.4.post1, as detailed in Table 5.

Single-Stream Deployments

In single-stream scenarios, combining sparsity and quantization resulted in significant latency reductions ranging from 2.1x to 3.0x faster inference than dense, 16-bit models, with 1.1x to 1.2x speedups from sparsity alone. Table 6 provides full results across the various use cases.

Asynchronous Deployments

In multi-query scenarios, we highlight the gains focusing on maximum throughput for the various scenarios to compare Sparse Llama with its dense counterparts. Sparse Llama achieved 1.2x to 1.8x speedup in maximum query rates compared to the dense, 16-bit model, while quantization alone offered only minimal improvements and, in some cases, performed worse. Table 7 provides full results across the various use cases.

Empowering AI Efficiency

Sparse Llama 3.1 represents a new step for scalable and efficient LLMs through SOTA sparsity and quantization techniques. From few-shot evaluations to fine-tuning performance and real-world inference benchmarks, Sparse Llama delivers substantial improvements in inference performance, making advanced AI more accessible.

We are excited to see how the community builds on this foundation. Whether you’re optimizing deployments, exploring model compression, or scaling AI to new heights, Sparse Llama offers a compelling path forward. Together, let’s shape the future of efficient AI.

Note: This blog has been updated on 12/23/24 with corrections to inference performance numbers.

Was this article helpful?

YesNo