SparseGPT: Remove 100 Billion Parameters for Free

Mar 21, 2023



Large language models (LLMs) solve natural language processing problems with astounding accuracy. However, these models are enormous and require a lot of space, cost, and computation power to deploy. For example, the GPT-175B model has 175 billion parameters requiring 320GB of storage and at least 5 A100 GPUs with 80GB of memory each for inference. This computation power is expensive, making this solution only viable for some organizations. Hence, deployment of such models falls outside the purview of small organizations and individuals. 

Several model compression techniques, such as quantization and pruning, have been proposed to compress LLMs. Previous solutions have required extensive retraining to recover lost accuracy. This retraining also requires massive computation power, which is not economical for real-world deployments. Therefore, a solution that can prune LLMs without retraining the model would be a game changer in deploying large language models. At the moment, it is estimated that it can take several weeks to prune a GPT3 model. Enter SparseGPT.

SparseGPT is a post-training pruning method for compressing large language models such as GPT3 efficiently and accurately. The method can be used to prune large language models in one-shot with minimal accuracy loss. For example, you can use SparseGPT to prune OPT-175B to 50% sparsity with a 0.13 decrease in perplexity.

Thus 100 billion weights from the model can be ignored at inference time, increasing the model's throughput while reducing latency. SparseGPT can be applied to a GPT model with 175B parameters on a single GPU in a couple of hours with negligible accuracy loss.

In this article, you will learn about the SparseGPT algorithm and how it works. By the end of the article, you will be familiar with how this method reduces the size of large language models, making it possible to deploy them efficiently on commodity CPUs.

Let’s dive in.

Methods For Pruning Large Language Models

To understand how SparseGPT works, you have to understand post-training pruning and layer-wise pruning. 

Post-training pruning works by creating a compressed version of a model that satisfies a given compression predicate. The method has been used to reduce the computation cost of quantization-aware training. Existing post-training pruning work mostly focuses on CNN and transformer models with less than 100 million parameters. There is also existing work on quantization of large language models, but it mostly focuses on reducing the precision of the numerical representation of individual weights.

Layer-wise pruning splits the model compression problem into layer-wise subproblems. The quality of the solution is measured by the error between the output of the uncompressed and compressed layer by solving a constrained optimization problem. The objective is to find a sparsity mask for each layer that satisfies the constraint and the dense weights. The sparsity mask of a tensor is the binary tensor of the same dimensions with 0 at the indices of the sparsified entries and 1 at other indices. After solving each of the layer-wise problems, the model is rebuilt from the compressed layers. The reconstructed model preserves the accuracy of the dense model if the layer-wise errors are small. 

Optimizing the sparsity mask and the weights jointly makes the problem NP-hard. Hence, solving this problem can only be done through approximations. AdaPrune has been proposed for post-training through magnitude-based weight selection and reconstruction of the remaining weights by applying SGD finetuning. Other methods such as Optimal Brain Compression, have also been proposed to remove weights one at a time and fully reconstruct the weights after each iteration. Unfortunately, these methods have a high runtime on LLMs. And this is where SparseGPT enters the scene.

How Does the SparseGPT Algorithm Work?

Assuming that a pruning mask is fixed, the update has a closed-form solution via the Hessian. The problem can be split into row-wise subproblems. Reconstructing pruned weights is a complex problem because it requires the inversion of a matrix while solving for each row, and the inverse of a masked Hessian is not equal to the masked version of the full inverse. This problem is referred to as the Row-Hessian challenge. The task would be easier if all row masks were the same (for example, if pruning happens column-wise), meaning that only a single shared inverse would be computed. Introducing such constraints in the mask selection would mean sparsifying weights in entire columns, which is difficult and reduces the model's accuracy. SparseGPT solves the row-hessian challenge by reusing Hessians between rows and distinct pruning masks, leading to an accurate and efficient algorithm.

Illustration of the Row-Hessian challenge.
*Figure 1: Illustration of the Row-Hessian challenge. Rows are pruned independently, pruned weights are in white. Hessian information is used for weight reconstruction. Since masks may be different per row, the inverse computation must be performed independently for each row.

SparseGPT aims to prune large language models at a fraction of the cost and time spent by previous methods. This makes SparseGPT applicable to models with 100 billion parameters. It works by setting the weights not in the mask to 0 and setting the rest to their current value. 

The SparseGPT algorithm works as follows, given a fixed pruning mask: 

  • Prune weights in each column of the weight matrix incrementally using a sequence of Hessian inverses
  • Update the remainder of the weights in the rows located to the right of the column being processed

Pruned weights (blue) are updated to compensate for the pruning error, while unpruned weights (light blue) generate no updates.

Visualization of the SparseGPT reconstruction alogrithm.
*Figure 2: Visualization of the SparseGPT reconstruction algorithm.

SparseGPT is local because it performs weight updates after each pruning step, maintaining the input-output relationship between each layer. The high parametrization of GPT models makes it possible to make the updates without any global gradient information. The cost of the reconstruction process consists of the computation of the initial Hessian, iterating through the inverse Hessian sequence, and pruning.

The pruning mask is chosen adaptively while performing the reconstruction process. This is done to reap the benefits adaptive weight selection offers during pruning. The selection is done through iterative blocking, where a pruning mask is chosen based on the reconstruction error while using the diagonal values in the Hessian sequence. The weights are then updated before the pruning mask for the next block is chosen. This is done iteratively, enabling non-uniform selection per column using the corresponding Hessian information and prior weight updates.

Image illustrating mask selection.
*Figure 3: Mask selection.

The SparseGPT algorithm pseudocode is shown below. 

Full SparseGPT alogrithm pseudocode.
*Figure 4: Full SparseGPT Algorithm Pseudocode.

The pseudocode is interpreted as:

  1. Create a pruning mask M with zeros and ones.
  2. Construct a matrix E to store the block quantization errors.
  3. Calculate the inverse Hessian sequence information via Cholesky decomposition.
  4. Loop the blocks while updating the pruning mask and the error.
  5. Select the largest weights for each block and set their corresponding values in the pruning mask to 1.
  6. Prune the remaining weights and set their corresponding values in the pruning mask to 0.
  7. Compute the pruning error for each weight in the block and set the error in the E matrix.
  8. Freeze the weights that were not pruned.
  9. Update the weights in the block using the Cholesky decomposition of the inverse Hessian matrix.
  10. Update weights not updated in the previous loop after processing all the blocks.
  11. Set pruned weights to 0 by element-wise multiplying the weight matrix with the pruning mask.

SparseGPT is implemented in PyTorch using the Hugging Face Transformers library. Experiments on a single NVIDIA A100 GPU with 80GB of memory show that ​​SparseGPT can fully sparsify 175-billion-parameter models in approximately 4 hours without fine tuning.

SparseGPT Evaluation

SparseGPT is evaluated using perplexity on the raw-WikiText2 test set, performing better than magnitude pruning.

Graph of OPT perplexity results on raw-WikiText2.
*Figure 5: OPT perplexity results on raw-WikiText2.

The researchers also evaluated SparseGPT and magnitude pruning on OPT-175B. The left graph shows how the performance of both methods scales with the degree of sparsity. Magnitude pruning achieves 10% sparsity, after which the model’s loss starts reducing, while SparseGPT enables up to 60% sparsity.  

Line graph comparison of SparseGPT against magnitude pruning on OPT-175B.
*Figure 6: Comparison of SparseGPT against magnitude pruning on OPT-175B.
Line graph on compressing the entire OPT model family to different sparsity patterns using SparseGPT.
*Figure 7: Above: Compressing the entire OPT model family (135M, 350M, ..., 66B, 175B) to different sparsity patterns using SparseGPT.

For BLOOM-176B, magnitude pruning achieves 30% sparsity and SparseGPT 50%. SparseGPT can remove 100 billion weights from BLOOM and OPT models without significant loss in accuracy.

Line graph showing uniformly compressing BLOOM-176B to various sparsity levels with SparseGPT and magnitude pruning.
*Figure 8: Uniformly compressing BLOOM-176B to various sparsity levels with SparseGPT and magnitude pruning.

The researchers obtained the following results for ZeroShot experiments, showing that SparseGPT retains accuracy closer to the original while magnitude pruning doesn’t.

Table of zeroshot results on several datasets for sparsified variants of OPT-175B.
*Figure 9: ZeroShot results on several datasets for sparsified variants of OPT-175B.

Final Thoughts

Amazingly, 100 billion parameters from large language models can be ignored at inference. This accomplishment paves the way for the broad usage of these models because of their lower computational and storage requirements. The researchers conjecture that fine-tuning the large models after pruning can lead to the recovery of more accuracy while allowing the models to be pruned between 80% and 90%.

At Neural Magic, we continue to push the envelope in making large language models generally available by providing models that have been pruned, hence ready to use, and providing the tools you need to prune and quantize your models. Join our community to interact with other deep learning practitioners and stay up to date with our work in optimizing large language models.

*Resource for all figures: "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." arXiv, 19 January 2023,

Was this article helpful?



Join the Conversation