AI Performance Research Papers

Sparse Finetuning for Inference Acceleration of Large Language Models

We consider the problem of accurate sparse finetuning of large language models (LLMs), that is, finetuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based finetuning may fail to recover accuracy, especially at high sparsities. To address this, we perform a detailed study of distillation-type losses, determining an L2-based distillation approach we term SquareHead which enables accurate recovery even at higher sparsities, across all model types.
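
A minimal sketch of the idea behind an L2 (squared-error) distillation loss over teacher and student intermediate representations. The function names and the teacher-norm normalization here are illustrative assumptions, not the paper's exact formulation:

```python
# Hedged sketch of a per-layer L2 distillation loss between a dense
# "teacher" and a sparse "student". Plain Python lists stand in for
# activation tensors to keep the example self-contained.

def squarehead_layer_loss(student_acts, teacher_acts, eps=1e-8):
    """Normalized squared error for one layer's activations.

    Returns ||s - t||^2 / (||t||^2 + eps), so layers with large
    activation magnitudes do not dominate the total loss.
    """
    num = sum((s - t) ** 2 for s, t in zip(student_acts, teacher_acts))
    den = sum(t * t for t in teacher_acts) + eps
    return num / den

def squarehead_loss(student_layers, teacher_layers):
    """Sum the per-layer losses across all layers of the network."""
    return sum(squarehead_layer_loss(s, t)
               for s, t in zip(student_layers, teacher_layers))
```

In practice this term would be added to (or replace) the standard task loss during sparse finetuning.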

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
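
To illustrate the 2:4 semi-structured pattern, the sketch below keeps the two largest-magnitude weights in every group of four. Note that SparseGPT's actual selection criterion is second-order, not plain magnitude; this only shows the shape of the pattern:

```python
def prune_2_of_4(weights):
    """Zero out the two smallest-magnitude weights in every group of four.

    `weights` is a flat list whose length is a multiple of 4; the result
    has exactly 50% sparsity in the hardware-friendly 2:4 pattern.
    """
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # keep the indices of the two largest-magnitude entries
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out
```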

Sparse*BERT: Sparse Models are Robust

This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Our experimentation shows that models that are pruned during pretraining using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches. We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text. Moreover, we show that SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.
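
Gradual magnitude pruning is typically driven by a sparsity schedule that ramps from zero to the target level during training. The commonly used cubic schedule below is a sketch with illustrative parameter names, not the paper's exact training recipe:

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity,
                 initial_sparsity=0.0):
    """Cubic sparsity schedule often used for gradual magnitude pruning.

    Sparsity rises quickly early on (when the network can recover easily)
    and flattens out as it approaches the final target.
    """
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3
```

At each pruning step, the smallest-magnitude weights are zeroed until the scheduled sparsity is reached.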

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

We show how to compound multiple sparsification techniques to compress transformer-based NLP models for better inference performance. Results: 10x model-size compression with < 1% relative accuracy drop versus dense BERT-base, 10x end-to-end CPU inference speedup with < 2% relative accuracy drop, and 29x inference speedup with < 7.5% relative accuracy drop.

How Well Do Sparse Imagenet Models Transfer? (CVPR 2022)

In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups.

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information (NeurIPS 2021)

We propose two new algorithms as part of a framework called M-FAC. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.

Distributed Principal Component Analysis with Limited Communication (NeurIPS 2021)

We study efficient distributed algorithms for the fundamental problem of principal component analysis and leading eigenvector computation on the sphere, when the data are randomly distributed among a set of computational nodes. We propose a new quantized variant of Riemannian gradient descent to solve this problem, and prove that the algorithm converges with high probability under a set of necessary spherical-convexity properties. We give bounds on the number of bits transmitted by the algorithm under common initialization schemes, and investigate the dependency on the problem dimension in each case.

Asynchronous Decentralized SGD with Quantized and Local Updates (NeurIPS 2021)

We show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous. We implement this algorithm and deploy it in a supercomputing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully-tuned large-batch SGD for certain tasks.

AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks (NeurIPS 2021)

Existing sparse training methods are mainly empirical and often have lower accuracy relative to the dense baseline. We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models.
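
The alternation can be sketched as follows: during compressed phases a top-k magnitude mask is applied to the weights, and during decompressed phases training proceeds densely. The phase lengths, mask criterion, and names below are illustrative assumptions, not the paper's exact schedule:

```python
def topk_mask(weights, sparsity):
    """Magnitude mask keeping the (1 - sparsity) fraction of largest weights."""
    k = int(round(len(weights) * (1.0 - sparsity)))
    keep = set(sorted(range(len(weights)),
                      key=lambda i: abs(weights[i]), reverse=True)[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(weights))]

def acdc_apply(weights, epoch, sparsity=0.9, phase_len=5):
    """Alternate phases: even phases train dense (no mask applied),
    odd phases train the compressed (masked) model."""
    sparse_phase = (epoch // phase_len) % 2 == 1
    if not sparse_phase:
        return weights
    mask = topk_mask(weights, sparsity)
    return [w * m for w, m in zip(weights, mask)]
```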

Towards Tight Communication Lower Bounds for Distributed Optimization (NeurIPS 2021)

On the Predictability of Pruning Across Scales (ICML 2021)

We show that the error of magnitude-pruned networks follows a scaling law, and that this law is of a fundamentally different nature than that of unpruned networks.

Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks (Survey Paper, 2021)

The future of deep learning is sparse! See our overview of the field and of upcoming opportunities to gain 10-100x performance and fuel the next AI revolution. HPC techniques will be key, since large-scale training is effectively a supercomputing workload.

WoodFisher: Efficient Second-Order Approximation for Neural Network Compression (NeurIPS 2020)

Learn about WoodFisher, an efficient second-order approximation method for neural network compression.

Relaxed Scheduling for Scalable Belief Propagation (NeurIPS 2020)

Learn about efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm.

Adaptive Gradient Quantization for Data-Parallel SGD (NeurIPS 2020)

In this paper, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.
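
For context, here is a deterministic sketch of the uniform, max-scaled gradient quantization that schemes like ALQ and AMQ build on. The paper's methods adapt the quantization levels to the gradient distribution; the fixed levels and deterministic rounding below are simplifying assumptions:

```python
def quantize_gradient(grad, levels=4):
    """Uniformly quantize a gradient vector to `levels` buckets per sign,
    scaled by the maximum magnitude. Only the bucket index and the scale
    would need to be communicated, reducing bandwidth."""
    scale = max(abs(g) for g in grad)
    if scale == 0.0:
        return list(grad)
    out = []
    for g in grad:
        # deterministic rounding; practical schemes round stochastically
        q = round(abs(g) / scale * levels) / levels
        out.append(scale * q * (1 if g >= 0 else -1))
    return out
```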

Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference (ICML 2020)

Learn how to gain significant performance by inducing and exploiting activation sparsity for fast neural network inference.
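
The quantity being exploited is easy to measure: the fraction of post-ReLU zeros that a sparsity-aware inference kernel can skip. A minimal sketch, with illustrative names:

```python
def relu_sparsity(activations):
    """Fraction of zero entries after applying ReLU: the 'activation
    sparsity' that a sparsity-aware kernel can skip at inference time."""
    post = [max(0.0, a) for a in activations]
    return sum(1 for a in post if a == 0.0) / len(post)
```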

A Constructive Prediction of the Generalization Error Across Scales (ICLR 2020)

In this work, we present a functional form which approximates well the generalization error in practice. Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales.
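
As a toy illustration of such a functional form (not the paper's exact formula), consider a joint power law in data size n and model size m that decays toward an irreducible error floor; all coefficients below are placeholder assumptions:

```python
def scaling_error(n, m, a=1.0, alpha=0.5, b=1.0, beta=0.5, c=0.1):
    """Illustrative joint power law: generalization error decays with both
    data size n and model size m toward an irreducible floor c."""
    return a * n ** (-alpha) + b * m ** (-beta) + c
```

Fitting such a form on small-scale runs lets one extrapolate error at larger model/data scales.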
