ML Performance Research Papers

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

We show how you can compound multiple sparsification techniques to compress transformer-based NLP models for better inference performance. Results: 10x model-size compression with < 1% relative accuracy drop compared to dense BERT-base, 10x end-to-end CPU-inference speedup with < 2% relative accuracy drop, and 29x inference speedup with < 7.5% relative accuracy drop.
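
To give a concrete flavor of second-order pruning, the sketch below scores weights with the classic Optimal Brain Surgeon saliency under a diagonal empirical-Fisher curvature estimate. This is only an illustration of the general idea; the paper's oBERT method uses a much more accurate and scalable blockwise approximation.

```python
import numpy as np

def second_order_prune(weights, grads, sparsity=0.9, damp=1e-4):
    """Illustrative OBS/OBD-style pruning with a *diagonal* empirical-Fisher
    curvature estimate. `grads` holds per-sample gradients, shape (m, d).
    This is only a sketch of the saliency idea, not the paper's algorithm."""
    fisher_diag = (grads ** 2).mean(axis=0) + damp   # diagonal of F ~ curvature
    saliency = 0.5 * weights ** 2 * fisher_diag      # estimated loss increase if w_i is zeroed
    k = int(sparsity * weights.size)
    prune_idx = np.argsort(saliency)[:k]             # remove the lowest-saliency weights
    mask = np.ones_like(weights, dtype=bool)
    mask[prune_idx] = False
    return weights * mask, mask
```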


How Well Do Sparse ImageNet Models Transfer? (CVPR 2022)

In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups.
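
In practice, "transferring a sparse model" usually means reusing the pruned backbone on a downstream task while keeping its zero pattern fixed. The PyTorch snippet below is a minimal linear-probe sketch of that setup; `backbone`, `feat_dim`, and `num_classes` are placeholders, and it is not the paper's experimental protocol.

```python
import torch
import torch.nn as nn

def sparse_transfer(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Hypothetical helper: reuse a pruned (sparse) backbone on a new task.
    Zeroed weights stay zero because the backbone is frozen (linear probe);
    full fine-tuning would instead re-apply the sparsity masks after each step."""
    for p in backbone.parameters():
        p.requires_grad = False                  # keep the sparse weights (and zeros) fixed
    head = nn.Linear(feat_dim, num_classes)      # new dense task head
    model = nn.Sequential(backbone, head)
    optimizer = torch.optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)
    return model, optimizer
```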


M-FAC: Efficient Matrix-Free Approximations of Second-Order Information (NeurIPS 2021)

We propose two new algorithms as part of a framework called M-FAC. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.
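
As a rough illustration of matrix-free second-order computations (not M-FAC's actual algorithms), the NumPy sketch below applies the inverse of a dampened empirical Fisher to a vector using only the stored gradients and repeated Sherman-Morrison rank-one updates, so no d x d matrix is ever formed.

```python
import numpy as np

def build_inverse_fisher_vector_product(grads: np.ndarray, damp: float = 1e-4):
    """Sketch: apply the inverse of the empirical Fisher
        F = damp * I + (1/m) * sum_k g_k g_k^T
    to arbitrary vectors via repeated Sherman-Morrison rank-one updates.
    Stores only the m gradients (O(m*d) memory), never a d x d matrix."""
    m, d = grads.shape
    us = np.empty((m, d))    # us[k] = F_{k-1}^{-1} g_k
    denoms = np.empty(m)     # denoms[k] = m + g_k . us[k]
    for k in range(m):
        y = grads[k] / damp
        for j in range(k):
            y -= us[j] * (grads[j] @ y) / denoms[j]
        us[k] = y
        denoms[k] = m + grads[k] @ y

    def apply(v: np.ndarray) -> np.ndarray:
        y = v / damp
        for j in range(m):
            y -= us[j] * (grads[j] @ y) / denoms[j]
        return y

    return apply
```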


Distributed Principal Component Analysis with Limited Communication (NeurIPS 2021)

We study efficient distributed algorithms for the fundamental problem of principal component analysis and leading eigenvector computation on the sphere, when the data are randomly distributed among a set of computational nodes. We propose a new quantized variant of Riemannian gradient descent to solve this problem, and prove that the algorithm converges with high probability under a set of necessary spherical-convexity properties. We give bounds on the number of bits transmitted by the algorithm under common initialization schemes, and investigate the dependency on the problem dimension in each case.
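
The toy NumPy simulation below sketches the setting: each node holds a local covariance matrix, quantizes the local gradient it "sends", and the averaged gradient is projected onto the tangent space of the unit sphere before renormalizing. It illustrates quantized Riemannian gradient descent for the leading eigenvector, not the paper's algorithm or its communication bounds.

```python
import numpy as np

def quantize(v, bits=4):
    """Crude uniform quantization of a vector (illustration only)."""
    scale = np.max(np.abs(v)) + 1e-12
    levels = 2 ** (bits - 1) - 1
    return np.round(v / scale * levels) / levels * scale

def distributed_leading_eigvec(local_covs, steps=200, lr=0.1, bits=4, seed=0):
    """Toy simulation of distributed leading-eigenvector estimation on the sphere:
    each node quantizes its local gradient A_i x before 'sending', the average is
    projected onto the tangent space at x, and x is retracted back to the sphere."""
    rng = np.random.default_rng(seed)
    d = local_covs[0].shape[0]
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grads = [quantize(C @ x, bits) for C in local_covs]   # quantized local messages
        g = np.mean(grads, axis=0)
        g_tan = g - (x @ g) * x                               # Riemannian (tangent) projection
        x = x + lr * g_tan
        x /= np.linalg.norm(x)                                # retract back to the sphere
    return x
```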


Asynchronous Decentralized SGD with Quantized and Local Updates (NeurIPS 2021)

We show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous. We implement this algorithm and deploy it in a supercomputing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully tuned large-batch SGD for certain tasks.
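
The sequential toy simulation below sketches the interaction pattern (local SGD steps followed by quantized pairwise model averaging between randomly chosen nodes); the real SwarmSGD is asynchronous and non-blocking, so treat this only as an illustration.

```python
import numpy as np

def quantize(v, bits=8):
    """Illustrative uniform quantization of the exchanged model difference."""
    scale = np.max(np.abs(v)) + 1e-12
    levels = 2 ** (bits - 1) - 1
    return np.round(v / scale * levels) / levels * scale

def swarm_style_sgd(local_grad_fns, dim, rounds=1000, local_steps=4, lr=0.01, seed=0):
    """Toy, sequential simulation of decentralized training: in each round a random
    pair of nodes takes a few local SGD steps, then the two nodes move toward their
    average using a quantized exchange of the model difference."""
    rng = np.random.default_rng(seed)
    n = len(local_grad_fns)
    models = [np.zeros(dim) for _ in range(n)]
    for _ in range(rounds):
        i, j = rng.choice(n, size=2, replace=False)     # random interacting pair
        for k in (i, j):
            for _ in range(local_steps):                # local SGD steps on local data
                models[k] = models[k] - lr * local_grad_fns[k](models[k])
        diff = quantize(models[i] - models[j])          # quantized exchange
        models[i] = models[i] - 0.5 * diff              # both nodes step toward
        models[j] = models[j] + 0.5 * diff              # the pair average
    return np.mean(models, axis=0)
```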


AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks (NeurIPS 2021)

Existing sparse training methods are mainly empirical and often have lower accuracy relative to the dense baseline. We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models.
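
A minimal PyTorch-style sketch of the alternating schedule is below: phases of dense (decompressed) training alternate with phases where a top-k magnitude mask is chosen and re-applied after every step (compressed). Phase lengths, mask selection, and the final fine-tuning schedule in the paper differ; this is only an illustration.

```python
import torch

def topk_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude (1 - sparsity) fraction of entries."""
    k = max(1, int((1.0 - sparsity) * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def acdc_style_training(model, loss_fn, loader, epochs, sparsity=0.9,
                        phase_len=5, lr=0.1):
    """Sketch of alternating compressed/decompressed training: odd phases apply a
    freshly chosen top-k magnitude mask and re-enforce it after each optimizer step."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    masks = None
    for epoch in range(epochs):
        compressed = (epoch // phase_len) % 2 == 1       # odd phases are sparse
        if compressed and (epoch % phase_len == 0 or masks is None):
            masks = {n: topk_mask(p.data, sparsity)
                     for n, p in model.named_parameters() if p.dim() > 1}
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if compressed:                               # re-apply masks so pruned
                with torch.no_grad():                    # weights stay at zero
                    for n, p in model.named_parameters():
                        if n in masks:
                            p.mul_(masks[n])
    return model
```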


Towards Tight Communication Lower Bounds for Distributed Optimization (NeurIPS 2021)


On the Predictability of Pruning Across Scales (ICML 2021)

We show that the error of magnitude-pruned networks follows a scaling law, and that this law is of a fundamentally different nature than that of unpruned networks.


Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks (Survey Paper, 2021)

The future of deep learning is sparse! See our overview of the field and the upcoming opportunities to gain 10-100x performance and fuel the next AI revolution. HPC techniques will be key, since large-scale training is essentially a supercomputing workload.


WoodFisher: Efficient Second-Order Approximation for Neural Network Compression (NeurIPS 2020)

Learn about WoodFisher, a method for efficiently approximating second-order (curvature) information for neural network compression.


Relaxed Scheduling for Scalable Belief Propagation (NeurIPS 2020)

Learn about efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular the fundamental belief propagation algorithm.


Adaptive Gradient Quantization for Data-Parallel SGD (NeurIPS 2020)

In this paper, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.
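
The sketch below illustrates the general mechanism of adaptive gradient quantization: gradients are stochastically quantized onto a set of levels, and the levels are periodically re-fit to the observed distribution of normalized gradient magnitudes. For simplicity it uses empirical quantiles as a stand-in for the sufficient statistics of a parametric distribution used by ALQ/AMQ.

```python
import numpy as np

def stochastic_quantize(v, levels, rng):
    """Stochastically quantize |v|/||v|| onto sorted `levels` (from 0.0 to 1.0).
    Returns the dequantized vector; a real implementation would transmit
    (norm, signs, level indices)."""
    levels = np.asarray(levels)
    norm = np.linalg.norm(v) + 1e-12
    r = np.abs(v) / norm
    hi = np.searchsorted(levels, r, side='left').clip(1, len(levels) - 1)
    lo = hi - 1
    p = (r - levels[lo]) / (levels[hi] - levels[lo] + 1e-12)
    q = np.where(rng.random(r.shape) < p, levels[hi], levels[lo])
    return np.sign(v) * q * norm

def adapt_levels(recent_grads, bits=3):
    """Adapt the quantization levels to recent normalized gradient magnitudes.
    Here: simple empirical quantiles, not the paper's parametric fit."""
    r = np.concatenate([np.abs(g) / (np.linalg.norm(g) + 1e-12)
                        for g in recent_grads])
    inner = np.quantile(r, np.linspace(0, 1, 2 ** bits)[1:-1])
    return np.concatenate(([0.0], np.sort(inner), [1.0]))
```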


Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference (ICML 2020)

Learn how to gain significant inference performance by inducing and exploiting activation sparsity in neural networks.
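
The core reason activation sparsity helps is that zero activations contribute nothing to the following layer's matrix product, so their columns can be skipped entirely. The NumPy sketch below illustrates this; production engines rely on specialized sparse kernels rather than this naive gather.

```python
import numpy as np

def sparse_activation_matvec(W, a):
    """Compute W @ a while skipping zero entries of the activation vector `a`
    (e.g., after ReLU). Only illustrates why high activation sparsity cuts work."""
    nz = np.flatnonzero(a)           # indices of nonzero activations
    return W[:, nz] @ a[nz]          # touch only the needed columns

# Example: with ~90% of activations zeroed by ReLU, only ~10% of the
# columns of W participate in the product.
rng = np.random.default_rng(0)
a = np.maximum(rng.standard_normal(1024) - 1.3, 0.0)   # sparse post-ReLU activations
W = rng.standard_normal((256, 1024))
assert np.allclose(sparse_activation_matvec(W, a), W @ a)
```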


A Constructive Prediction of the Generalization Error Across Scales (ICLR 2020)

In this work, we present a functional form that approximates the generalization error well in practice. Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales.
