Using Sparse-Quantization in Inference: NeurIPS 2020


Did you know that most weights in a neural network are actually useless? In other words, most weights can be removed with little to no impact on the loss. But how would you go about optimizing a deep learning model in practice? Through a combination of pruning and quantization (or “sparse-quantization”), you can drastically improve the performance of deep learning networks. This process reduces the model’s footprint and the compute power it requires, making these networks simpler to run on lower-cost, readily available CPUs.

That’s what our machine learning lead Mark Kurtz recently spoke about at NeurIPS 2020. Mark explained the state-of-the-art research on pruning and quantization that he used to build components of the Neural Magic software. Below are takeaways from his session. Recording here.

Optimization via pruning and quantization

One method of optimizing a neural network is pruning. Pruning is the process of removing weight connections in a network to increase inference speed and decrease model storage size. In general, neural networks are heavily over-parameterized, so pruning can be thought of as removing unused parameters from the over-parameterized network. This can be done without impacting accuracy. In fact, you can prune ResNet-50 up to 90%, MobileNet up to 75%, and Transformers up to 60% without affecting baseline accuracy. If preserving baseline accuracy is not a requirement, these models can be pruned even further for greater speedups and a smaller footprint.
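To make the idea concrete, here is a minimal sketch of unstructured magnitude pruning, the most common baseline approach: the smallest-magnitude weights are assumed to matter least and are zeroed out until a target sparsity is reached. This is an illustrative NumPy example, not Neural Magic's implementation; the function name and tensor shapes are invented for the demo.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a weight tensor
    until `sparsity` (a fraction in [0, 1]) of them are zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep only larger weights
    return weights * mask

# Demo: prune a random 64x64 layer to ~90% sparsity
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned = magnitude_prune(w, 0.90)
achieved = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"sparsity: {achieved:.2%}")
```

In practice this pruning is applied gradually during training (with the surviving weights fine-tuned between steps) rather than in one shot, which is what makes high sparsities like 90% achievable without accuracy loss.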

Most weights are also stored at a higher numerical precision than they need. A process known as quantization reduces the precision of a network’s weights without significant impact on accuracy. For example, you can quantize ResNet-50 from 32-bit down to 2-bit weights, and accuracy drops by only 3%.

Why more data science teams aren’t optimizing

If pruning and quantization are so effective at reducing model size without hurting performance, why aren’t more data scientists using them? In reality, many teams lack the resources to do these optimizations. According to a recent survey, 59% of data science teams aren’t optimizing their models for production.

In the survey, very few teams of under 10 people used optimizations. As team size grew, these numbers increased slightly but remained relatively low. This could be for a few reasons. Pruning takes a lot of learned intuition and iteration, and when combined with quantization, the process becomes even more difficult to execute. It takes trial and error, which many teams don’t have time for. It also doesn’t help that machine learning frameworks offer little built-in support or ease of use for these optimizations.

Making pruning and quantization easier to execute

That’s the problem we’ve been working to solve at Neural Magic. Our Sparsify product helps you visualize and simplify model optimizations like pruning and quantization. Within one UI experience, you can upload a model and visualize the performance and accuracy tradeoffs that come with each optimization. Once satisfied, you can export a configuration (what we call an “optimization recipe”) and integrate it into your existing training workflow with only a few lines of code.

Here’s a video that shows Sparsify in action:

Kierstin walks through the easy process of uploading, optimizing, and exporting a deep learning model.

Accomplishing our Mission

At the end of January 2021, Neural Magic plans to make portions of its software open source and available on GitHub. This moves us closer to achieving our mission of shattering the hardware barriers holding back the field of machine learning by making the power of deep learning simple, accessible, and affordable for anyone.

Please fill out the form below to receive a one-time email when our engine and deep learning optimization software are ready for download.