On-Demand: How to Compress BERT NLP Models for Efficient Inference

Transformer-based language models are a key building block for NLP tasks. While they are highly accurate, they are also large and computationally intensive, which makes them hard and expensive to deploy.

To address this, the research community has developed numerous compression methods that shrink model size and increase inference speed, including, but not limited to, distillation, quantization, and structured and unstructured pruning.

But what happens when we apply different compression methods in conjunction? In other words, can we compound them to get even smaller and even faster models? Short answer: yes.
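To make "compounding" concrete, here is a minimal sketch of chaining two of the methods above, unstructured magnitude pruning followed by dynamic INT8 quantization, using standard PyTorch and Hugging Face utilities. This is an illustration of the idea only, not the exact pipeline discussed in the video; the model name and the 80% sparsity level are arbitrary choices, and in practice pruning is applied gradually during fine-tuning, often alongside distillation from a larger teacher model.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# Load a dense BERT classifier (illustrative model choice).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Step 1: unstructured pruning -- zero out the lowest-magnitude 80% of weights
# in every linear layer of the encoder.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # make the sparsity permanent

# Step 2: dynamic quantization -- store linear-layer weights as INT8 and
# quantize activations on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The speedups reported below additionally depend on an inference runtime that can exploit both the sparsity and the reduced precision; the sketch only shows how the two compression steps stack.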

Our team has successfully “compounded” the above compression methods to produce highly compressed yet accurate BERT models.

The end result: an 8x end-to-end inference speedup and 10x size compression with a <2% relative drop in accuracy, and a 29x speedup with a <8.5% relative accuracy drop.

Watch the above video for a walkthrough of:

  1. The technical details behind our recent research on enabling compound sparsification
  2. How to put this research into practice using simple compression recipes and open-source tools (a rough sketch of what such a recipe looks like follows this list)
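As a hedged illustration of the "compression recipe" idea, the snippet below expresses a recipe as a plain Python dict and applies it during training. The field names (`target_sparsity`, `pruning_epochs`, `quantize`) and the `apply_recipe` helper are hypothetical, chosen for this sketch; the open-source tools covered in the video define their own recipe format and drive the schedule for you.

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical recipe: what to compress and how aggressively.
recipe = {
    "target_sparsity": 0.8,   # fraction of linear-layer weights to remove overall
    "pruning_epochs": 10,     # ramp sparsity up over this many fine-tuning epochs
    "quantize": True,         # apply INT8 dynamic quantization after training
}

def apply_recipe(model: torch.nn.Module, recipe: dict, epoch: int) -> None:
    """Prune linear layers toward the recipe's target sparsity, ramping up
    linearly across the scheduled pruning epochs (a simplified schedule)."""
    progress = min(1.0, (epoch + 1) / recipe["pruning_epochs"])
    current_sparsity = recipe["target_sparsity"] * progress
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=current_sparsity)
            prune.remove(module, "weight")  # keep the pruned weights at zero

# Inside a fine-tuning loop you would call apply_recipe(model, recipe, epoch)
# once per epoch, then quantize the final sparse model if recipe["quantize"].
```

The point of a recipe is that the compression schedule lives in a small, declarative artifact that can be reused across models and tasks, rather than being hard-coded into the training script.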

If you are impressed with the impact we are making in the field of Machine Learning, check out our open positions and give us a star on GitHub!
