NeuralFlix
How to Compress Your BERT NLP Models For Very Efficient Inference
Presenter: Mark Kurtz
This video covers state-of-the-art compression research that addresses common Transformer challenges, including their large size and the difficulty of deploying them efficiently. Specifically, we dig into how you can compress BERT models 10x for much more efficient inference, gaining a 9-29x CPU inference speedup along the way.