NeuralFlix
3.5x Faster NLP BERT Using a Sparsity-Aware Inference Engine on AMD Milan-X
Presenter:
We used DeepSparse, our sparsity-aware inference engine, to answer an important question we've all been pondering: What do people think of pineapple on pizza?
We compared the performance of dense BERT to a 90% sparse BERT that recovers 99% of the baseline accuracy. Running the sparse model in a sparsity-aware inference engine yielded a 3.5x speedup over the dense baseline.
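To see where that speedup comes from, here is a minimal, illustrative sketch in plain Python (not DeepSparse's actual kernels): a dense dot product multiplies every weight, zeros included, while a sparsity-aware kernel stores only the nonzero weights and skips the rest. With 90% of the weights pruned, the sparse kernel does roughly 10% of the multiplies while producing the same result.

```python
import random

def dense_dot(w, x):
    # Dense kernel: multiplies every weight, including the zeros.
    acc, ops = 0.0, 0
    for wi, xi in zip(w, x):
        acc += wi * xi
        ops += 1
    return acc, ops

def sparse_dot(w, x):
    # Sparsity-aware kernel: keeps only (index, value) pairs for
    # nonzero weights and multiplies just those.
    nonzeros = [(i, wi) for i, wi in enumerate(w) if wi != 0.0]
    acc, ops = 0.0, 0
    for i, wi in nonzeros:
        acc += wi * x[i]
        ops += 1
    return acc, ops

random.seed(0)
n = 1000
# ~90% of weights pruned to zero, mirroring the 90% sparse BERT.
w = [random.random() if random.random() > 0.9 else 0.0 for _ in range(n)]
x = [random.random() for _ in range(n)]

dense_acc, dense_ops = dense_dot(w, x)
sparse_acc, sparse_ops = sparse_dot(w, x)
assert abs(dense_acc - sparse_acc) < 1e-9  # identical result
print(dense_ops, sparse_ops)  # sparse kernel does ~10% of the work
```

Real engines like DeepSparse get their speedups from far more than skipped multiplies (compressed memory traffic, cache-friendly execution, vectorization), but the core idea is the same: pruned weights are work you never have to do.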