NeuralFlix

3.5x Faster NLP BERT Using a Sparsity-Aware Inference Engine on AMD Milan-X

We used DeepSparse, our sparsity-aware inference engine, to answer an important question we've all been pondering: What do people think of pineapple on pizza?
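
Below is a minimal sketch of what that demo looks like through DeepSparse's Python Pipeline API, running sentiment analysis with a sparse BERT model. The SparseZoo model stub is illustrative only; browse sparsezoo.neuralmagic.com for the current stub that matches your task.

```python
from deepsparse import Pipeline

# Illustrative SparseZoo stub for a 90% pruned BERT sentiment model
# (an assumption; check sparsezoo.neuralmagic.com for current stubs).
MODEL_STUB = (
    "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/pruned90-none"
)

# DeepSparse compiles the ONNX model and exploits its sparsity on CPU.
pipeline = Pipeline.create(task="sentiment-analysis", model_path=MODEL_STUB)

opinions = [
    "Pineapple on pizza is a crime against cuisine.",
    "Sweet, salty, and a little tangy: pineapple pizza is perfect.",
]
print(pipeline(sequences=opinions))
```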

We compared the performance of dense BERT to a 90% sparse BERT that recovers 99% of the baseline accuracy. Running the sparse model in a sparsity-aware inference engine yielded a 3.5x speedup over the dense baseline.
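
As a rough sketch of how such a comparison can be timed, the snippet below compiles a dense and a sparse BERT in DeepSparse and measures mean latency per batch. The SparseZoo stubs, batch size, and iteration count are assumptions for illustration, not the exact benchmark configuration used on Milan-X.

```python
import time

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs, model_to_path

# Illustrative SparseZoo stubs (assumptions; see sparsezoo.neuralmagic.com).
DENSE_STUB = "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/base-none"
SPARSE_STUB = "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/pruned90-none"

BATCH_SIZE = 1
ITERATIONS = 100


def mean_latency(stub: str) -> float:
    """Compile a model in DeepSparse and time repeated forward passes."""
    onnx_path = model_to_path(stub)  # resolve the stub to a local ONNX file
    engine = compile_model(onnx_path, batch_size=BATCH_SIZE)
    inputs = generate_random_inputs(onnx_path, BATCH_SIZE)
    engine.run(inputs)  # warmup pass
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        engine.run(inputs)
    return (time.perf_counter() - start) / ITERATIONS


dense = mean_latency(DENSE_STUB)
sparse = mean_latency(SPARSE_STUB)
print(f"dense:  {dense * 1e3:.1f} ms/batch")
print(f"sparse: {sparse * 1e3:.1f} ms/batch ({dense / sparse:.1f}x speedup)")
```

DeepSparse also ships a `deepsparse.benchmark` command-line tool that reports throughput and latency for a model stub directly, without writing any timing code.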

More Neural Magic Software in Action Videos

YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance and Tiny Footprint
YOLOv3 on the Edge: DeepSparse Engine vs. PyTorch
State-of-the-Art NLP Compression Research in Action: Understanding Crypto Sentiment
