Live Discussion: Using “Compound Sparsification” with Hugging Face BERT for Faster CPU Inference with Better Accuracy
Save your seat below to learn about Neural Magic’s latest research on combining multiple sparsification methods to improve NLP performance on CPUs. For Hugging Face BERT, we combined distillation with both unstructured pruning and structured layer dropping. We’ve termed this combination of multiple sparsification techniques “compound sparsification.” Just as with compound scaling for neural networks, the combination enables smaller, faster, and more accurate models for deployment.
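To make the two pruning-side techniques concrete, here is a minimal, self-contained Python sketch, an illustration only and not Neural Magic’s actual implementation or recipe format: unstructured magnitude pruning zeroes out individual low-magnitude weights, while structured layer dropping removes whole layers. The function names and toy values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning sketch: zero the `sparsity` fraction of
    weights with the smallest magnitude (ties may prune a few extra)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def drop_layers(layers, keep_every=2):
    """Structured layer dropping sketch: keep every `keep_every`-th layer,
    e.g. reducing a 12-layer encoder to 6 layers."""
    return layers[::keep_every]

# Toy example: prune half of the weights, then drop half of the layers.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)  # three of six weights become zero
layers = ["layer0", "layer1", "layer2", "layer3", "layer4", "layer5"]
kept = drop_layers(layers, keep_every=2)  # ["layer0", "layer2", "layer4"]
```

In practice these steps are applied during training (together with distillation from a larger teacher model) rather than as a one-shot post-processing pass, which is what lets the sparse model recover accuracy.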
Join us on Wednesday, September 29, for a live discussion of the techniques we used to prune Hugging Face BERT and ways to deploy it on cheaper and more readily available CPUs, at the edge or in the data center.
During the session, Mark Kurtz, Neural Magic’s ML Lead, will show:
- The current state of pruning BERT models for better inference performance
- How compound sparsification enables faster and smaller models
- How to leverage Neural Magic recipes and open-source tools to create faster and smaller BERT models in your own pipelines
- A short-term roadmap for even more performant BERT models
We will follow Mark’s presentation with an open discussion and a Q&A session.