On-Demand Discussion: Using “Compound Sparsification” with Hugging Face BERT for Faster CPU Inference with Better Accuracy

During the session, Mark Kurtz, Neural Magic’s ML Lead, showed:

  • The current state of pruning Hugging Face BERT models for better inference performance
  • How compound sparsification* enables faster and smaller models
  • How to leverage Neural Magic recipes and open-source tools to create faster and smaller BERT models in your own pipelines (see the sketch after this list)
  • The short-term roadmap for even more performant BERT models

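For the recipe-driven workflow referenced above, the general pattern with Neural Magic's open-source SparseML library is to wrap an existing PyTorch fine-tuning loop with a recipe-managed optimizer. The snippet below is a minimal sketch, not the session's exact setup: it assumes SparseML's `ScheduledModifierManager` API, and `recipe.yaml`, the learning rate, and `steps_per_epoch` are placeholder values.

```python
# Hedged sketch: applying a sparsification recipe to a Hugging Face fine-tuning
# loop with SparseML. "recipe.yaml" is a placeholder path; in practice, recipes
# for pruned BERT models can be pulled from Neural Magic's SparseZoo.
import torch
from transformers import AutoModelForSequenceClassification
from sparseml.pytorch.optim import ScheduledModifierManager

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# The manager reads the pruning/distillation modifiers from the recipe and
# applies them on schedule by wrapping the optimizer.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)  # e.g. len(train_dataloader)

# ... run the usual fine-tuning loop with the wrapped optimizer ...

manager.finalize(model)  # remove pruning hooks, leaving the sparse weights in place
```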
Date recorded: September 29, 2021
Speaker: Mark Kurtz, ML Lead, Neural Magic

*When it comes to Hugging Face BERT, we combined distillation with both unstructured pruning and structured layer dropping. This combination of multiple sparsification techniques is what we’ve termed “compound sparsification.” Just as with compound scaling for neural networks, the combination enables smaller, faster, and more accurate models for deployment.
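To make the idea concrete, the sketch below shows the three ingredients of compound sparsification on a Hugging Face BERT model using plain PyTorch and transformers APIs rather than Neural Magic's recipe tooling. The sparsity level, number of dropped layers, temperature, and loss weighting are illustrative placeholders, not the values used in the session.

```python
# Conceptual sketch of compound sparsification: structured layer dropping +
# unstructured magnitude pruning + distillation from a dense teacher.
import torch
import torch.nn.functional as F
from torch.nn.utils import prune
from transformers import AutoModelForSequenceClassification

teacher = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
student = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# 1) Structured layer dropping: keep every other transformer layer (6 of 12).
kept = [layer for i, layer in enumerate(student.bert.encoder.layer) if i % 2 == 0]
student.bert.encoder.layer = torch.nn.ModuleList(kept)
student.config.num_hidden_layers = len(kept)

# 2) Unstructured pruning: zero out 80% of the smallest-magnitude weights
#    in every linear layer of the remaining encoder.
for module in student.bert.encoder.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)

# 3) Distillation: train the pruned student against the dense teacher's logits,
#    blended with the ordinary hard-label loss.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```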