Sparsify Hugging Face BERT for Better CPU Performance & Smaller File Size


Get Started: Sparsify Hugging Face BERT Using Your Data

Check out our previous blog post to learn about compound sparsification and how it enables faster and smaller BERT models: Pruning Hugging Face BERT: Using Compound Sparsification for Faster CPU Inference with Better Accuracy.

Ready to sparsify Hugging Face BERT? You can replicate the performance and compression results mentioned in the video with your own data using Neural Magic’s open source and freely available tools.

Visit our BERT Getting Started page in the SparseZoo and:

  1. Run a quick benchmarking exercise to confirm that your setup is correct and that the performance results are compelling. You can find the code here.
  2. To run with your own data, follow a recipe that will help you encode the transferable hyperparameters necessary for creating sparse models. You will create a “teacher” model pre-trained on your dataset, then distill its knowledge into the pruned “student” BERT model on the same dataset.
  3. Export the “student” model and deploy it on your CPU hardware using an ONNX-compatible inference engine such as DeepSparse. To achieve the performance mentioned in the video, we encourage you to use the freely available DeepSparse Engine, which is built specifically to accelerate sparsified models.
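
To make the teacher/student idea in step 2 more concrete, here is a minimal sketch of a generic knowledge distillation loss for a single example, written in plain Python. It is not the code used by the recipes: the function names and the `temperature` and `hardness` parameters are assumptions chosen for illustration, and a real training run would compute this over batches inside a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, hardness=0.5):
    """Blend a hard-label loss with a teacher-matching loss (illustrative only).

    hardness is a hypothetical name for the weight on the teacher term:
    0.0 uses only the ground-truth label, 1.0 uses only the teacher.
    """
    # Hard loss: cross-entropy between the student and the true label.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label])

    # Soft loss: KL(teacher || student) on temperature-softened outputs,
    # scaled by temperature squared, a common convention in distillation.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_soft, student_soft)
        if t > 0.0
    ) * temperature ** 2

    return (1.0 - hardness) * hard_loss + hardness * soft_loss


if __name__ == "__main__":
    # Made-up logits for a two-class task; the true class is index 0.
    teacher = [2.0, 0.0]
    print(distillation_loss([2.0, 0.0], teacher, label=0))  # student agrees
    print(distillation_loss([0.0, 2.0], teacher, label=0))  # student disagrees
```

The student that agrees with the teacher receives the lower loss, which is the signal that steers the pruned student toward the teacher's behavior during training.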

No matter where you are in the process, if you run into any issues, we are here to help. Join our Slack or Discourse forums to get direct access to our engineering and support teams.