Try it Now:
BERT Question Answering

Using sparsity and your data for maximizing CPU speeds

Introduction to Sparse Q&A with BERT

Set your sights on success with this end-to-end experience, which shows how a Neural Magic sparse model simplifies the sparsification process and yields question answering models that are up to 14x faster and 4.1x smaller. Sparsifying means removing redundant information from neural networks using algorithms such as pruning and quantization, among others. The result is faster inference and smaller files for deployment. Neural Magic creates models and recipes that let anyone plug in their own data and take a recipe-driven approach on top of Hugging Face’s robust training pipelines for the popular BERT NLP network.

In this end-to-end experience, you will start from a Neural Magic pre-trained BERT model in the SparseZoo, apply a private dataset with a recipe using SparseML, and deploy on a CPU with the DeepSparse Engine.

The pre-trained BERT model used in this experience was selected as the best tradeoff between accuracy, speed, and size.* You are not limited in your model selection, but our goal is to enable your success with a guided experience.

A sample from SparseZoo, a frequently updated model repository.

Using a recipe, you will apply a “question answering” use case with the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage.
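To make the "segment of text" property concrete, here is a minimal sketch of one record in the layout used by the Hugging Face squad dataset, along with a check that each answer really is a literal span of its passage. The record itself is made up for illustration, and the helper names answer_spans and is_consistent are hypothetical.

```python
# A hand-written record in the SQuAD v1.1 layout used by the Hugging Face
# "squad" dataset; the passage and question are invented for illustration.
example = {
    "id": "demo-0001",
    "title": "Snorlax",
    "context": "My name is Snorlax and I sleep most of the day.",
    "question": "What's my name?",
    "answers": {"text": ["Snorlax"], "answer_start": [11]},
}


def answer_spans(record):
    """Yield (expected_text, text_found_at_offset) for each labeled answer."""
    context = record["context"]
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        yield text, context[start:start + len(text)]


def is_consistent(record):
    """True when every answer is a literal substring at its stated offset."""
    return all(expected == found for expected, found in answer_spans(record))


print(is_consistent(example))
```

Running the same check over your own data before training is a cheap way to catch character offsets that drifted during preprocessing.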

Three basic steps will take you through the SQuAD use case to try out a sparsified BERT model. 

  1. Benchmark
  2. Apply your own data via transfer learning
  3. Inference

For Neural Magic Support, sign up or log in to get help with your questions in our Tutorials channel: Discourse Forum and/or Slack. Bugs and feature requests should go to our GitHub Issue Queue.

What You Need for Sparse Question Answering with BERT

The hardware you need for each step is:

  • Benchmark and Inference: A CPU that supports at least the AVX2 instruction set. AVX-512 and VNNI instructions give better performance (VNNI is needed only if you want to use quantization).
  • Transfer Learning: A CUDA and PyTorch-compatible GPU. A GPU with at least 16 GB of memory is recommended.
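If you want a quick look at which of these instruction sets your CPU reports, the sketch below reads the feature flags from /proc/cpuinfo. It assumes an x86 Linux machine, is independent of the DeepSparse tooling, and uses hypothetical helper names (parse_cpu_flags, summarize_support).

```python
import os


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def summarize_support(flags):
    """Map the instruction sets named in this guide to True/False."""
    return {
        "avx2": "avx2" in flags,
        "avx512": any(flag.startswith("avx512") for flag in flags),
        "vnni": any(flag.endswith("vnni") for flag in flags),
    }


CPUINFO_PATH = "/proc/cpuinfo"  # Linux only; absent on macOS and Windows
if os.path.exists(CPUINFO_PATH):
    with open(CPUINFO_PATH) as handle:
        print(summarize_support(parse_cpu_flags(handle.read())))
else:
    print("No /proc/cpuinfo here; use your platform's CPU inspection tool.")
```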

The DeepSparse Engine is tested on Python 3.6+ and ONNX 1.5.0+. It is recommended to install in a virtual environment to keep your system in order.

1. Benchmark

Benchmarking lets you compare models for both accuracy and inference performance. In this step, you will install the DeepSparse Engine, verify your hardware, select a model, and benchmark.

Install the DeepSparse Engine

In the command line, install the DeepSparse Engine with pip:

pip install deepsparse[transformers]

Verify Your Hardware

Verify your hardware to ensure optimal performance. Enter:


Select a Model

The DeepSparse-Hugging Face pipeline integration provides a simple API dedicated to several tasks. The following is an example using a pruned BERT model from Neural Magic’s SparseZoo for the question answering task.

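from deepsparse.transformers import pipeline

# SparseZoo model stub or path to ONNX file
# (assumption: this is the stub used by the benchmark command in this step)
model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned-aggressive_98"
num_cores = None  # uses all available CPU cores by default

# Get DeepSparse question-answering pipeline
qa_pipeline = pipeline(
    "question-answering",
    model_path=model_path,
    num_cores=num_cores,
)

# inference
my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")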


Install the dependencies for this example using pip and the requirements.txt file (found here):

pip3 install -r requirements.txt

To run a benchmark using the DeepSparse Engine with a pruned BERT model that uses all available CPU cores and batch size 1, run the following. Note that (found here) is a script for benchmarking sparsified Hugging Face Transformers performance with DeepSparse. (For a full list of options run python -h).

python zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned-aggressive_98 --batch-size 1
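To show what a batch size 1 benchmark measures, here is a minimal timing sketch. It is not the benchmarking script referenced above: benchmark and fake_model are hypothetical names, and fake_model is only a stand-in for a single inference call.

```python
import statistics
import time


def benchmark(run_once, warmup=3, iterations=20):
    """Time repeated calls to run_once and summarize latency and throughput."""
    for _ in range(warmup):  # discard start-up effects such as cache warm-up
        run_once()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(latencies_ms)
    return {
        "iterations": iterations,
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        "items_per_second": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


def fake_model():
    """Hypothetical stand-in for one batch-size-1 inference call."""
    sum(i * i for i in range(10_000))


print(benchmark(fake_model))
```

To time a real pipeline, you could pass a zero-argument function that calls it with a fixed question and context. The warm-up calls matter because the first few runs often include one-time setup cost.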

2. Apply Your Data via Transfer Learning

Training is the process of feeding data to a machine learning algorithm so that it can identify and learn good values for its parameters. Neural Magic sparse models and recipes simplify the sparsification process by enabling sparse transfer learning to create highly accurate pruned BERT models. We encourage you to use sparse transfer learning,* which is faster than following a full sparsification recipe.

* Sparse transfer learning is the quickest way to apply your own data to a pre-trained model, but it might reduce accuracy. If accuracy drops and you want to bring it back up, use a recipe instead. While more time-consuming, recipes let you encode the necessary hyperparameters manually, which helps ensure that no accuracy is lost.

In this step, you will clone SparseML, run a setup integration, and run training commands.

Clone SparseML

Navigate to and under the Code button, choose the preferred cloning method.

Run a Setup Integration

Run the following command in the root directory of the transformers repository folder (cd integrations/huggingface-transformers).


The file will clone the transformers repository with the SparseML integration as a subfolder. After the repo has successfully cloned, transformers and datasets will be installed along with any necessary dependencies.

It is recommended to run Python 3.8 as some of the scripts within the transformers repository require it.

Select a Pre-Sparsified Model

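You will use a 12-layer BERT model sparsified to 80% on the Wikitext and BookCorpus datasets. As a good tradeoff between inference performance and accuracy, this 12-layer model gives 3.9x better throughput while recovering close to the dense baseline for most transfer tasks. The SparseZoo stub for this model, as passed to --model_name_or_path in the transfer learning command, is:

zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none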


It will be used to select the model in the training commands. Recall that you are not limited in your model selection, but our goal is to enable your success with a guided experience. Later, you may want to explore other options available in SparseZoo.
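As a rough illustration of what "sparsified to 80%" means, the sketch below zeroes the smallest-magnitude 80% of a toy weight list and measures the resulting sparsity. Magnitude pruning is used here only as a stand-in, since this guide does not say which pruning algorithm produced the SparseZoo model. The function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries so that `sparsity` of them are zero."""
    to_zero = int(round(len(weights) * sparsity))
    # Indices ordered from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in order[:to_zero]:
        pruned[index] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.4, -0.03, 0.08, -0.6]
pruned = magnitude_prune(weights, sparsity=0.8)
print(pruned, sparsity_of(pruned))
```

Only the two largest-magnitude weights survive in this toy case. A real model applies the same idea across millions of parameters, usually gradually over many training steps.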

Create a Dense Teacher

Distillation works very well for BERT, and for NLP in general, to create highly sparse and accurate models for deployment. For this reason, you will create a dense teacher model before applying sparse transfer learning. Note that sparse models can be transferred without distillation from a dense teacher; however, the end model’s accuracy will be lower.
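For intuition about how a teacher guides a student, here is a minimal sketch of a common distillation loss: a blend of the usual hard-label loss and a temperature-softened divergence from the teacher's predictions. This is a generic formulation, not necessarily the one the recipe uses. The names distillation_loss, temperature, and hardness are chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, hardness=0.5):
    """Blend the hard-label loss with a temperature-softened teacher loss."""
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Scaling by temperature**2 is a common convention that keeps the soft
    # term on a scale comparable to the hard term.
    return hardness * hard + (1.0 - hardness) * (temperature ** 2) * soft


print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0))
```

When the student's logits match the teacher's, the soft term vanishes and only the hard-label term remains.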

In your training environment, run this training command for the dense teacher:

python transformers/examples/pytorch/question-answering/ --model_name_or_path bert-base-uncased --dataset_name squad --do_train --do_eval --evaluation_strategy epoch --per_device_train_batch_size 16 --learning_rate 5e-5 --max_seq_length 384 --doc_stride 128 --output_dir models/teacher --num_train_epochs 2 --seed 2021

Note that the batch size may need to be lowered depending on the available GPU memory. If you run out of memory or experience an initial crash, try to lower the batch size to remedy the issue.

The training command should run to completion in less than 12 hours. Once the command has completed, you will have a trained dense teacher model located in models/teacher.

Transfer Learn the Model

With the dense teacher trained to convergence, you will begin sparse transfer learning with distillation, driven by a recipe. The dense teacher distills knowledge into the sparse architecture, improving its performance while ideally converging to the dense solution’s accuracy. The recipe encodes the hyperparameters necessary for transfer learning the sparse architecture. Specifically, it ensures that the sparsity is preserved through the training process.
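One way to picture "sparsity is preserved" is a fixed mask that keeps pruned weights at zero after every update. The sketch below shows that idea on a toy weight list. It is an assumption about the general mechanism, not SparseML's implementation, and the function names are hypothetical.

```python
def sparsity_mask(weights):
    """True where a weight is nonzero, False where it was pruned."""
    return [w != 0.0 for w in weights]


def masked_sgd_step(weights, grads, mask, learning_rate):
    """Apply a plain SGD update, then force pruned positions back to zero."""
    return [
        w - learning_rate * g if keep else 0.0
        for w, g, keep in zip(weights, grads, mask)
    ]


weights = [0.9, 0.0, 0.0, -0.7, 0.0]
mask = sparsity_mask(weights)
grads = [0.1, -0.4, 0.2, 0.3, -0.5]  # made-up gradients for one step
updated = masked_sgd_step(weights, grads, mask, learning_rate=0.1)
print(updated)
```

Without the mask, the nonzero gradients at the pruned positions would turn those weights nonzero again, and the model would lose the sparsity that makes it fast.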

Run the transfer training command in your training environment:

python transformers/examples/pytorch/question-answering/ --model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none --distill_teacher models/teacher --dataset_name squad --do_train --do_eval --evaluation_strategy epoch --per_device_train_batch_size 16 --learning_rate 5e-5 --max_seq_length 384 --doc_stride 128 --preprocessing_num_workers 16 --output_dir models/12layer_pruned80-none --fp16 --seed 27942 --num_train_epochs 5 --recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none?recipe_type=transfer-SQuAD --save_strategy epoch --save_total_limit 2

The batch size may need to be lowered depending on the available GPU memory. If you run out of memory or experience an initial crash, try to lower the batch size to remedy the issue.

The training command should run to completion in less than 12 hours. Once the command has completed, you will have a sparse checkpoint located in models/12layer_pruned80-none.

Export for Inference

This step loads a checkpoint file of the best weights measured on the validation set, and converts it into the more common inference formats. Then, you can run the file through a compression algorithm to reduce its deployment size and run it in an inference engine such as DeepSparse.

The command below uses the --onnx_export_path argument with the previously created checkpoint to create an ONNX model ready for deployment.

python transformers/examples/pytorch/question-answering/ --model_name_or_path models/12layer_pruned80-none --dataset_name squad --do_eval --per_device_eval_batch_size 64 --max_seq_length 384 --doc_stride 128 --preprocessing_num_workers 16 --output_dir models/12layer_pruned80-none/eval --onnx_export_path models/12layer_pruned80-none/onnx

The result is a new folder containing the ONNX export of the model located under models/12layer_pruned80-none/onnx.
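To see how much a compression algorithm reduces the exported file, you could gzip it and compare sizes. The sketch below uses only the Python standard library. The gzip_file helper is hypothetical, gzip is just one possible choice of algorithm, and the file name model.onnx is assumed from the inference example in this guide.

```python
import gzip
import os
import shutil


def gzip_file(source_path, target_path=None):
    """Write a gzip copy of source_path; return (original_bytes, compressed_bytes)."""
    target_path = target_path or source_path + ".gz"
    with open(source_path, "rb") as src, gzip.open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(source_path), os.path.getsize(target_path)


# Path from the export step; adjust the file name if your export differs.
onnx_path = "models/12layer_pruned80-none/onnx/model.onnx"
if os.path.exists(onnx_path):
    original, compressed = gzip_file(onnx_path)
    print(f"{original} bytes -> {compressed} bytes")
```

Pruned weights are stored as long runs of zeros, which is why sparse models tend to compress much better than dense ones.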

3. Inference Sparse Question Answering with BERT

Inference involves using a Neural Magic pipeline (from Step 1) and the exported ONNX model (from Step 2) with DeepSparse so you can call into the deployment with your data. The DeepSparse Engine is optimized to speed up sparse models on CPUs.

The question answering pipeline through Hugging Face is built into the DeepSparse Python API. The following code illustrates how this is done and should be used on your deployment machine.

To ensure all requirements and dependencies are installed, please run the following:

pip install deepsparse[transformers]

Once requirements are installed, the following code is used to run a QA pipeline through DeepSparse.

from deepsparse.transformers import pipeline

qa_pipeline = pipeline("question-answering", model_path="models/12layer_pruned80-none/onnx/model.onnx")

my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")
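When calling into the deployment with your own data, you will usually have many question and context pairs rather than one. The sketch below loops any question-answering callable over a list of pairs. Here answer_all is a hypothetical helper and fake_qa is a stand-in so the example runs without a model. In practice you would pass qa_pipeline instead, and the shape of each result depends on your DeepSparse version.

```python
def answer_all(qa, items):
    """Run a question-answering callable over (question, context) pairs."""
    results = []
    for question, context in items:
        results.append({
            "question": question,
            "result": qa(question=question, context=context),
        })
    return results


def fake_qa(question, context):
    """Hypothetical stand-in for qa_pipeline: returns the last word."""
    return context.rstrip(".").split()[-1]


items = [
    ("What's my name?", "My name is Snorlax"),
    ("Where do I live?", "I live in Kanto."),
]
for row in answer_all(fake_qa, items):
    print(row["question"], "->", row["result"])
```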

Now that you have worked through the entire guided experience, you may want to explore more resources below. 

Guided Experience Resources for Question Answering with BERT


  • SparseZoo Overview
  • SparseZoo Model Stubs
  • A specific SparseZoo stub was used to select the model in the training commands. Additional BERT models, including ones with higher sparsity and fewer layers, are available in the SparseZoo and can be substituted for the 12-layer 80% sparse model for better performance or recovery.
  • The question answering use case with the Stanford Question Answering Dataset (SQuAD) was used in this end-to-end experience. To apply the same approach to your own dataset, Hugging Face has additional information for the setup of custom datasets. Once you have successfully converted your dataset into Hugging Face’s format, it can be safely plugged into these flows and used for sparse transfer learning from the pre-sparsified models.


DeepSparse Engine
