Try it Now:
Sparse Question Answering with BERT
Using sparsity and your data for maximizing CPU speeds
While this guide covers a single NLP use case, question answering, you can apply the same steps to your other language processing needs such as token classification and text classification.
Introduction to Sparse Question Answering
Set your sights on success with this end-to-end question answering experience. See how a Neural Magic sparse model simplifies the sparsification process and results in up to 14x faster and 4.1x smaller models. For the model used in this experience, you can achieve a 7x speedup over your current dense model while recovering to the same accuracy. Other model variations with different tradeoffs between performance and accuracy are available in the SparseZoo.
Sparsifying involves removing redundant information from neural networks using algorithms such as pruning and quantization, among others. This sparsification process results in faster inference and smaller file sizes for deployments. Neural Magic creates models and recipes that allow anyone to plug in their data and leverage a recipe-driven approach on top of Hugging Face’s robust training pipelines for the popular BERT NLP network.
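To make the two techniques named above concrete, here is a minimal sketch of magnitude pruning and symmetric int8 quantization on a toy list of weights. It is a generic illustration, not Neural Magic's recipe implementation; the function names and sample weights are made up for this example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    num_to_zero = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:num_to_zero])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


# Hypothetical weights, chosen only for illustration.
weights = [0.9, -0.05, 0.02, -0.7, 0.01, 0.3, -0.02, 0.04, 0.6, -0.01]

pruned = magnitude_prune(weights, sparsity=0.8)
quantized, scale = quantize_int8(pruned)
restored = [q * scale for q in quantized]

print("zeroed weights:", pruned.count(0.0), "of", len(pruned))
print("int8 values:", quantized)
print("max round-trip error:", max(abs(a - b) for a, b in zip(pruned, restored)))
```

Pruning shrinks the number of non-zero values an engine has to process, while quantization shrinks the storage per value; real recipes apply both gradually during training so accuracy can recover.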
In this end-to-end experience, you will:
- Start from a Neural Magic pre-trained BERT model in the SparseZoo,
- Apply a private dataset using sparse transfer learning with SparseML,
- Deploy on a CPU with the DeepSparse Engine.
This experience uses a sparse-quantized model that recovers to 99% of the baseline model’s accuracy. If you would like to prioritize more performance with less recovery, you may use other models in the SparseZoo. You are not limited in your model selection, but our goal is to enable your success with a guided experience.
You will apply a “question and answering” use case with the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage.
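Because every SQuAD answer is a span of its passage, each record stores the answer text together with its character offset. The sketch below shows one record shaped like the Hugging Face "squad" dataset and checks that the offset really points at the answer. The record itself is illustrative and not taken from SQuAD, and the helper name is made up for this example.

```python
# A single record with the same fields as the Hugging Face "squad" dataset.
# The text is illustrative and not an actual SQuAD entry.
example = {
    "id": "example-0",
    "title": "Snorlax",
    "context": "My name is Snorlax",
    "question": "What's my name?",
    "answers": {"text": ["Snorlax"], "answer_start": [11]},
}


def answer_spans(record):
    """Return the context slices that the stored answer offsets point to."""
    context = record["context"]
    answers = record["answers"]
    return [
        context[start:start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


spans = answer_spans(example)
print(spans)
```

A check like this is a cheap way to validate a custom dataset before training: if a slice does not match its answer text, the offsets are wrong.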
Three basic steps will take you through the SQuAD use case to try out a sparsified BERT model.
For Neural Magic Support, sign up or log in to get help with your questions in our Deep Sparse Community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.
What You Need for Sparse Question Answering
The hardware you need for each step is:
- Benchmark and Inference: A CPU that supports at least the AVX2 instruction set. AVX-512 and VNNI instructions give more performance.
- Transfer Learning: A CUDA and PyTorch-compatible GPU, ideally with at least 16GB of memory.
The DeepSparse Engine is tested on Python 3.6-3.9 and ONNX 1.5.0+. It is recommended to install in a virtual environment to keep your system in order.
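The CPU and Python requirements above can be checked before installing anything. The following is a minimal sketch that assumes a Linux machine, where CPU capabilities are listed on the "flags" lines of /proc/cpuinfo; the helper names are hypothetical, and other operating systems need a different lookup.

```python
import os
import sys


def cpu_features(cpuinfo_text):
    """Report which instruction sets appear in Linux /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {
        "avx2": "avx2" in flags,
        "avx512": "avx512f" in flags,
        "vnni": "avx512_vnni" in flags,
    }


def python_supported(version_info=sys.version_info):
    """True for the Python versions this guide lists as tested (3.6 to 3.9)."""
    return (3, 6) <= tuple(version_info[:2]) <= (3, 9)


# Only read the file when it exists, so the script also runs off Linux.
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as handle:
        print(cpu_features(handle.read()))
print("Python in tested range:", python_supported())
```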
Step 1: Benchmark
Benchmarking lets you compare models for both accuracy and inference performance. In this step, you will install the DeepSparse Engine, select a model, and benchmark. This will allow you to test the performance of Neural Magic’s sparsified BERT models on your deployment hardware to validate that it fulfills your requirements.
DeepSparse Engine Installation
At the command line, install the DeepSparse Engine with pip on your desired deployment environment:
pip install deepsparse
Note: Hugging Face’s Transformers library will not immediately install with this command. Instead, a sparsification-compatible version of Transformers will install on first invocation of the Transformers code in DeepSparse.
The DeepSparse installation additionally provides the deepsparse.benchmark CLI. Run it with the help argument (deepsparse.benchmark --help) to see the full list of options for benchmarking in DeepSparse.
The benchmark CLI will default to batch size 1, sequence length from the ONNX model – in this case 384, and multi-stream (asynchronous). To override these defaults and test out different configurations, use the following arguments:
- Set batch size to 32:
- Set input shape to sequence length 128:
- Set benchmark type to synchronous:
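As a rough sketch of the three overrides listed above, the snippet below assembles the corresponding command lines without running them. The flag names --batch_size, --input_shapes, and --scenario, the value formats, and the positional model argument are assumptions here; confirm them against the help output of your installed DeepSparse version before use.

```python
import shlex

# SparseZoo stub of the dense model named in this guide's benchmark output.
MODEL_STUB = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none"


def benchmark_command(model, batch_size=None, input_shapes=None, scenario=None):
    """Build a deepsparse.benchmark command line as a list of arguments.

    The flag names are assumptions; confirm them with the CLI help output.
    """
    command = ["deepsparse.benchmark", model]
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    if input_shapes is not None:
        command += ["--input_shapes", input_shapes]
    if scenario is not None:
        command += ["--scenario", scenario]
    return command


variants = {
    "batch size 32": benchmark_command(MODEL_STUB, batch_size=32),
    "sequence length 128": benchmark_command(MODEL_STUB, input_shapes="[1,128]"),
    "synchronous": benchmark_command(MODEL_STUB, scenario="sync"),
}
for label, command in variants.items():
    # shlex.quote keeps values such as "[1,128]" safe to paste into a shell.
    print(label + ":", " ".join(shlex.quote(part) for part in command))
```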
We will use a sample question answering BERT model from the SparseZoo to test performance. This model will achieve approximately the same performance as the one you will create later when transferring to your dataset. You can view other available models in the SparseZoo that can be used instead of the default ones mentioned here.
Dense Baseline Performance
First, we’ll benchmark a dense BERT model to establish baseline performance by passing its SparseZoo stub (shown as Original Model Path in the output below) to deepsparse.benchmark.
On an AWS c5.12xlarge instance (24 CPU cores), this achieves 21.7 items/sec. The full output is given below, using DeepSparse Engine version 0.10:
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (c2458ea3) (release) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/ubuntu/.cache/sparsezoo/fb3c7ab5-b66b-4965-82f4-115480d58be0/model.onnx
    batch_size: 1
    num_cores: 24
    scheduler: Scheduler.multi_stream
    cpu_avx_type: avx512
    cpu_vnni: True
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 12. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 21.7238
Latency Mean (ms/batch): 551.6047
Latency Median (ms/batch): 546.4379
Latency Std (ms/batch): 22.6294
Iterations: 228
Sparse-Quantized Performance
To compare with the baseline, we’ll benchmark an 80% sparse-quantized BERT model, which recovers to 99% of the baseline accuracy, by passing its SparseZoo stub (shown as Original Model Path in the output below) to deepsparse.benchmark.
On an AWS c5.12xlarge instance (24 CPU cores), this achieves 153.9 items/sec, a 7x speedup over the dense model. The full output is given below, using DeepSparse Engine version 0.10:
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (c2458ea3) (release) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
    onnx_file_path: /home/ubuntu/.cache/sparsezoo/b3125c89-540d-4fc6-842e-808383fa0b63/model.onnx
    batch_size: 1
    num_cores: 24
    scheduler: Scheduler.multi_stream
    cpu_avx_type: avx512
    cpu_vnni: True
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 12. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 153.8851
Latency Mean (ms/batch): 77.7749
Latency Median (ms/batch): 77.0535
Latency Std (ms/batch): 9.0455
Iterations: 1544
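The speedup quoted for the sparse-quantized model can be recomputed from the summary lines of the two benchmark outputs. This is a small sketch with a hypothetical helper, read_metric, that pulls a number out of the text; the values are copied from the outputs shown in this guide. Both ratios come out at roughly 7.1x.

```python
import re

# Summary lines copied from the dense and sparse-quantized benchmark outputs.
DENSE_OUTPUT = """
Throughput (items/sec): 21.7238
Latency Mean (ms/batch): 551.6047
"""
SPARSE_OUTPUT = """
Throughput (items/sec): 153.8851
Latency Mean (ms/batch): 77.7749
"""


def read_metric(output, name):
    """Extract a numeric metric such as 'Throughput (items/sec)' from the text."""
    match = re.search(re.escape(name) + r":\s*([0-9.]+)", output)
    if match is None:
        raise ValueError(f"metric not found: {name}")
    return float(match.group(1))


# Higher throughput is better, so divide sparse by dense.
throughput_gain = (
    read_metric(SPARSE_OUTPUT, "Throughput (items/sec)")
    / read_metric(DENSE_OUTPUT, "Throughput (items/sec)")
)
# Lower latency is better, so divide dense by sparse.
latency_gain = (
    read_metric(DENSE_OUTPUT, "Latency Mean (ms/batch)")
    / read_metric(SPARSE_OUTPUT, "Latency Mean (ms/batch)")
)
print(f"throughput speedup: {throughput_gain:.2f}x")
print(f"latency speedup: {latency_gain:.2f}x")
```

Running the same comparison on the output from your own hardware shows whether the sparse model meets your deployment requirements.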
Step 2: Apply Your Own Data via Transfer Learning
The second step is transferring Neural Magic’s sparsified BERT models to your dataset. Neural Magic hosts many models, including BERT models sparsified on an English text corpus with a masked language modeling training scheme. Using SparseML and recipes, you can then fine-tune these sparse models on your dataset just as you normally would with a dense model.
Training is the process of feeding a machine learning algorithm with data to help identify and learn good values for all attributes involved. Neural Magic sparse models simplify the optimization process by enabling sparse transfer learning to create highly accurate pruned BERT models.
In this step, you will install SparseML and run training commands.
Install SparseML with PyTorch
First, install SparseML with PyTorch on your training environment:
pip install sparseml[torch]
Note: Transformers will not immediately install with this command. Instead, a sparsification-compatible version of Transformers will install on the first invocation of the Transformers code in SparseML.
The SparseML installation also provides the sparseml.transformers.question_answering CLI for this use case; running it with the help argument (--help) prints the full list of options for training in SparseML.
All SparseML Transformers training CLIs contain standard arguments to enable sparsification and sparse transfer learning on standard models like BERT. The arguments are:
- --output_dir: The directory in which to store the outputs from the training runs such as results, the trained model, and supporting files.
- --model_name_or_path: The path or SparseZoo stub for the model to load for training.
- --recipe: The path or SparseZoo stub for the recipe to use to apply sparsification algorithms or sparse transfer learning to the model.
- --distill_teacher: The path or SparseZoo stub for the teacher to load for distillation.
- --dataset_name or --task_name: The dataset or task to load for training.
All commands and hyperparameters are designed for a single GPU with a minimum of 16GB of memory. If you run into out-of-memory exceptions, set --gradient_accumulation_steps 2 and lower the train and eval batch sizes by half. Applying these changes will increase training time and lower the total memory required while keeping the effective batch size the same.
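The arithmetic behind that advice is short enough to check directly. In this sketch the effective batch size is taken as the per-device batch size times the number of gradient accumulation steps times the number of devices; the helper name is made up, and the value 16 matches the train batch size used in this guide's commands.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps=1, num_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Settings as given in the training commands: batch size 16, no accumulation.
original = effective_batch_size(16, gradient_accumulation_steps=1)

# Out-of-memory fallback: half the batch size, two accumulation steps.
reduced = effective_batch_size(16 // 2, gradient_accumulation_steps=2)

print("original:", original)
print("reduced memory:", reduced)
```

Each forward and backward pass now holds half as many samples in GPU memory, and twice as many passes are needed per update, which is where the longer training time comes from.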
The example given below uses a public dataset to demonstrate the methods. To work with your custom dataset, confirm that the dataset conforms to Hugging Face’s dataset standards for Transformers compatibility. Hugging Face’s documentation on setting up custom datasets has more information.
You can then replace the dataset and task arguments in the training commands with the
Train with Question Answering
As an example for question answering, we document how to transfer a sparse model to the SQuAD dataset, where the sparse model achieves 99% of the dense baseline. See the previous section on custom data to fit these commands and your dataset to the sparse transfer learning pipelines. Additionally, to enable high levels of recovery, you must use distillation.
Dense Teacher Creation
Distillation works very well for BERT and NLP in general to create highly sparse and accurate models for deployment. Following this sentiment, you will create a dense teacher model before applying sparse transfer learning. Note that sparse models can be transferred without using distillation from the dense teacher; however, the end model’s accuracy will be lower.
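For readers new to distillation, the following is a minimal sketch of a commonly used formulation: the student's loss blends ordinary cross-entropy against the true label with a KL-divergence term that pulls its softened predictions toward the teacher's. This is a generic illustration and not SparseML's implementation; the function names, the temperature and hardness parameters, and the toy logits are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives softer output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, hardness=0.5):
    """Blend the hard-label loss with a soft loss against the teacher."""
    # Hard loss: cross-entropy between the student and the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL(teacher || student) on temperature-softened distributions,
    # scaled by temperature squared as is conventional.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q) for p, q in zip(teacher, student)) * temperature ** 2
    return hardness * hard + (1.0 - hardness) * soft


# Toy logits over three classes; the true class is index 0.
teacher_logits = [2.0, 0.0, -1.0]
agreeing = distillation_loss([2.0, 0.0, -1.0], teacher_logits, label=0)
disagreeing = distillation_loss([-1.0, 0.0, 2.0], teacher_logits, label=0)
print("student agrees with teacher:", agreeing)
print("student disagrees with teacher:", disagreeing)
```

The loss is lower when the student matches the teacher, so training pushes the sparse model toward the dense model's behavior rather than only toward the labels.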
To enable distillation, you will first create a dense teacher model that the sparse model will learn from while transferring. If you already have a Transformers-compatible model, you can use this as the dense teacher in place of training one from scratch. The following command will use the dense BERT base model from the SparseZoo and fine-tune it on the SQuAD dataset, resulting in a model that achieves 0.885 F1 on the validation set:
sparseml.transformers.question_answering \
  --output_dir models/teacher \
  --model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none \
  --recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none?recipe_type=transfer-question_answering \
  --dataset_name squad \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 24 \
  --preprocessing_num_workers 6 \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --fp16 \
  --seed 42 \
  --save_strategy epoch \
  --save_total_limit 1
The training command should run to completion in less than 12 hours. Once the command has completed, you will have a trained dense teacher model located in the directory given by --output_dir (models/teacher in the command above).
Transfer Learn the Model
With the dense teacher trained to convergence, you will begin the sparse transfer learning with distillation with a recipe. The dense teacher will distill knowledge into the sparse architecture, therefore increasing its performance while ideally converging to the dense solution’s accuracy. The recipe encodes the hyperparameters necessary for transfer learning the sparse architecture. Specifically, it ensures that the sparsity is preserved through the training process.
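One common way to preserve sparsity during fine-tuning is to record which weights are zero and force them back to zero after every update. The sketch below illustrates that idea on a plain list of weights. It is not SparseML's implementation, which is driven by the recipe; the function names and numbers are hypothetical.

```python
def sparsity_mask(weights):
    """True for weights that may change, False for pruned (zero) weights."""
    return [w != 0.0 for w in weights]


def masked_sgd_step(weights, gradients, mask, learning_rate=0.1):
    """Apply one gradient step, keeping pruned positions at exactly zero."""
    return [
        (w - learning_rate * g) if keep else 0.0
        for w, g, keep in zip(weights, gradients, mask)
    ]


# Toy layer that is already 60% sparse; the gradients are made up.
weights = [0.9, 0.0, 0.0, -0.7, 0.0]
gradients = [0.2, -0.5, 0.1, 0.3, -0.4]

mask = sparsity_mask(weights)
updated = masked_sgd_step(weights, gradients, mask)
print("before:", weights)
print("after: ", updated)
```

Without the mask, the non-zero gradients at the pruned positions would turn those weights back into non-zero values and the model would gradually lose its sparsity.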
Run the transfer training command in your training environment. The training command should run to completion in less than 12 hours. Once the command has completed, you will have a sparse checkpoint located in the directory given by --output_dir (models/sparse_quantized in the command below).
Quantization-Compatible Deployment Environment
Use the Sparse-Quantized Performance subsection under the Benchmark section to check the deployment environment’s compatibility for quantization. The following command will use the 80% sparse-quantized BERT model from the SparseZoo and fine-tune it on the SQuAD dataset, resulting in a model that achieves 0.885 F1 on the validation set. Keep in mind that the --distill_teacher argument is set to pull a dense SQuAD model from the SparseZoo to enable it to run independent of the dense teacher step. If you trained a dense teacher, change this out for the path to your model folder:
sparseml.transformers.question_answering \
  --output_dir models/sparse_quantized \
  --model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni \
  --recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-question_answering \
  --distill_teacher zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --fp16 \
  --seed 21636 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 24 \
  --preprocessing_num_workers 6 \
  --save_strategy epoch \
  --save_total_limit 1
Step 3: Export and Deploy
Exporting for Inference
Once the model has been trained on the desired dataset, it must be exported into a deployment-friendly format. ONNX is a generic neural network definition that enables compact representations of models. The DeepSparse Engine uses the ONNX format to load neural networks and then deliver breakthrough performance for CPUs by leveraging the sparsity and quantization within a network.
To deploy using DeepSparse, you will first export the trained BERT model to an ONNX format. The SparseML installation additionally provided a sparseml.transformers.export_onnx command. You will use this to load the training model folder and create a new model.onnx file within. Be sure the --model_path argument points to your trained model. By default, it is set to the result from transfer learning a sparse-quantized BERT model:
sparseml.transformers.export_onnx \
  --model_path models/sparse_quantized \
  --task 'question-answering' \
  --sequence_length 384
Now that the model is in ONNX format, it is ready for deployment with the DeepSparse Engine. With DeepSparse installed on your deployment environment as in Step 1 (pip install deepsparse), two options are supported for deployment: the Python API, which fits into current deployment pipelines, and an HTTP server, which enables a no-code solution.
The Python code below gives an example of using the DeepSparse Python pipeline API. The commands are set up to run independently of the prior stages. Be sure to change the model_path argument to the model folder of your trained model:
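from deepsparse.transformers import pipeline

# Load a question answering pipeline from a SparseZoo stub or a local model folder.
answering_pipeline = pipeline(
    task="question-answering",
    model_path="zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none",
)

# Ask a question about the supplied context and print the predicted answer.
answer = answering_pipeline(
    question="What's my name?",
    context="My name is Snorlax",
)
print(answer)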
To use the DeepSparse Server, first install the required dependencies using pip:
pip install deepsparse[server]
Once installed, the CLI command given below for serving a BERT model is available. The command is set up to run independently of the prior stages. To serve your own model, change the --model_path argument to the model folder of your trained model, for example --model_path "models/sparse_quantized". Once launched, you can view information about the server and the available APIs at http://0.0.0.0:5543 on the deployment machine.
deepsparse.server --task question_answering --batch_size 1 --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none"
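Once the server is running, a client can send it a question over HTTP. The sketch below only builds the request using the Python standard library. The /predict route and the JSON field names are assumptions made for illustration; check the API page served at the address above for the routes and payload your DeepSparse version actually exposes.

```python
import json
import urllib.request


def build_request(question, context, base_url="http://0.0.0.0:5543"):
    """Create a POST request for the server; '/predict' is an assumed route."""
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("What's my name?", "My name is Snorlax")
print(request.get_method(), request.full_url)

# Uncomment once the server is running to send the request and print the reply.
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```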
Neural Magic’s DeepSparse Engine and SparseML products are integrated with Hugging Face’s Transformers library to enable sparsified BERT and other Transformer models, resulting in faster, smaller, and cheaper deployable models. DeepSparse is integrated to enable easy deployments and benchmarking of Hugging Face Transformer models, while SparseML is integrated to enable easy training and model sparsification/optimization.
Now that you have worked through the entire guided experience, you may want to explore more resources below.
Guided Experience Resources for Sparse Question Answering
- SparseZoo Overview
- SparseZoo Model Stubs
- A specific SparseZoo stub was used to select the model in the training commands. Additional BERT models, including ones with higher sparsity and fewer layers, are available in the SparseZoo and can be substituted for the 12-layer 80% sparse model for better performance or recovery.
- The question answering use case with the Stanford Question Answering Dataset (SQuAD) was used in this end-to-end experience. To apply the same approach to your own dataset, Hugging Face has additional information for the setup of custom datasets. Once you have successfully converted your dataset into Hugging Face’s format, it can be safely plugged into these flows and used for sparse transfer learning from the pre-sparsified models.