Try it Now:
Sparse Binary Text Classification with BERT
Using sparsity and your data to maximize CPU speeds
While this guide is specific to one NLP use case, you can apply these steps to any of your other language processing needs such as question answering, token classification, and text classification.
Introduction to Sparse Binary Text Classification
Set your sights on success with this end-to-end binary text classification experience. See how a Neural Magic sparse model simplifies the sparsification process and results in up to 14x faster and 4.1x smaller models. For the model used in this experience, you can achieve an 8.1x speedup over your current dense model while recovering to the same accuracy. Other model variations with different tradeoffs between performance and accuracy are available in the SparseZoo.
Sparsifying involves removing redundant information from neural networks using algorithms such as pruning and quantization, among others. This sparsification process results in faster inference and smaller file sizes for deployments. Neural Magic creates models and recipes that allow anyone to plug in their data and leverage a recipe-driven approach on top of Hugging Face’s robust training pipelines for the popular BERT NLP network.
In this end-to-end experience, you will:
- Start from a Neural Magic pre-trained BERT model in the SparseZoo,
- Apply a private dataset using sparse transfer learning with SparseML,
- Deploy on a CPU with the DeepSparse Engine.
A sparse-quantized model that recovers to within 99% of the baseline model has been selected in this experience. If you would like to prioritize more performance with less recovery, you may use other models in the SparseZoo. You are not limited in your model selection, but our goal is to enable your success with a guided experience.
You will apply a “binary text classification” use case with the Quora Question Pairs (QQP) dataset. QQP is made up of potential question pairs from Quora with a boolean label representing whether or not the questions are duplicates.
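If you want to peek at the data before starting, QQP is available through Hugging Face's datasets library. The quick sketch below is optional and assumes the datasets package is installed (pip install datasets):

from datasets import load_dataset

# Each QQP example contains question1, question2, and a label (1 = duplicate, 0 = not a duplicate).
qqp = load_dataset("glue", "qqp")
print(qqp["train"][0])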
Three basic steps will take you through a QQP use case to try out a sparsified BERT model.
For Neural Magic Support, sign up or log in to get help with your questions in our Deep Sparse Community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.
What You Need for Sparse Binary Text Classification
The hardware you need for each step is:
- Benchmark and Inference―A CPU with at least the AVX2 instruction set; AVX-512 and VNNI instructions provide additional performance (a quick way to check is shown after this list).
- Transfer Learning―A CUDA and PyTorch-compatible GPU. It is recommended to have one with a memory of at least 16GB.
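One quick way to check which of these instruction sets a Linux deployment machine supports is to inspect the CPU flags (a convenience check, not a required step):

grep -o 'avx2\|avx512\|vnni' /proc/cpuinfo | sort -u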
The DeepSparse Engine is tested on Python 3.6-3.9 and ONNX 1.5.0+. It is recommended to install in a virtual environment to keep your system in order.
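For example, a minimal virtual environment setup might look like the following (the environment name is arbitrary):

python3 -m venv deepsparse_env
source deepsparse_env/bin/activate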
Step 1: Benchmark
Benchmarking lets you compare models for both accuracy and inference performance. In this step, you will install the DeepSparse Engine, select a model, and benchmark. This will allow you to test the performance of Neural Magic’s sparsified BERT models on your deployment hardware to validate that it fulfills your requirements.
DeepSparse Engine Installation
At the command line, install the DeepSparse Engine with pip on your desired deployment environment:
pip install deepsparse
Note: Hugging Face’s Transformers library will not immediately install with this command. Instead, a sparsification-compatible version of Transformers will install on first invocation of the Transformers code in DeepSparse.
The DeepSparse installation additionally provides the CLI deepsparse.benchmark.
Use the help argument to see the full list of options for benchmarking in DeepSparse:
deepsparse.benchmark --help
The benchmark CLI defaults to batch size 1, the sequence length from the ONNX model (128 in this case), and the multi-stream (asynchronous) scenario. To override these defaults and test out different configurations, use the following arguments; a combined example follows the list:
- Set batch size to 32:
--batch_size 32
- Set input shape to sequence length 384:
--input_shapes "[1,384]"
- Set benchmark type to synchronous:
--scenario sync
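For example, one possible combination of these flags, reusing the sparse-quantized model stub benchmarked later in this step, is:

deepsparse.benchmark --scenario sync --input_shapes "[1,128],[1,128],[1,128]" "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95"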
We will use a sample question answering BERT model from the SparseZoo to test performance. This model will achieve approximately the same performance as the one you will create later when transferring to your dataset. You can view other available models in the SparseZoo that can be used instead of the default ones mentioned here.
Dense Baseline Performance
First, we’ll benchmark a dense BERT model to establish baseline performance using the following command:
deepsparse.benchmark --input_shapes "[1,128],[1,128],[1,128]" "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none"
Running on a c5.12xlarge (24 CPU cores) AWS instance achieves 72.0 items/sec. The full output is given below, using DeepSparse Engine version 0.10.0:
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (c2458ea3) (release) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
onnx_file_path: /home/ubuntu/.cache/sparsezoo/fb3c7ab5-b66b-4965-82f4-115480d58be0/model.onnx
batch_size: 1
num_cores: 24
scheduler: Scheduler.multi_stream
cpu_avx_type: avx512
cpu_vnni: True
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 128]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 128]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 128]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 12. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 72.0128
Latency Mean (ms/batch): 166.5793
Latency Median (ms/batch): 166.6827
Latency Std (ms/batch): 3.5071
Iterations: 732
Sparse-Quantized Performance
To compare with the baseline, we’ll benchmark an 80% sparse-quantized BERT model, which recovers to within 99% of the baseline accuracy, using the following command:
deepsparse.benchmark --input_shapes "[1,128],[1,128],[1,128]" "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95"
Running on a c5.12xlarge (24 CPU cores) AWS instance achieves 582.7 items/sec, an 8.1x speedup over the dense model. The full output is given below, using DeepSparse Engine version 0.10.0:
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (c2458ea3) (release) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
onnx_file_path: /home/ubuntu/.cache/sparsezoo/b3125c89-540d-4fc6-842e-808383fa0b63/model.onnx
batch_size: 1
num_cores: 24
scheduler: Scheduler.multi_stream
cpu_avx_type: avx512
cpu_vnni: True
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 128]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 128]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 128]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 12. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 582.6930
Latency Mean (ms/batch): 20.5589
Latency Median (ms/batch): 20.5603
Latency Std (ms/batch): 0.4551
Iterations: 5833
Step 2: Apply Your Own Data via Transfer Learning
The second step is transferring Neural Magic’s sparsified BERT models to your dataset. Neural Magic hosts many models, including BERT models sparsified on an English text corpus with a masked language modeling training scheme. Using SparseML and recipes, these sparse models can then be fine-tuned on your dataset just as you would fine-tune a dense model.
Training is the process of feeding a machine learning algorithm with data to help identify and learn good values for all attributes involved. Neural Magic sparse models simplify the optimization process by enabling sparse transfer learning to create highly accurate pruned BERT models.
In this step, you will install SparseML and run training commands.
Install SparseML with PyTorch
First, install SparseML with PyTorch on your desired deployment environment:
pip install sparseml[torch]
Note: Transformers will not immediately install with this command. Instead, a sparsification-compatible version of Transformers will install on the first invocation of the Transformers code in SparseML.
The SparseML installation also provides the following CLI for this use case; appending the help argument will provide a full list of options for training in SparseML:
sparseml.transformers.text_classification --help
Standard Arguments
All SparseML Transformers training CLIs contain standard arguments to enable sparsification and sparse transfer learning on standard models like BERT. The arguments are:
- --output_dir: The directory in which to store the outputs from the training runs such as results, the trained model, and supporting files.
- --model_name_or_path: The path or SparseZoo stub for the model to load for training.
- --recipe: The path or SparseZoo stub for the recipe to use to apply sparsification algorithms or sparse transfer learning to the model.
- --distill_teacher: The path or SparseZoo stub for the teacher to load for distillation.
- --dataset_name or --task_name: The dataset or task to load for training.
All commands and hyperparameters are designed for a single GPU with a minimum of 16GB of memory. If you run into out-of-memory exceptions, set --gradient_accumulation_steps 2 and lower the train and eval batch sizes by half. Applying these changes will increase training time and lower the total memory required while keeping the effective batch size the same.
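For instance, with that adjustment applied to the training commands below, the relevant flags would become the following (all other arguments stay the same):

--gradient_accumulation_steps 2 --per_device_train_batch_size 16 --per_device_eval_batch_size 16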
Custom Data
The example given below uses a public dataset to demonstrate the methods. To work with your custom dataset, confirm that it conforms to Hugging Face’s dataset standards for Transformers compatibility; more information is available in Hugging Face’s datasets documentation. You can then replace the dataset and task arguments in the training commands with the --train_file and --validation_file arguments.
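As an illustrative sketch (the file paths are placeholders and must point to files in Hugging Face’s expected format), the swap in the training commands below would look like this:

# replace the task argument:
--task_name qqp
# with your own files:
--train_file data/train.csv --validation_file data/validation.csv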
Train with Binary Text Classification
As an example for text classification, we document how to transfer a sparse model to the QQP dataset for a binary text classification task, where the sparse model achieves 99% of the dense baseline's accuracy. See the previous section on custom data to adapt these commands to your own dataset with the sparse transfer learning pipelines. Additionally, to enable high levels of recovery, you must use distillation.
Dense Teacher Creation
Distillation works very well for BERT and NLP in general to create highly sparse and accurate models for deployment. With this in mind, you will create a dense teacher model before applying sparse transfer learning. Note that sparse models can be transferred without distillation from the dense teacher; however, the end model’s accuracy will be lower.
To enable distillation, you will first create a dense teacher model that the sparse model will learn from while transferring. If you already have a Transformers-compatible model, you can use this as the dense teacher in place of training one from scratch. The following command will use the dense BERT base model from the SparseZoo and fine-tune it on the QQP dataset, resulting in a model that achieves 90.84% accuracy on the validation set:
sparseml.transformers.text_classification \
--output_dir models/teacher \
--model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none \
--recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none?recipe_type=transfer-text_classification \
--recipe_args '{"init_lr":0.00003}' \
--task_name qqp --max_seq_length 128 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --preprocessing_num_workers 6 \
--do_train --do_eval --evaluation_strategy epoch --fp16 --seed 42 \
--save_strategy epoch --save_total_limit 1
The training command should run to completion in less than 12 hours. Once the command has completed, you will have a trained dense teacher model located in models/teacher.
Transfer Learn the Model
With the dense teacher trained to convergence, you will begin the sparse transfer learning with distillation with a recipe. The dense teacher will distill knowledge into the sparse architecture, therefore increasing its performance while ideally converging to the dense solution's accuracy. The recipe encodes the hyperparameters necessary for transfer learning the sparse architecture. Specifically, it ensures that the sparsity is preserved through the training process.
Run the transfer training command, given below, in your training environment. The training command should run to completion in less than 12 hours. Once the command has completed, you will have a sparse checkpoint located in models/sparse_quantized.
Quantization-Compatible Deployment Environment
Use the Sparse-Quantized Performance subsection under the Benchmark section to check the deployment environment’s compatibility for quantization. The following command will use the 80% sparse-quantized BERT model from the SparseZoo and fine-tune it on the QQP dataset, resulting in a model that achieves 90.84% accuracy on the validation set. Keep in mind that the --distill_teacher argument is set to pull a dense QQP model from the SparseZoo so that the command can run independently of the dense teacher step. If you trained a dense teacher, change this out for the path to your model folder:
sparseml.transformers.text_classification \
--output_dir models/sparse_quantized \
--model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni \
--recipe "zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-text_classification" \
--recipe_args '{"init_lr":0.00005}' \
--distill_teacher zoo:nlp/text_classification/bert-base/pytorch/huggingface/qqp/base-none \
--task_name qqp --max_seq_length 128 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --preprocessing_num_workers 6 \
--do_train --do_eval --evaluation_strategy epoch --fp16 --seed 2819 \
--save_strategy epoch --save_total_limit 1
Step 3: Export and Deploy
Exporting for Inference
Once the model has been trained on the desired dataset, it must be exported into a deployment-friendly format. ONNX is a generic neural network definition that enables compact representations of models. The DeepSparse Engine uses the ONNX format to load neural networks and then deliver breakthrough performance for CPUs by leveraging the sparsity and quantization within a network.
To deploy using DeepSparse, you will first export the trained BERT model to an ONNX format. The SparseML installation additionally provides a sparseml.transformers.export_onnx command. You will use this to load the training model folder and create a new model.onnx file within it. Be sure the --model_path argument points to your trained model. By default, it is set to the result of transfer learning the sparse-quantized BERT model: --model_path "models/sparse_quantized".
sparseml.transformers.export_onnx \
--model_path models/sparse_quantized \
--task 'text-classification' --finetuning_task qqp \
--sequence_length 128
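As an optional sanity check, you can confirm that the exported file is valid ONNX before deploying. This minimal sketch assumes the default models/sparse_quantized output path used above:

import onnx

# Load the exported model and run the ONNX structural checker.
model = onnx.load("models/sparse_quantized/model.onnx")
onnx.checker.check_model(model)
print("ONNX export passed the structural check")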
Deploying
Now that the model is in an ONNX format, it is ready for deployment with the DeepSparse Engine. Once DeepSparse is installed on your deployment environment from Step 1 (pip install deepsparse), two options are supported for deployment: a Python API that fits into current deployment pipelines, and an HTTP server that enables a no-code solution.
Python API
The Python code below gives an example of using the DeepSparse Python pipeline API for this use case. The code is set up to run independently of the prior stages. Be sure to change out the model_path argument for the model folder of your trained model: model_path="models/sparse_quantized".
from deepsparse.transformers import pipeline

# Create a text classification pipeline backed by the DeepSparse Engine.
text_pipeline = pipeline(
    task="text-classification",
    model_path="zoo:nlp/text_classification/bert-base/pytorch/huggingface/qqp/base-none",
)

# Classify a QQP-style question pair as duplicate or not.
classification = text_pipeline(
    [["What are natural numbers", "What is a least natural number"]]
)
print(classification)
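To run the same pipeline against the model you trained and exported in the previous steps, point model_path at your model folder instead of the SparseZoo stub. This assumes the default models/sparse_quantized output from the export step:

local_pipeline = pipeline(
    task="text-classification",
    model_path="models/sparse_quantized",
)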
HTTP Server
To use the DeepSparse Server, first install the required dependencies using pip:
pip install deepsparse[server]
Once installed, the CLI command given below for serving a BERT model is available. The command is set up to run independently of the prior stages. Be sure to change out the model_path argument for the model folder of your trained model: --model_path "models/sparse_quantized". Once launched, you can view information about the server and the available APIs at http://0.0.0.0:5543 on the deployment machine.
deepsparse.server --task text_classification --batch_size 1 --model_path "zoo:nlp/text_classification/bert-base/pytorch/huggingface/qqp/base-none"
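As a minimal client sketch, you can send a question pair to the running server with Python's requests library. The /predict route and the sequences field are assumptions based on the pipeline input shown above; confirm the exact schema in the API docs exposed at http://0.0.0.0:5543:

import requests

# Send a QQP-style question pair to the DeepSparse Server and print the classification.
response = requests.post(
    "http://0.0.0.0:5543/predict",
    json={"sequences": [["What are natural numbers", "What is a least natural number"]]},
)
print(response.json())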
Summary
Neural Magic’s DeepSparse Engine and SparseML products are integrated with Hugging Face’s Transformers library to enable sparsified BERT and other Transformer models, resulting in faster, smaller, and cheaper deployable models. DeepSparse is integrated to enable easy deployments and benchmarking of Hugging Face Transformer models, while SparseML is integrated to enable easy training and model sparsification/optimization.
Now that you have worked through the entire guided experience, you may want to explore more resources below.
Guided Experience Resources for Sparse Binary Text Classification
SparseZoo
- SparseZoo Overview
- SparseZoo Model Stubs
- A specific SparseZoo stub was used to select the model in the training commands. Additional BERT models, including ones with higher sparsity and fewer layers, are found on the SparseZoo and can be substituted for the 12-layer 80% sparse model for better performance or recovery.
- The binary text classification use case with the Quora Question Pairs (QQP) dataset was used in this end-to-end experience. To apply the same approach to your own dataset, Hugging Face has additional information for the setup of custom datasets. Once you have successfully converted your dataset into Hugging Face’s format, it can be safely plugged into these flows and used for sparse transfer learning from the pre-sparsified models.