Neural Magic Introduces Sparsity to MLPerf, Boosting CPU Performance 175x


Neural Magic Announces MLPerf Inference Benchmarks, Delivered Purely in Software

Somerville, Massachusetts, September 8, 2022 - Neural Magic, the company leading a new software-delivered AI movement by bringing hyper-performant and scalable ML inferencing to commodity CPU infrastructure, announced today its benchmark results for three Natural Language Processing (NLP) models submitted to the MLPerf Inference Datacenter and Edge v2.1 Open division. The results demonstrate the power of neural network sparsity and sparsity-aware inferencing, and mark an important milestone toward faster, more efficient machine learning execution on commodity CPU resources.

Comparison of CPU benchmarks from 2022 MLPerf Inference v2.1 Datacenter results. DeepSparse Engine is executing a sparse-quantized model. Triton Inference Server (OpenVINO) is executing a quantized model. ONNX Runtime is executing an FP32 baseline model.

“We are excited to partner with MLCommons to introduce the power of sparsity and smarter network execution to the world of machine learning,” said Brian Stevens, CEO of Neural Magic. “Sparsity and sparsity-aware execution is paving the way for efficient machine learning where individuals and businesses alike are able to deliver big model accuracy with all the perks of smaller models, using commodity CPUs that are easier to access and scale.”

Compound Sparsity Paves Way for Efficient ML Execution

Neural Magic’s MLPerf Inference Datacenter and Edge v2.1 submission shows three different methods of optimizing BERT-Large, a highly accurate NLP model that is difficult and often uneconomical to deploy because of its slow inference and large disk footprint.

Enter compound sparsity: the practice of combining several compression methods on a single deep learning model, including unstructured gradual pruning, quantization-aware training, and structural distillation.
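To make the idea concrete, here is a minimal sketch of two of those ingredients, applied one after the other to a toy weight list. It is an illustration only, not Neural Magic's pipeline: the function names are hypothetical, pruning is done in one shot rather than gradually during training, quantization is applied after the fact rather than learned through quantization-aware training, and distillation is left out entirely.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    This is unstructured pruning: individual weights are removed wherever
    they are, with no constraint on rows, columns, or blocks.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in quantized]


# Toy weights, chosen only for illustration.
weights = [0.02, -0.75, 0.10, 0.40, -0.05, 0.90, -0.01, 0.33]

sparse = magnitude_prune(weights, sparsity=0.5)  # half of the weights become zero
quantized, scale = quantize_int8(sparse)         # survivors stored as small integers
restored = dequantize(quantized, scale)          # approximate reconstruction
```

The two steps compound: pruning leaves fewer values to store and multiply, and quantization shrinks each surviving value from 32 bits to 8. Zeros stay exactly zero after quantization, so the sparsity pattern is preserved.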

Neural Magic’s MLPerf Inference Datacenter and Edge v2.1 Open division benchmarks show what compound sparsity achieves for BERT-Large on the SQuAD v1.1 question answering task:

  • Maintaining >99% of the model's original F1 score (meaning <1% accuracy degradation)
  • Decreasing model size from 1.3 GB to ~10 MB
  • Improving throughput by roughly two orders of magnitude, from ~10 samples/second to up to 1,000 samples/second, when executed in the sparsity-aware DeepSparse Engine

Neural Magic's MLPerf Inference v2.1 submission results for the BERT-Large SQuAD v1.1 question answering task.
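
For a quick sense of scale, the rounded figures quoted in the list above can be turned into ratios. This is back-of-the-envelope arithmetic on those approximate numbers only, assuming 1 GB = 1,000 MB; the variable names are illustrative and the exact measurements are in the submission itself.

```python
import math

# Rounded figures quoted in the list above.
original_size_mb = 1.3 * 1000   # 1.3 GB, assuming 1 GB = 1,000 MB
compressed_size_mb = 10         # ~10 MB
baseline_throughput = 10        # ~10 samples/second
optimized_throughput = 1000     # up to 1,000 samples/second

compression_ratio = original_size_mb / compressed_size_mb
speedup = optimized_throughput / baseline_throughput
orders_of_magnitude = math.log10(speedup)

print(f"Model size reduction: ~{compression_ratio:.0f}x")
print(f"Throughput gain: up to ~{speedup:.0f}x "
      f"({orders_of_magnitude:.0f} orders of magnitude)")
```

On these rounded inputs, the model shrinks by roughly 130x and throughput rises by up to about 100x.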

More details on each of the models and methods can be found on GitHub under Neural Magic's BERT-Large DeepSparse MLPerf Submission.

Resources and Next Steps

Our research on compound sparsity is open-sourced and easily reproducible via SparseML. It’s applicable to NLP and computer vision models, including BERT, ResNet, YOLO, and YOLACT. With only a few lines of code, you can use transfer learning to apply our optimizations to tasks such as question answering, text classification, token classification, object detection, image classification, and image segmentation.

Our sparsity-aware DeepSparse Engine is freely available for community use. It delivers the best inference performance on commodity CPUs and it fits seamlessly into your existing deployment pipelines.

If you have any questions, get direct access to our engineering teams and the wider community in the Deep Sparse Community Slack.

To keep up with our mission of efficient software-delivered AI, please star our GitHub repos and subscribe to our monthly newsletter.