Jul 11, 2022
This blog post was edited in July 2022 to reflect more recent sparsification research, software updates, better performance numbers, and easier benchmarking and transfer learning flows.
In this post, we elaborate on how we sparsified ResNet-50 models up to 95% while retaining 99% of the baseline accuracy. Furthermore, we’ll show how we used these sparsified models to achieve GPU-class throughput and latency on commodity cloud CPUs. By the end of the post, you should be able to reproduce these benchmarks, apply the models to your own data via transfer learning, and deploy on CPUs at GPU speeds using the free and open-source tools available in the Neural Magic GitHub repo.
Intro to Sparsification: Pruning ResNet-50
Neural Magic’s DeepSparse Platform provides a suite of software tools to select, optimize, and deploy sparse deep learning models on commodity CPU resources. Taking advantage of “sparsification,” you can plug into the DeepSparse Engine in multiple ways to run sparse models like ResNet-50 at accelerated speeds on CPUs. So what is sparsification, and why should you care?
Sparsification is the process of taking a trained deep learning model and removing redundant information from the over-precise and over-parameterized network, resulting in a faster and smaller model. Techniques for sparsification are all-encompassing, including everything from inducing sparsity via pruning and quantization to enabling naturally occurring activation sparsity. When implemented correctly, these techniques produce significantly more performant and smaller models with limited to no effect on the baseline metrics. For example, as you will see shortly in our benchmarking exercise, pruning plus quantization can deliver over a 7.3x improvement in performance while recovering to nearly the same baseline accuracy. Sparsification also shrinks the model footprint: in the ResNet-50 example below, we reduced the model size from the original 90.3 MB to 3.3 MB while retaining 99% of the baseline accuracy!
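To make the idea of pruning concrete, here is a generic, minimal sketch (not Neural Magic’s recipe-driven flow) that uses PyTorch’s built-in magnitude pruning to zero out 95% of the convolutional weights in a torchvision ResNet-50 and then measures the resulting sparsity. One-shot pruning like this does not recover accuracy on its own; that is what the training-aware recipes described below are for.
import torch
from torch.nn.utils import prune
from torchvision.models import resnet50
# Generic illustration only: one-shot magnitude pruning of a dense ResNet-50.
# SparseML recipes instead prune gradually during training to recover accuracy.
model = resnet50(pretrained=True)
# Zero out the 95% smallest-magnitude weights in every convolutional layer.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.95)
        prune.remove(module, "weight")  # make the zeroed weights permanent
# Report the resulting overall parameter sparsity.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")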
The DeepSparse Platform builds on top of sparsification, enabling you to easily apply these techniques to your own datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. Simply put, you would:
- Download a sparsification recipe and sparsified model from the SparseZoo.
- Apply the recipe to the model with only a few lines of code using SparseML (see the sketch after this list).
- Finally, for GPU-level performance on CPUs, you can deploy your sparse-quantized model with the freely-available DeepSparse Engine.
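Putting the first two steps together, here is a minimal sketch of what applying a recipe looks like with SparseML’s PyTorch integration. The recipe path, learning rate, and steps_per_epoch are placeholders, and exact import paths can vary between SparseML versions; the maintained end-to-end flows live in the SparseML repo.
import torch
from torchvision.models import resnet50
from sparseml.pytorch.optim import ScheduledModifierManager
recipe_path = "recipe.yaml"        # placeholder: recipe downloaded from the SparseZoo
model = resnet50(pretrained=True)  # dense starting point
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
steps_per_epoch = 1000             # placeholder: len(train_loader) in a real run
# The manager parses the recipe and wraps the optimizer so that pruning and
# quantization modifiers fire on the schedule the recipe defines.
manager = ScheduledModifierManager.from_yaml(recipe_path)
optimizer = manager.modify(model, optimizer, steps_per_epoch=steps_per_epoch)
# ... standard PyTorch training loop using `model` and `optimizer` goes here ...
manager.finalize(model)            # remove modifier hooks once training completes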
ResNet-50 on CPUs: Benchmarking with the DeepSparse Engine
Approach
We started with the standard, dense ResNet-50 model and applied SOTA training-aware sparsification techniques using SparseML recipes. Specifically, we utilized the AC/DC pruning method, an algorithm developed by IST Austria in partnership with Neural Magic. This method doubled the achievable sparsity from the prior best of 10% non-zero weights down to 5%: 95% of the weights in the ResNet-50 model are pruned away while recovering to within 99% of the baseline accuracy.
| Sparsification | Precision | SparseZoo Model Stub |
| --- | --- | --- |
| Dense | FP32 | zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none |
| Pruned95 | FP32 | zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none |
| Pruned85 | INT8 | zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned85_quant-none-vnni |
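The stubs in the table are all you need to reference a model: SparseML and the DeepSparse Engine both accept them directly and handle the download. As a minimal sketch, here is how you might pull the ONNX file for the pruned-quantized model with the sparsezoo package; the Model class and onnx_model attribute reflect the sparsezoo 1.x API and may differ in other versions.
from sparsezoo import Model
stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned85_quant-none-vnni"
model = Model(stub)           # downloads the model files on first access
print(model.onnx_model.path)  # local path to the exported ONNX graph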
Hardware Setup and Environment
The DeepSparse Engine is completely infrastructure-agnostic, designed to plug in anywhere from edge deployments to model servers.
For this exercise, we ran the benchmarks on an AWS c5.12xlarge instance, which has a modern Intel CPU with support for AVX-512 Vector Neural Network Instructions (AVX-512 VNNI). VNNI accelerates INT8 workloads, making speedups of up to 4x possible when going from FP32 to INT8 inference.
We used Ubuntu 20.04.1 LTS as the operating system with Python 3.8.5. All the benchmarking dependencies are contained in the DeepSparse Engine package, which can be installed with:
pip3 install deepsparse
More details about the DeepSparse Engine and compatible hardware are available in the Neural Magic documentation.
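After installing, a quick sanity check from Python confirms the engine imports cleanly (a minimal check; the version printed depends on when you install):
import deepsparse
# Confirm the DeepSparse install; hardware details are printed when a model is compiled.
print(deepsparse.__version__)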
Benchmark Measurements
To keep things as simple as possible, the benchmark measures the full end-to-end time from handing an input batch to the engine to receiving the predicted output, with inputs and outputs in full FP32 precision.
We perform several warm-up iterations before measuring the time for each iteration to minimize noise affecting the final results.
Here is the timing section from deepsparse/engine.py:
start = time.time()    # record wall-clock time just before inference
out = self.run(batch)  # execute the compiled model on the input batch
end = time.time()      # record wall-clock time right after inference
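To put that snippet in context, below is a minimal, standalone sketch of the same methodology: compile a SparseZoo model into the engine, run warm-up iterations, then time repeated forward passes on a random FP32 batch. It illustrates the measurement approach and is not the deepsparse.benchmark implementation itself; the iteration counts are arbitrary.
import time
import numpy as np
from deepsparse import compile_model
stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned85_quant-none-vnni"
batch_size = 64
# Compile the model for this machine; inputs and outputs stay FP32 at the API boundary.
engine = compile_model(stub, batch_size=batch_size)
# Random FP32 input in NCHW layout (ImageNet-sized images).
batch = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]
# Warm-up iterations so one-time costs do not skew the timing.
for _ in range(10):
    engine.run(batch)
# Timed iterations: end-to-end time per batch, then derived items/sec.
times = []
for _ in range(50):
    start = time.time()
    engine.run(batch)
    times.append(time.time() - start)
ms_per_batch = 1000 * sum(times) / len(times)
print(f"{ms_per_batch:.2f} ms/batch, {batch_size * 1000 / ms_per_batch:.2f} items/sec")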
ResNet-50 v1 Throughput Results
For the throughput scenario, we used a batch size of 64 with random input using all available cores.
| Batch 64 | items/sec | ms/batch |
| --- | --- | --- |
| ONNXRuntime 1.11.0 | 324.71 | 197.08 |
| Dense FP32 | 351.12 | 182.26 |
| Pruned95 FP32 | 1057.77 | 60.49 |
| Pruned85 INT8 | 2590.45 | 24.69 |
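The speedups quoted earlier fall out of this table directly: relative to the dense FP32 model running in the DeepSparse Engine, the Pruned95 FP32 model is roughly 3x faster and the Pruned85 INT8 model is over 7.3x faster.
# Throughput speedups computed from the batch-64 results above (items/sec).
dense_fp32 = 351.12
pruned95_fp32 = 1057.77
pruned85_int8 = 2590.45
print(f"Pruned95 FP32 vs dense: {pruned95_fp32 / dense_fp32:.1f}x")  # ~3.0x
print(f"Pruned85 INT8 vs dense: {pruned85_int8 / dense_fp32:.1f}x")  # ~7.4x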
The following CLI command replicates the benchmark environment; swap in any SparseZoo model stub from the table above (the pruned-quantized stub is shown here):
deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned85_quant-none-vnni --batch_size 64 --scenario sync
ResNet-50 v1 Latency Results
For the latency scenario, we used a batch size of 1 with random input using all available cores.
| Batch 1 | items/sec | ms/batch |
| --- | --- | --- |
| ONNXRuntime 1.11.0 | 114.88 | 8.69 |
| Dense FP32 | 138.85 | 7.18 |
| Pruned95 FP32 | 273.38 | 3.65 |
| Pruned85 INT8 | 656.91 | 1.51 |
Try it Now: Benchmark ResNet-50
To replicate these results, follow the instructions below. Once you have procured the infrastructure, the exercise should take approximately 5 minutes.
- Reserve a c5.12xlarge instance on AWS; we used the Amazon Ubuntu 20.04 AMI
- Install the DeepSparse Engine by running
pip3 install deepsparse
- Run the CLI command for benchmarking:
deepsparse.benchmark [zoo model stub] --batch_size 64 --scenario sync
ResNet-50 on CPUs Next Step: Transfer Learn
You can apply your own data to sparse-quantized ResNet-50 models with a few lines of code using SparseML. To do so, visit our example on GitHub; a rough sketch of the flow is below.
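At a high level, sparse transfer learning starts from the sparsified ResNet-50 checkpoint, swaps in a classification head sized for your dataset, and fine-tunes while a transfer recipe keeps the pruned structure in place. The sketch below shows the shape of that flow; the checkpoint path, recipe, class count, and the assumption that the checkpoint keys match torchvision’s ResNet-50 are all placeholders, and the maintained GitHub example is the reference implementation.
import torch
from torchvision.models import resnet50
from sparseml.pytorch.optim import ScheduledModifierManager
num_classes = 10                           # placeholder: your dataset's class count
checkpoint_path = "pruned95_resnet50.pth"  # placeholder: sparse checkpoint from the SparseZoo
transfer_recipe = "transfer_recipe.yaml"   # placeholder: transfer-learning recipe from the SparseZoo
# Load the sparsified weights, then replace the final classifier for the new task.
model = resnet50()
model.load_state_dict(torch.load(checkpoint_path), strict=False)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
steps_per_epoch = 500                      # placeholder: len(train_loader) in a real run
# The transfer recipe holds the pruned (and quantized) structure fixed while the
# remaining weights and the new head are fine-tuned on your data.
manager = ScheduledModifierManager.from_yaml(transfer_recipe)
optimizer = manager.modify(model, optimizer, steps_per_epoch=steps_per_epoch)
# ... standard fine-tuning loop over your dataset goes here ...
manager.finalize(model)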
Conclusions
Sparse-quantized models like these ResNet-50 models deliver attractive performance for image classification and object detection use cases. With tools readily available on GitHub, models that combine pruning and quantization can achieve speedups upwards of 7.3x when run in the DeepSparse Engine on compatible hardware, as the results above show.
These wins do not stop with ResNet-50. Neural Magic is constantly pushing the boundaries of what’s possible with sparsification on new models and datasets across the computer vision and NLP domains. The results of these advancements are pushed into our open-source repos for all to benefit.
Resources and Learning More
- Software used in benchmarking: SparseML, DeepSparse Engine
- Sparse Transfer Learning with SparseML
- Join the Deep Sparse Community in Slack to get direct access to our engineering teams and other people looking to accelerate ML performance.
- Subscribe to Neural Magic Updates: Nerd out with us on ML Performance! (We keep the email manageable and do not share your details with anyone, ever.)