ResNet-50 on CPUs: Sparsifying for Better Performance on CPUs


In this post, we elaborate on how we measured, on commodity cloud hardware, the throughput and latency of five ResNet-50 v1 models optimized for CPU inference. By the end of the post, you should be able to reproduce these benchmarks using tools available in the Neural Magic GitHub repo, ultimately achieving better performance for ResNet-50 on CPUs.

Figure: ResNet-50 on CPUs Throughput (ResNet-50 v1 | Batch = 64 | AWS c5.12xlarge CPU)
Figure: ResNet-50 on CPUs Latency (ResNet-50 v1 | Batch = 1 | AWS c5.12xlarge CPU)

Today we are releasing support for ResNet-50, with YOLOv3 support coming in a few weeks, to be followed by BERT and other transformer models in coming months. We urge you to try unsupported models and report back to us through the GitHub Issue queue as we work hard to broaden our offering of sparse and sparse-quantized models.

For more info on ResNet, how it’s typically used, current limitations, and details on how Neural Magic initially made running ResNet models more performant and cost effective, see our previous post.

Intro to Sparsification: Pruning ResNet-50

Neural Magic’s Deep Sparse Platform provides a suite of software tools to select, build, and run sparse deep learning models on CPU resources. There are multiple ways to plug into the DeepSparse Engine, which takes advantage of “sparsification” to run sparse models like ResNet-50 at accelerated speeds on CPUs. So what is sparsification and why should you care?

Sparsification is the process of taking a trained deep learning model and removing redundant information from the over-precise and over-parameterized network, resulting in a faster and smaller model. Sparsification techniques are all-encompassing, covering everything from inducing sparsity through pruning and quantization to enabling naturally occurring activation sparsity. When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. For example, as you will see shortly in our benchmarking exercise, pruning plus quantization can give over 7x improvement in performance while recovering to nearly the same baseline accuracy.
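To make the two main techniques concrete, here is a minimal NumPy sketch of one-shot magnitude pruning followed by symmetric INT8 quantization on a random weight matrix. It is only an illustration of the general ideas: the function names, the 90% sparsity level, and the per-tensor quantization scheme are assumptions chosen for this example, not the recipes used to produce the SparseZoo models, which are applied during fine-tuning.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Indices of the k smallest |w|; argpartition avoids a full sort.
    drop = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
    pruned = weights.copy().ravel()
    pruned[drop] = 0.0
    return pruned.reshape(weights.shape)


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of FP32 weights to INT8."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


# Illustrative data only: a random 64x64 "layer".
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)

w_pruned = magnitude_prune(w, sparsity=0.9)
q, scale = quantize_int8(w_pruned)

print("fraction of zeros:", float(np.mean(w_pruned == 0)))
print("max dequantization error:", float(np.max(np.abs(q * scale - w_pruned))))
```

Pruned weights stay exactly zero after quantization, so the two techniques compose: the engine can skip the zeros and run the remaining arithmetic in INT8.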

The Deep Sparse Platform builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. Simply put, you would:

  1. Download a sparsification recipe and sparsified model from the SparseZoo.
  2. Alternatively, create a recipe for your model using Sparsify.
  3. Apply your recipe with only a few lines of code using SparseML.
  4. Finally, for GPU-level performance on CPUs, you can deploy your sparse-quantized model with the DeepSparse Engine.

Here’s the full Deep Sparse product flow and various paths to sparse acceleration. We will focus this discussion on the path of taking a SparseZoo model, namely the sparse-quantized ResNet-50, and benchmarking it with the DeepSparse Engine:

Diagram of Neural Magic's tools needed to reproduce sparse-quantized performance of ResNet-50 on CPUs

ResNet-50 on CPUs: Benchmarking with the DeepSparse Engine

Approach

To populate the SparseZoo, we started from a pre-trained baseline ResNet-50 from the torchvision models subpackage. Sparsified and quantized, the models were then fine-tuned using our replicable recipes to recover close to the baseline accuracy. 

We define three categories of recoverability to make it easy to understand the trade-offs made during the sparsification process: 

  • Conservative: accuracy maintained 100% of the baseline
  • Moderate: accuracy maintained >= 99% of the baseline
  • Aggressive: accuracy maintained >= 95% of the baseline
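The three thresholds above can be read as a simple classification rule. The sketch below assumes recovery is measured as the ratio of sparsified accuracy to baseline accuracy; the function name and the accuracy values are hypothetical and chosen only for illustration.

```python
def recovery_category(baseline_acc: float, sparse_acc: float) -> str:
    """Map an accuracy pair to the recoverability category it satisfies."""
    recovery = sparse_acc / baseline_acc
    if recovery >= 1.0:
        return "conservative"
    if recovery >= 0.99:
        return "moderate"
    if recovery >= 0.95:
        return "aggressive"
    return "below aggressive threshold"


# Hypothetical top-1 accuracies (percent), not measured results.
baseline = 76.1
for sparse in (76.1, 75.5, 72.5, 70.0):
    print(sparse, "->", recovery_category(baseline, sparse))
```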

Each model in the SparseZoo has a specific stub that identifies the category of recoverability. Visit SparseZoo docs on models to learn more about the stub structure:

Sparsification | Precision | SparseZoo Model Stub
Dense | FP32 | "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none"
Pruned Conservative | FP32 | "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative"
Pruned Moderate | FP32 | "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate"
Pruned Moderate | INT8 | "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate"
Pruned Aggressive | INT8 | "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-aggressive"
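As a rough illustration of how a stub breaks down, the following sketch splits one into its slash-separated parts. The field names are assumptions based on our reading of the stub layout (domain, sub-domain, architecture, framework, repo, dataset, sparsification); the SparseZoo docs remain the authoritative description, and `parse_stub` is a hypothetical helper, not part of the SparseZoo API.

```python
from typing import Dict


def parse_stub(stub: str) -> Dict[str, str]:
    """Split a 'zoo:' stub into named fields (field names are assumed)."""
    prefix = "zoo:"
    if not stub.startswith(prefix):
        raise ValueError(f"expected a stub starting with {prefix!r}: {stub!r}")
    parts = stub[len(prefix):].split("/")
    if len(parts) != 7:
        raise ValueError(f"expected 7 path segments, got {len(parts)}")
    domain, sub_domain, arch, framework, repo, dataset, sparse = parts
    # 'resnet_v1-50' -> architecture 'resnet_v1', sub-architecture '50'
    architecture, _, sub_architecture = arch.partition("-")
    # 'pruned_quant-moderate' -> name 'pruned_quant', category 'moderate'
    sparse_name, _, sparse_category = sparse.partition("-")
    return {
        "domain": domain,
        "sub_domain": sub_domain,
        "architecture": architecture,
        "sub_architecture": sub_architecture,
        "framework": framework,
        "repo": repo,
        "dataset": dataset,
        "sparse_name": sparse_name,
        "sparse_category": sparse_category,
    }


fields = parse_stub(
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate"
)
print(fields["sparse_name"], fields["sparse_category"])
```

The last segment is the one that carries the recoverability category, which is why the five stubs in the table differ only in that field.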

Hardware Setup and Environment

The DeepSparse Engine is completely infrastructure-agnostic, meant to plug in anywhere from edge deployments to model servers. As long as the machine has the “right” CPUs (80% of the entire Intel offering today) with the correct instruction set for performance, such as AVX-512, the engine can run on any cloud platform. The DeepSparse Engine will automatically utilize the most effective available instruction set for the task.

For this exercise, the benchmarks were run on an AWS c5.12xlarge instance, which has a modern Intel CPU with support for AVX-512 Vector Neural Network Instructions (AVX-512 VNNI). VNNI is designed to accelerate INT8 workloads, making up to 4x speedups possible going from FP32 to INT8 inference.
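If you are unsure whether a given Linux machine exposes these instruction sets, one quick way to check is to read the CPU flags from /proc/cpuinfo. The sketch below is a standalone check and is not how the DeepSparse Engine detects hardware; it assumes an x86 Linux host, where the kernel reports the flags `avx2`, `avx512f`, and `avx512_vnni`.

```python
from pathlib import Path
from typing import Dict, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def instruction_support(flags: Set[str]) -> Dict[str, bool]:
    """Report which of the relevant vector instruction sets are present."""
    return {
        "AVX2": "avx2" in flags,
        "AVX-512": "avx512f" in flags,
        "AVX-512 VNNI": "avx512_vnni" in flags,
    }


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # only present on Linux
    print(instruction_support(cpu_flags(cpuinfo.read_text())))
else:
    print("/proc/cpuinfo not found; this check only works on Linux")
```

A machine that reports AVX-512 but not AVX-512 VNNI can still run the INT8 models, but the FP32-to-INT8 speedup described above depends on VNNI being available.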

We used Ubuntu 20.04.1 LTS as the operating system with Python 3.8.5. All the benchmarking dependencies are contained in the DeepSparse Engine, which can be installed with:

pip3 install deepsparse

More details about DeepSparse Engine and compatible hardware are available.

You can find the Python script used to generate the DeepSparse numbers on the DeepSparse Engine GitHub repo.

Benchmark Measurements

Keeping this as simple as possible, the benchmark measures the full end-to-end time of giving an input batch to the engine and receiving predicted output, with full FP32 precision.

We perform several warm up iterations before measuring the time for each iteration to minimize noise affecting the final results.

Here is the full timing section from deepsparse/engine.py:

start = time.time()
out = self.run(batch)
end = time.time()
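The warm-up-then-measure pattern described above can be sketched as a small standalone harness. This is not the code in deepsparse/engine.py: the function name, the iteration counts, and the use of `time.perf_counter` (a monotonic clock, swapped in here for `time.time`) are assumptions made for the example, and the workload is a stand-in for an engine call.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    fn: Callable[[], object],
    batch_size: int,
    warmup_iters: int = 5,
    timed_iters: int = 20,
) -> Dict[str, float]:
    """Time `fn` end to end, discarding warm-up iterations."""
    for _ in range(warmup_iters):
        fn()  # results ignored; lets caches and threads settle

    times = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)

    mean_s = statistics.mean(times)
    return {
        "ms_per_batch": mean_s * 1000.0,
        "items_per_sec": batch_size / mean_s,
    }


# Stand-in workload; replace with a real inference call.
stats = benchmark(lambda: sum(range(100_000)), batch_size=64)
print(stats)
```

The two reported numbers are two views of the same measurement: items/sec is the batch size divided by the mean seconds per batch.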

ResNet-50 v1 Throughput Results

For the throughput scenario, we used a batch size of 64 with random input using all available cores. 

ResNet-50 on CPUs Throughput
Batch 64 | items/sec | ms/batch
ONNXRuntime 1.6.0 | 296.20 | 216.07
Dense FP32 | 323.48 | 197.85
Pruned Conservative FP32 | 711.78 | 89.92
Pruned Moderate FP32 | 828.78 | 77.22
Pruned Moderate INT8 | 2090.51 | 30.61
Pruned Aggressive INT8 | 2304.60 | 27.77
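To see where the "over 7x" figure comes from, the short script below computes each model's speedup relative to the Dense FP32 row and checks that the two columns agree with each other (ms/batch should equal 1000 × batch size ÷ items/sec). The numbers are copied from the throughput table; the helper names are ours.

```python
BATCH_SIZE = 64

# (items/sec, ms/batch) copied from the batch-64 throughput table.
RESULTS = {
    "ONNXRuntime 1.6.0": (296.20, 216.07),
    "Dense FP32": (323.48, 197.85),
    "Pruned Conservative FP32": (711.78, 89.92),
    "Pruned Moderate FP32": (828.78, 77.22),
    "Pruned Moderate INT8": (2090.51, 30.61),
    "Pruned Aggressive INT8": (2304.60, 27.77),
}


def speedups(results, baseline):
    """Throughput of each entry divided by the baseline's throughput."""
    base_ips = results[baseline][0]
    return {name: ips / base_ips for name, (ips, _) in results.items()}


def expected_ms_per_batch(items_per_sec, batch_size):
    """Convert throughput back into milliseconds per batch."""
    return 1000.0 * batch_size / items_per_sec


for name, ratio in speedups(RESULTS, "Dense FP32").items():
    ips, ms = RESULTS[name]
    check = expected_ms_per_batch(ips, BATCH_SIZE)
    print(f"{name:26s} {ratio:5.2f}x vs dense  (ms/batch {ms:.2f}, recomputed {check:.2f})")
```

Measured against the Dense FP32 row, the Pruned Aggressive INT8 model comes out a little above 7x; measured against the ONNXRuntime row, the ratio is higher still.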

This code block replicates the benchmark environment, where SPARSEZOO_MODEL_STUB is replaced with a stub from the SparseZoo model stub table above.

from deepsparse import benchmark_model
import numpy

batch_size = 64
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

results = benchmark_model(
    "SPARSEZOO_MODEL_STUB",
    sample_inputs,
    batch_size=batch_size,
)
print(results)

As an example substitution, this is the benchmark command for the Pruned Moderate FP32 ResNet-50:

results = benchmark_model(
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate",
    sample_inputs,
    batch_size=batch_size,
)
print(results)
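Rather than substituting stubs by hand, you could loop over all five. The sketch below takes the benchmarking function as a parameter so it can be exercised without DeepSparse installed; `benchmark_all` and `STUBS` are hypothetical names for this example, and the call shape `benchmark_fn(stub, sample_inputs, batch_size=...)` mirrors the `benchmark_model` usage shown in this post.

```python
from typing import Any, Callable, Dict, List

import numpy

STUBS: List[str] = [
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none",
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative",
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate",
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate",
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-aggressive",
]


def benchmark_all(
    benchmark_fn: Callable[..., Any],
    stubs: List[str],
    batch_size: int,
) -> Dict[str, Any]:
    """Run `benchmark_fn` once per stub on the same random input batch."""
    sample_inputs = [
        numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)
    ]
    return {
        stub: benchmark_fn(stub, sample_inputs, batch_size=batch_size)
        for stub in stubs
    }
```

With DeepSparse installed, passing `benchmark_model` (imported from `deepsparse`) as `benchmark_fn` with `batch_size=64` would correspond to the throughput scenario, and `batch_size=1` to the latency scenario.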

ResNet-50 v1 Latency Results

For the latency scenario, we used a batch size of 1 with random input using all available cores. 

ResNet-50 on CPUs Latency
Batch 1 | items/sec | ms/batch
ONNXRuntime 1.6.0 | 109.50 | 9.13
Dense FP32 | 145.33 | 6.88
Pruned Conservative FP32 | 197.29 | 5.07
Pruned Moderate FP32 | 238.70 | 4.19
Pruned Moderate INT8 | 510.17 | 1.96
Pruned Aggressive INT8 | 561.57 | 1.78

This code block replicates the benchmark environment, where SPARSEZOO_MODEL_STUB is replaced with a stub from the SparseZoo model stub table above.

from deepsparse import benchmark_model
import numpy

batch_size = 1
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

results = benchmark_model(
    "SPARSEZOO_MODEL_STUB",
    sample_inputs,
    batch_size=batch_size,
)
print(results)

Try it Now: Benchmark ResNet-50

To replicate these results, follow the instructions below. Once you have procured infrastructure, it should take you approximately 15 to 30 minutes to run through this exercise.

1. Reserve a c5.12xlarge instance on AWS; we used the Amazon Ubuntu 20.04 AMI

2. Install pip and venv, if they aren’t already installed, with

sudo apt update && sudo apt install python3-pip python3-venv

3. Create and activate a virtual environment for Python

python3 -m venv benchmark-env && source benchmark-env/bin/activate

4. Install the DeepSparse Engine by running

pip3 install deepsparse

5. Clone the DeepSparse Engine repository; it will include the benchmarking script for reproducing ResNet-50 numbers:

git clone https://github.com/neuralmagic/deepsparse.git

6. Replicate the throughput and latency scenarios by running the Python scripts: 

python3 deepsparse/examples/benchmark/resnet50_benchmark.py --batch_size=64

python3 deepsparse/examples/benchmark/resnet50_benchmark.py --batch_size=1

Both runs will download the various ResNet-50 sparse-quantized models from the SparseZoo, benchmark them for the given batch size, and print out the results of the iterations as follows:

ResNet-50 sparse-quantized script

Conclusions

Sparse-quantized models like our ResNet-50 models provide attractive performance results for those with image classification and object detection use cases. As the results show, with tools readily available on GitHub, models that use techniques like pruning and quantization can achieve speedups upwards of 7x when run with the DeepSparse Engine on compatible hardware.

These wins do not stop with ResNet-50. Neural Magic is constantly pushing the boundaries of what’s possible with sparsification on new models and datasets. The results of these advancements are pushed into our open-source repos for all to benefit from, including new, performant models consistently being added to the SparseZoo and new techniques being added to Sparsify and SparseML to work with your own models.

ResNet-50 on CPUs Next Step: Transfer Learn

To transfer learn ResNet-50 to your data, visit our example in GitHub.

Resources and Learning More