YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance and a Smaller Footprint
Prune and quantize YOLOv5 for a 10x increase in performance with 12x smaller model files.
Neural Magic improves YOLOv5 model performance on CPUs by using state-of-the-art pruning and quantization techniques combined with the DeepSparse Engine. In this blog post, we’ll cover our general methodology and demonstrate how to:
- Leverage the Ultralytics YOLOv5 repository with SparseML’s sparsification recipes to create highly pruned and INT8 quantized YOLOv5 models;
- Train YOLOv5 on new datasets to replicate our performance with your own data leveraging pre-sparsified models in the SparseZoo;
- Reproduce our benchmarks using the aforementioned integrations and tools linked from the Neural Magic YOLOv5 model page.
We held a live discussion on August 31, centered around these three topics. You can view the recording here.
We have previously released support for ResNet-50 and YOLOv3 showing 7x and 6x better performance over CPU implementations, respectively. Today we are officially supporting YOLOv5, to be followed by BERT and other popular models in the coming weeks.
Achieving GPU-Class Performance on CPUs
In June of 2020, Ultralytics iterated on the YOLO object detection models by creating and releasing the YOLOv5 GitHub repository. The new iteration added novel contributions such as the Focus convolutional block and more standard, modern practices like compound scaling, among others, to the very successful YOLO family. The iteration also marked the first time a YOLO model was natively developed inside of PyTorch, enabling faster training at FP16 and quantization-aware training (QAT).
The new developments in YOLOv5 led to faster and more accurate models on GPUs, but added additional complexities for CPU deployments. Compound scaling (changing the input size, depth, and width of the networks simultaneously) resulted in small, memory-bound networks such as YOLOv5s along with larger, more compute-bound networks such as YOLOv5l. Furthermore, the post-processing and Focus blocks took a significant amount of time to execute due to memory movement for YOLOv5s and slowed down YOLOv5l, especially at larger input sizes. Therefore, to achieve breakthrough performance for YOLOv5 models on CPUs, additional ML and systems advancements were required.
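The Focus block mentioned above can be illustrated by its core space-to-depth rearrangement. Below is a minimal plain-Python sketch (illustrative only; the actual YOLOv5 Focus module performs this slicing on tensors and follows it with a convolution):

```python
# Sketch of the Focus block's space-to-depth slicing: every other pixel is
# gathered into four half-resolution sub-images, which YOLOv5 stacks along
# the channel dimension before a convolution. Spatial size halves while
# channel count quadruples, which trades memory movement for cheaper compute.
def focus_slice(image):
    """Split an HxW single-channel image (list of lists) into four
    H/2 x W/2 sub-images, mimicking x[..., ::2, ::2] and friends."""
    return [
        [row[0::2] for row in image[0::2]],  # even rows, even cols
        [row[0::2] for row in image[1::2]],  # odd rows,  even cols
        [row[1::2] for row in image[0::2]],  # even rows, odd cols
        [row[1::2] for row in image[1::2]],  # odd rows,  odd cols
    ]

# A 4x4 image becomes four 2x2 "channels".
image = [[r * 4 + c for c in range(4)] for r in range(4)]
slices = focus_slice(image)
```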
Deployment performance between GPUs and CPUs was starkly different until today. Taking YOLOv5l as an example, at batch size 1 and 640×640 input size, there is more than a 10x gap in performance:
- A T4 FP16 GPU instance on AWS running PyTorch achieved 59.3 items/sec.
- A 24-core C5 CPU instance on AWS running ONNX Runtime achieved 5.8 items/sec.
The good news is that there’s a surprising amount of power and flexibility on CPUs; we just need to utilize it to achieve better performance.
To show how a different systems approach can boost performance, we swapped ONNX Runtime for the DeepSparse Engine. The DeepSparse Engine includes proprietary advancements that map the YOLOv5 model architectures onto the strengths of CPU hardware, executing depth-wise through the network to leverage the large caches available on CPUs. Using the same 24-core setup that we used with ONNX Runtime on the dense FP32 network, DeepSparse boosts base performance to 17.7 items/sec, a 3x improvement. This excludes additional performance gains we expect from new algorithms currently under active development. More to come in the next few releases; stay tuned.
The dense FP32 result on the DeepSparse Engine is a notable improvement, but it is still over 3x slower than the T4 GPU. So how do we close that gap to get to GPU-level performance on CPUs? Since the network is now largely compute-bound, we can leverage sparsity to gain additional performance improvements. Using SparseML's recipe-driven approach for model sparsification, plus a lot of research into pruning deep learning networks, we successfully created highly sparse and INT8 quantized YOLOv5l and YOLOv5s models. Plugging the sparse-quantized YOLOv5l model back into the same setup with the DeepSparse Engine, we are able to achieve 52.6 items/sec: 9x better than ONNX Runtime and nearly the same level of performance as the best available T4 implementation.
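To illustrate the core idea behind the pruning step, here is a minimal, hypothetical sketch of one-shot unstructured magnitude pruning in plain Python. This is not the SparseML API; SparseML's recipes apply pruning gradually over many training epochs, but the selection rule (zero the smallest-magnitude weights) is the same:

```python
# Sketch of unstructured magnitude pruning: zero out the smallest-magnitude
# weights until a target sparsity fraction is reached. Gradual magnitude
# pruning, as used in SparseML recipes, repeats this during training so the
# network can recover accuracy between pruning steps.
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the `sparsity` fraction of
    smallest-magnitude entries set to zero."""
    n_prune = int(len(weights) * sparsity)
    # Rank indices by absolute value; the smallest n_prune get zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.3, -0.2, 0.6, 0.02, -0.8]
pruned = magnitude_prune(weights, 0.8)  # 80% sparsity, near YOLOv5l's ~79%
```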
A Deep Dive into the Numbers
There are three benchmarked variations of YOLOv5s and YOLOv5l:
- Baseline (dense FP32);
- Pruned (FP32);
- Pruned-quantized (INT8).
The mAP at an IoU of 0.5 on the COCO validation set is reported for these models in Table 1 below (higher is better). Another benefit of pruning and quantization is that together they produce smaller files for deployment. The compressed file size for each model was also measured and is reported in Table 1 (lower is better). These models are then referenced in the later sections with full benchmark numbers for the different deployment setups.
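For reference, the [email protected] metric counts a detection as correct when the intersection over union (IoU) between the predicted and ground-truth boxes is at least 0.5. A minimal sketch of the IoU computation:

```python
# Sketch of the IoU (intersection over union) computation behind [email protected]:
# a prediction matches a ground-truth box when IoU >= 0.5.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) corner coordinates with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred = (0.0, 0.0, 10.0, 10.0)
truth = (5.0, 5.0, 15.0, 15.0)
overlap = iou(pred, truth)  # 25 / 175, below the 0.5 threshold
```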
The benchmark numbers below were run on readily available servers in AWS. The code to benchmark and create the models is open sourced in the DeepSparse and SparseML repos, respectively. Each benchmark measures end-to-end time, from pre-processing through model execution to post-processing. To generate accurate numbers for each system, 25 warmup runs were performed, and the average of the subsequent 80 measurements is reported. Results are recorded in items per second (items/sec), where larger is better.
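The warmup-and-average protocol described above can be sketched as follows; `run_inference` here is a hypothetical stand-in for the full pre-process, execute, post-process pipeline of whichever engine is under test:

```python
import time

# Sketch of the benchmarking protocol: 25 warmup runs, then the average of
# 80 timed measurements, reported as items/sec. The warmups let caches,
# thread pools, and clock frequencies settle before timing begins.
def benchmark(run_inference, batch_size, warmups=25, measurements=80):
    for _ in range(warmups):
        run_inference()
    times = []
    for _ in range(measurements):
        start = time.perf_counter()
        run_inference()
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)
    return batch_size / avg  # items per second

# Illustrative only: a fake 1 ms "inference" call in place of a real engine.
items_per_sec = benchmark(lambda: time.sleep(0.001), batch_size=1)
```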
The CPU servers and core counts for each use case were chosen to ensure a balance between different deployment setups and pricing. Specifically, the AWS C5 servers were used as they are designed for computationally intensive workloads and include both AVX512 and VNNI instruction sets. Due to the general flexibility of CPU servers, the number of cores can be varied to better fit the exact deployment needs, enabling the user to balance performance and cost with ease. And to state the obvious, CPU servers are more readily available and models can be deployed closer to the end-user, cutting out costly network time.
Unfortunately, the common GPUs available in the cloud do not have support for speedup using unstructured sparsity. This is due to a lack of both hardware and software support and is an active research area. As of this writing, the new A100s do have hardware support for semi-structured sparsity but are not readily available. When support becomes available, we will update our benchmarks while continuing to release accurate, cheaper, and more environmentally friendly neural networks through model sparsification.
| Model Type | Sparsity | Precision | [email protected] | File Size (MB) |
|---|---|---|---|---|
| YOLOv5l Pruned Quantized | 79.2% | INT8 | 62.3 | 11.7 |
| YOLOv5s Pruned Quantized | 68.2% | INT8 | 52.5 | 3.1 |

Table 1: [email protected] and compressed file size for the sparse-quantized YOLOv5 models.
For latency measurements, we use batch size 1 to represent the fastest time an image can be detected and returned. A 24-core, single-socket AWS server is used to test the CPU implementations. Table 2 below displays the measured values (and the source for Figure 1). We can see that combining the DeepSparse Engine with the pruned and quantized models improves the performance over the next best CPU implementation. Compared to PyTorch running the pruned-quantized model, DeepSparse is 6-7x faster for both YOLOv5l and YOLOv5s. Compared to GPUs, pruned-quantized YOLOv5l on DeepSparse matches the T4, and YOLOv5s on DeepSparse is 2.5x faster than the V100 and 1.5x faster than the T4.
| Inference Engine | Device | Model Type | YOLOv5l items/sec | YOLOv5s items/sec |
|---|---|---|---|---|
| PyTorch GPU | T4 FP32 | Base | 26.8 | 77.9 |
| PyTorch GPU | T4 FP16 | Base | 59.3 | 75.4 |
| PyTorch GPU | V100 FP32 | Base | 37.4 | 46.3 |
| PyTorch GPU | V100 FP16 | Base | 38.5 | 44.6 |
| PyTorch CPU | 24-Core | Pruned Quantized | 7.8 | 16.6 |
| ONNX Runtime CPU | 24-Core | Base | 5.8 | 15.2 |
| ONNX Runtime CPU | 24-Core | Pruned | 5.8 | 15.2 |
| ONNX Runtime CPU | 24-Core | Pruned Quantized | 5.4 | 14.9 |

Table 2: Batch size 1 latency performance (items/sec).
For throughput measurements, we use batch size 64 to represent a typical batched use case. A batch size of 64 was also sufficient to fully saturate GPU and CPU performance in our testing. A 24-core, single-socket AWS server was again used to test the CPU implementations. Table 3 below displays the measured values. We can see that the V100 numbers are hard to beat; however, pruning and quantization combined with DeepSparse beat out the T4 performance. The combination also beats the next best CPU numbers by 16x for YOLOv5l and 10x for YOLOv5s!
| Inference Engine | Device | Model Type | YOLOv5l items/sec | YOLOv5s items/sec |
|---|---|---|---|---|
| PyTorch GPU | T4 FP32 | Base | 26.9 | 88.8 |
| PyTorch GPU | T4 FP16 | Base | 78.0 | 179.1 |
| PyTorch GPU | V100 FP32 | Base | 113.1 | 239.9 |
| PyTorch GPU | V100 FP16 | Base | 215.9 | 328.9 |
| PyTorch CPU | 24-Core | Pruned Quantized | 6.0 | 18.5 |
| ONNX Runtime CPU | 24-Core | Base | 4.7 | 12.7 |
| ONNX Runtime CPU | 24-Core | Pruned | 4.7 | 12.7 |
| ONNX Runtime CPU | 24-Core | Pruned Quantized | 4.6 | 12.5 |

Table 3: Batch size 64 throughput performance (items/sec).
Replicate with Your Own Data
While the benchmarking results above are noteworthy, Neural Magic has not seen many deployed models trained on the COCO dataset. Furthermore, deployment environments vary from private clouds to multi-cloud setups. Below we walk through additional assets and the general steps to transfer the sparse models onto your own datasets and to benchmark the models on your own deployment hardware.
Sparse Transfer Learning
Sparse transfer learning research is still ongoing; however, interesting results building off of the lottery ticket hypothesis have been published over the past few years. Papers highlighting results for computer vision and natural language processing show sparse transfer learning ranging from matching pruning from scratch on the downstream task to outperforming dense transfer learning.
In this same vein, we’ve published a tutorial on how to transfer learn from the sparse YOLOv5 models onto new datasets. It’s as simple as checking out the SparseML repository, running the setup for the SparseML and YOLOv5 integration, and then kicking off a command-line command with your data. The command downloads the pre-sparsified model from the SparseZoo and begins training on your dataset. An example that transfers from the pruned quantized YOLOv5l model is given below:
python train.py --data voc.yaml --cfg ../models/yolov5l.yaml --weights zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned_quant-aggressive_95?recipe_type=transfer --hyp data/hyp.finetune.yaml --recipe ../recipes/yolov5.transfer_learn_pruned_quantized.md
To reproduce our benchmarks and check DeepSparse performance on your own deployment, the code is provided as an example in the DeepSparse repo. The benchmarking script supports YOLOv5 models using DeepSparse, ONNX Runtime (CPU) and PyTorch GPU.
For a full list of options run:
python benchmark.py --help
As an example, to benchmark DeepSparse’s pruned-quantized YOLOv5l performance on your VNNI-enabled CPU, run:
python benchmark.py zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned_quant-aggressive_95 --batch-size 1 --quantized-inputs
The DeepSparse Engine combined with SparseML’s recipe-driven approach enables GPU-class performance for the YOLOv5 family of models. Inference performance improved 6-7x for latency and 16x for throughput on YOLOv5l as compared to other CPU inference engines. The transfer learning tutorial and benchmarking example enable straightforward evaluation of the performant models on your own datasets and deployments, so you can realize these gains for your own applications.
These noticeable wins do not stop with YOLOv5. We will keep maximizing what is possible with sparsification and CPU deployments through higher sparsities, better high-performance algorithms, and cutting-edge multicore programming developments. The results of these advancements will be pushed into our open-source repos for all to benefit. Stay current by starring our GitHub repository or subscribing to our monthly ML performance newsletter here.
We urge you to try unsupported models and report back to us through the GitHub Issue queue as we work hard to broaden our sparse and sparse-quantized model offerings. And to interact with our product and engineering teams, along with other Neural Magic users and developers interested in model sparsification and accelerating deep learning inference performance, join our Slack or Discourse communities.