Gaurav Rao
09/28/20

Speeding Up Memory-Bound Object Detection Models: MobileNetV2_SSD

TL;DR: Learn how to increase performance for MobileNetV2_SSD models via pruning and reduced post-processing time.

Read time: 3 minutes, 15 seconds

In many object detection scenarios, there’s not a moment to lose. A fraction of a second can mean the difference between a self-driving car hitting a dog crossing the street or narrowly missing it. Both speed and accuracy are crucial. MobileNetV2_SSD models were made for real-time object detection use cases where accuracy counts. Using Neural Magic’s software, these memory-bound models can run much faster and take up less storage space on CPUs – making them easier and cheaper to execute in production.

This blog post will provide a brief overview of MobileNetV2_SSD models and how they’re used. We’ll also discuss the model optimization techniques we apply to MobileNetV2_SSD to deliver better performance, such as pruning and decreasing post-processing time. These optimizations allow the model to run as fast as SSDLite, with nearly the same storage size – without sacrificing accuracy. If you’re interested in similar articles on improving machine learning performance, check out our ResNet and MobileNetV2 posts.

What Is MobileNetV2_SSD?

Object detection models are built to detect objects of a certain class (e.g. cars, people, animals, buildings) within images. SSD (which stands for “single shot detector”) is designed for real-time object detection. The model eliminates the separate region proposal stage used by two-stage detectors and operates on lower-resolution images to run at real-time speed. SSD is a naturally sparse model that relies on memory-bound operations, such as lightweight depthwise convolutions. The model is part of the TensorFlow Object Detection API.

MobileNetV2 is a good backbone for SSD. Simpler networks like MobileNet make feature extraction and convolutional filtering for image detection more efficient, so at a given accuracy it will run faster than, for example, a ResNet backbone. In addition to SSD, MobileNetV2 can also be paired with YOLOv3 (“You Only Look Once”), another popular real-time object detection model.

MobileNetV2_SSD is ideal for environments with constrained compute. It runs exceptionally well on CPUs instead of costly hardware accelerators like GPUs. The SSDLite version achieves even faster speeds, but at the expense of accuracy. One challenge of smaller models like SSD is slow post-processing times, but we’ll explain more about that later.

Why Use Neural Magic

As discussed earlier, Neural Magic’s machine learning tools can be used to optimize MobileNetV2_SSD models. Even though MobileNetV2_SSD is a naturally sparse model, it can still be pruned. Neural Magic uses state-of-the-art optimization techniques to sparsify the model, lowering the required storage space and improving performance, without sacrificing baseline accuracy. Quantization support, which enables more efficient computation and faster speeds, is on our 2020 roadmap.
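To make the idea of pruning concrete, here is a minimal NumPy sketch of magnitude pruning, the common technique of zeroing out the smallest-magnitude weights in a layer. This is an illustrative example only, not Neural Magic's implementation; the function name and the 80% sparsity target are assumptions for demonstration.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity`
    fraction of the tensor is zero (hypothetical helper for illustration)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune a conv layer's weights to ~80% sparsity
w = np.random.randn(3, 3, 32, 64)
pruned = magnitude_prune(w, 0.8)
print(np.mean(pruned == 0))  # approximately 0.8
```

In practice, pruning is applied gradually during fine-tuning so the remaining weights can adapt and recover baseline accuracy, rather than in one shot as sketched here.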

The Neural Magic Inference Engine is perfectly suited for running memory-bound models like MobileNet in production. The Inference Engine’s proprietary algorithms improve model performance in two ways: 

  • Using CPU memory more efficiently: Inference Engine works by optimizing how a neural network is executed across the available memory hierarchies in a CPU. The associated engine algorithms identify memory-bound processes within the network – such as depthwise convolutions – and apply optimization techniques to accelerate performance of those components.
  • Cutting down on post-processing time: Non-maximum suppression (NMS) is a post-processing technique that merges all detections belonging to the same object. Inference Engine speeds up NMS with smarter algorithms, making the entire run faster than any other provider’s.
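For readers unfamiliar with NMS, here is a minimal NumPy sketch of the standard greedy algorithm: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. This illustrates the technique being accelerated, not Neural Magic's proprietary implementation; the IoU threshold of 0.5 is a conventional default, assumed here.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of kept boxes, highest-scoring first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box above the threshold
        order = order[1:][iou <= iou_threshold]
    return keep

# Two heavily overlapping detections plus one separate object
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the duplicate box 1 is suppressed
```

Because NMS runs on every frame after the network itself finishes, its cost is a meaningful share of total latency for small, fast models like MobileNetV2_SSD, which is why optimizing it matters.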

Benchmarking MobileNetV2_SSD with the Neural Magic Inference Engine

As you can see below, Neural Magic outperforms other CPU providers, even when using fewer CPU cores. This is due to the model optimizations and our proprietary algorithms’ efficient use of CPU memory, as discussed above.

Similarly, Neural Magic optimizations and algorithms deliver 2.7x better performance at batch size 1 when using an 18-core CPU vs. a V100 GPU. The numbers are even more striking at batch size 64, where Neural Magic outperforms a V100 GPU by 3.95x.

Let’s look at cost savings. To get cost-per-inference savings numbers, we looked at the cost of AWS on-demand instances for a baseline CPU deployment vs. the performance gains achieved on Neural Magic Inference Engine.
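The arithmetic behind a cost-per-inference comparison is straightforward: divide the instance's hourly price by the number of inferences it completes per hour. The sketch below shows the calculation shape only; the prices and throughputs are illustrative placeholders, not the benchmark figures from this post.

```python
def cost_per_million(hourly_price_usd: float, throughput_ips: float) -> float:
    """Cost in USD to serve one million inferences at a given
    inferences-per-second throughput."""
    inferences_per_hour = throughput_ips * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000

# Placeholder numbers for illustration only (not measured results):
# same on-demand instance price, 3x throughput from software optimization.
baseline = cost_per_million(hourly_price_usd=0.90, throughput_ips=40)
optimized = cost_per_million(hourly_price_usd=0.90, throughput_ips=120)
print(f"baseline ${baseline:.2f} vs optimized ${optimized:.2f} per 1M inferences")
```

Since the instance price stays fixed, any throughput gain translates directly into a proportional cost-per-inference saving.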

If you like what you see and would like to dig deeper, pursue one of the next steps: